The role of SRE has been debated and augmented for years now. We’ve seen the AIOps movement come and go, the NoOps movement cause some serious debate, and the rise of the platform team take hold over the past few years. That said, no one could have expected the speed of the recent AI boom, and it’s only natural that site reliability engineering is significantly impacted. From firefighting to being on call to managing complex systems, AI has a clear role to play in helping companies make their services more reliable.
And while few businesses are ready to totally offload their production reliability and customer success to a third-party AI vendor, we are at a stage in the AI hype cycle where, amid all the hype and fluff, there is real value in root cause analysis, incident response, making alerts actionable, and other areas. Here, we highlight five AI SRE startups really making a difference.
1. Causely
Causely is an AI SRE that specializes in root cause analysis powered by its causal reasoning system. Now more than ever, engineering teams are drowning in too much data and too many alerts. Open source projects like OpenTelemetry and observability vendors alike are indispensable in the modern engineering stack, but the fact remains that they generate a lot of noise and cost for businesses, and a lot of manual toil to make sense of that noise. A platform like Causely, founded by an industry veteran with two prior startups in IT Operations, is a natural and critical response to the rise of AI code-generation tools that ship code faster than humans can reasonably understand or manage it.
2. Resolve.ai
Resolve.ai focuses on automating incident response and resolution for complex systems. Their platform uses machine learning to predict potential failures before they impact users and automates troubleshooting workflows. This reduces downtime and manual intervention, with the goal of keeping AI systems resilient and available.
3. Ciroos
Ciroos offers a comprehensive observability platform tailored for AI workloads. It consolidates logs, metrics, and traces from distributed AI infrastructure, providing real-time visibility into system health. Their tools facilitate faster debugging, capacity planning, and performance optimization, which are critical for maintaining reliable services.
4. Traversal
Traversal is dedicated to enhancing the scalability of AI models and infrastructure. They provide solutions for dynamic resource provisioning, load balancing, and traffic management, ensuring modern applications can handle real-time demand without sacrificing reliability. Their approach helps organizations deploy AI solutions that are both robust and flexible.
5. Parity
Parity, which places a particular emphasis on Kubernetes and incident response, allows teams to simulate real-world scenarios, perform stress testing, and validate model robustness under various conditions. The aim is for AI systems to operate safely and dependably in production environments.
These startups are leading the way in AI SRE. Of course, established players like Datadog will offer solutions in this category to check the box, but the real innovation will be amongst the startups cutting their teeth in the new frontier of agentic and autonomous systems.