DevOps + SRE Story

There is confusion around the terms DevOps and SRE (Site Reliability Engineering). Indeed, there is enough confusion to cause debate about what each actually is, whether they are new, and what the differences between them are.

This blog is about DevOps and its widespread adoption in Enterprise IT, but there are enough underlying similarities between the two that it seemed odd to mention one without the other.

So what is DevOps?

In some ways it’s fairly easy to answer: DevOps is cross-team collaboration between Development and Operations teams. That already hints at a different kind of organisation from the traditional IT organisation, where software is thrown over the fence to Operations, who then have to manage it. But that answer does not go nearly far enough, and my view is that this is not the most important question.

We should instead be more interested in:

What problems does DevOps help us solve, and why haven’t we solved them already?
How does it do this?
What measurable outcomes are being achieved in the wild?

Let’s understand the modern context. In today’s economy, IT companies are primarily concerned with growing fast at an acceptable level of risk in the presence of uncertainty. High-performing IT companies tend to exert market pressure on their more traditional competitors by being more adept at continually responding to the changing demands of their customers without compromising quality.

DevOps is an approach that has emerged in high-performing IT companies to help address these challenges. It takes a holistic view of the entire IT value chain and suggests concrete practices and organisational structures for making that chain faster and more effective.

Many important practices emerge as a result, with the following two being of particular interest:

Optimise the entire IT value chain. Practices such as Continuous Integration (CI) and Continuous Deployment (CD) are often employed to help achieve this. The result is some form of automated Development or DevOps pipeline. Such pipelines rely heavily on automated testing with a high level of coverage.
Never propagate failures downstream. CI/CD pipelines often have multiple gates, each of which seeks to catch failures as early as possible and block the change from progressing any further.
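To make the second practice concrete, here is a minimal sketch of a gated pipeline in Python. It is not tied to any particular CI/CD product; the Gate and run_pipeline names are hypothetical and chosen purely for illustration. The point is only that gates run in order and the first failure stops everything behind it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Gate:
    """One stage of a hypothetical pipeline."""
    name: str
    check: Callable[[], bool]  # returns True when the gate passes


def run_pipeline(gates: List[Gate]) -> List[str]:
    """Run gates in order and stop at the first failure.

    Later gates are never executed once one fails, so a failure
    cannot propagate downstream.
    """
    log = []
    for gate in gates:
        passed = gate.check()
        log.append(f"{gate.name}: {'pass' if passed else 'FAIL'}")
        if not passed:
            log.append("pipeline stopped")
            break
    return log


if __name__ == "__main__":
    # Illustrative gates only; real checks would invoke test runners.
    demo = [
        Gate("unit tests", lambda: True),
        Gate("integration tests", lambda: False),
        Gate("deploy to staging", lambda: True),
    ]
    for line in run_pipeline(demo):
        print(line)
```

In the demo the "deploy to staging" gate is never reached, because the integration tests fail before it.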

How can ITRS help with any of this?

Traditionally, ITRS is a provider of real-time enterprise monitoring solutions for the always-on enterprise, and as such has focused mainly on solutions for IT Operations (i.e. in the context of SRE). With the increasing adoption of DevOps across the enterprise, we realise we can provide a better, more holistic customer experience if we also address the needs of Developers and Development pipelines.

Infrastructure and application Observability is a key requirement in complex, often hybrid environments (a mix of on-premise, private and public cloud), and SRE teams need tooling which can provide it effectively. Observability is generally thought of as a convergence of different kinds of data (metrics, logs, events, traces) which together can lead to greater insight into system health and performance, as well as aiding root cause analysis and the prediction of failures before they occur.

In order for systems to be Observable in production, Observability must be considered early in Development pipelines rather than being relegated to an afterthought and left as the responsibility of SRE teams. Yet the complex tools that SRE teams rely on are generally not available, or at least not easily available, in Development environments. Our view is that since Observability is such a key requirement for SRE teams, applications which do not meet Observability requirements should not be deployed into production. That is, we see Observability being checked in CI/CD pipelines in the same way that automated tests are today.
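As a rough illustration of what an Observability gate might look like, the sketch below checks that an application exposes a required set of metrics before it is allowed to progress. It assumes, purely for the example, that the application publishes its metrics in the Prometheus text exposition format; the function names and the metric names in the test data are hypothetical and are not taken from any ITRS product.

```python
from typing import Iterable, List, Set


def exposed_metric_names(exposition: str) -> Set[str]:
    """Extract metric names from Prometheus-style exposition text.

    Comment lines (starting with '#') and blank lines are skipped.
    A metric name ends at the first '{' (labels) or whitespace (value).
    """
    names = set()
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("{", 1)[0].split()
        if parts:
            names.add(parts[0])
    return names


def missing_metrics(exposition: str, required: Iterable[str]) -> List[str]:
    """Return the required metric names the application does not expose."""
    return sorted(set(required) - exposed_metric_names(exposition))


def gate_exit_code(exposition: str, required: Iterable[str]) -> int:
    """0 if every required metric is present, 1 otherwise."""
    return 1 if missing_metrics(exposition, required) else 0
```

A pipeline step could fetch the application's metrics output in a test environment, pass it to gate_exit_code along with the list of metrics the SRE team requires, and fail the build on a non-zero result.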

For Observability checks to become part of the pipeline in this way, Developers need tools which:

Are easy to install/use during Development
Provide similar capabilities to the tools used in production by SRE teams
Can be used to achieve some form of automated observability

Despite the trend towards Dev/Prod parity enabled by DevOps and “infrastructure as code” practices, one often-overlooked aspect during Development is application performance under the kind of load expected in Production. Automated performance testing is just as critical as functional testing, and so should be treated as a first-class concern that can likewise be integrated into CI/CD pipelines.
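A performance gate can follow the same pattern as a functional one: generate load, measure, and compare against a budget. The following is a minimal sketch using only the Python standard library. The function names, the choice of a nearest-rank percentile and the idea of a single latency budget are all assumptions made for illustration; a real load test would target the deployed application rather than an in-process callable.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def timed_call(operation: Callable[[], object]) -> float:
    """Run one operation and return its duration in seconds."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def run_load(operation: Callable[[], object], requests: int,
             concurrency: int) -> List[float]:
    """Issue `requests` calls using `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed_call, operation) for _ in range(requests)]
        return [f.result() for f in futures]


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of the samples (pct between 1 and 100)."""
    if not samples:
        raise ValueError("no samples collected")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def within_budget(samples: Sequence[float], pct: int,
                  budget_seconds: float) -> bool:
    """True when the chosen percentile latency meets the budget."""
    return percentile(samples, pct) <= budget_seconds


if __name__ == "__main__":
    # Stand-in workload; replace with a call to the system under test.
    latencies = run_load(lambda: time.sleep(0.001), requests=50, concurrency=5)
    print("p95 seconds:", percentile(latencies, 95))
    print("within 50 ms budget:", within_budget(latencies, 95, 0.05))
```

Gating on a high percentile rather than the mean is a common choice because a handful of slow requests can hide behind a healthy-looking average.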

ITRS provides tools which help to close the Observability and performance testing gaps in typical CI/CD pipelines:

A fully fledged backend-in-a-box which gives developers access to the same tools that their SRE counterparts use in production, including web UIs, configuration and query APIs, alert generation, etc.
A harmonised data collection agent for the different kinds of data which enable Observability: metrics, logs, events and (in future) traces.
Programming libraries that can be used to instrument applications so that they publish metrics to the data collection agent. Multiple language bindings are supported (Java, Python), with more on the way.
Scriptable application load testing which integrates directly into common CI/CD tooling.
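To give a feel for what instrumenting an application involves, here is a small, generic sketch in Python. It is explicitly not the ITRS library or its API. The MetricsClient class, its method names and the StatsD-style "name:value|type" line format are all assumptions chosen for illustration, and the transport is injected so the client can be exercised without a running agent.

```python
import socket
import time
from contextlib import contextmanager
from typing import Callable


def format_sample(name: str, value, kind: str) -> str:
    """Format one sample as a StatsD-style line, e.g. 'orders:1|c'."""
    return f"{name}:{value}|{kind}"


class MetricsClient:
    """Hypothetical instrumentation client; `send` delivers one line."""

    def __init__(self, send: Callable[[str], None]):
        self._send = send

    def increment(self, name: str, amount: int = 1) -> None:
        """Publish a counter increment."""
        self._send(format_sample(name, amount, "c"))

    def gauge(self, name: str, value: float) -> None:
        """Publish a point-in-time value."""
        self._send(format_sample(name, value, "g"))

    @contextmanager
    def timer(self, name: str):
        """Publish the duration of the wrapped block in milliseconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = round((time.perf_counter() - start) * 1000, 3)
            self._send(format_sample(name, elapsed_ms, "ms"))


def udp_sender(host: str, port: int) -> Callable[[str], None]:
    """Return a send function that writes each line as a UDP datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(line: str) -> None:
        sock.sendto(line.encode("utf-8"), (host, port))

    return send
```

In a test the client can be built with a plain list's append method as the transport and the captured lines asserted on; in a Development environment it would instead be given something like udp_sender pointed at wherever the local collection agent listens.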

In this way Developers can ensure that when their applications go into production they are already Observable and have been tested under realistic load. Responsibility for these key concerns is thus shared between Development and Operations teams, hence DevOps.