Avoiding poor production stability monitoring

Top 4 pitfalls that cause poor Production Stability Monitoring in an IT and Application estate


Over the course of my career I’ve seen thousands of other people’s monitoring solutions, along with their operational structures. I’ve also delivered thousands of monitoring solutions myself, and designed and run multiple monitoring operations in Run, Change and Improve capacities.

When I consider what separates the successful ones I’ve seen from those that are not, I believe every reason can be traced to 4 root causes.

They apply to any toolset or technology used; the best tools in the world used poorly will deliver poor results.

I want to share these root causes because I believe they are often overlooked, either because they are unknown or because they are known but not valued.

1.     Business Analysis

60% of IT projects fail - at best!

Various reports show statistics like this.

We often hear, “we know our apps best, so we are best placed to deliver monitoring”. Unfortunately, we usually see this approach deliver less than optimal results. Deliverables are based on preconceived requirements and preconceived feasibility. This is understandable, as users are usually in the weeds of their world. As with anyone “in the weeds” of any subject, it can be difficult to step back and challenge the process.

An app SME is vital to delivering successful monitoring. However, it takes Business Analysis skills to identify needs, root causes and subsequent requirements. Business Analysis is a skill in itself (as much an art as a science). It’s the right requirements that ensure a successful implementation delivers value.

2.     Monitoring Domain Knowledge

We often see poor monitoring practice in a monitoring solution. Whilst the requirement may be met, there are often better ways to meet it. The result is slow issue identification, slow issue communication, or gaps between cause and effect.

What reasons are given when we ask? “We never thought of that” or “it was easier to do it this way”.

Simple practices can really help reduce these negative effects. Two easy ones: 1) Always monitor as close to the primary source as possible (often not log files!). 2) Monitor both cause and effect, because each can happen independently of the other (even if they shouldn’t!).
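To make the second practice concrete, here is a minimal sketch in Python. It assumes a hypothetical message queue where the cause signal is "the consumer process is not running" and the effect signal is "the queue depth is above a threshold". The names Observation, classify and max_depth are illustrative only and do not come from any particular monitoring tool. The point is that each signal is evaluated on its own, so an effect with no known cause (or a cause that has not yet produced an effect) is still reported.

```python
from dataclasses import dataclass
from enum import Enum


class Finding(Enum):
    HEALTHY = "healthy"
    CAUSE_ONLY = "cause seen, no effect yet"
    EFFECT_ONLY = "effect seen, cause unknown"
    CAUSE_AND_EFFECT = "cause and effect both seen"


@dataclass
class Observation:
    # Hypothetical signals, ideally read from the primary source
    # (the process table, the queue itself) rather than from a log file.
    consumer_running: bool  # cause signal: is the consuming process up?
    queue_depth: int        # effect signal: messages waiting to be processed


def classify(obs: Observation, max_depth: int = 100) -> Finding:
    """Evaluate cause and effect independently, then combine the results."""
    cause = not obs.consumer_running
    effect = obs.queue_depth > max_depth
    if cause and effect:
        return Finding.CAUSE_AND_EFFECT
    if cause:
        return Finding.CAUSE_ONLY
    if effect:
        return Finding.EFFECT_ONLY
    return Finding.HEALTHY


if __name__ == "__main__":
    # The consumer is up but the queue is backing up: a cause-only check
    # would stay silent here, an effect check catches it.
    print(classify(Observation(consumer_running=True, queue_depth=500)).value)
```

A solution that only watched the consumer process would miss the case in the usage example, and one that only watched queue depth would give no early warning when the consumer stops before a backlog builds.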

3.     Time

It takes time to deliver monitoring and keep it up to date. AI helps, DevOps helps, but it still takes time. We often hear, “we’re working on that”, then 6 months later we go in for another visit and it’s still not done. The reason is always the same: “other priorities got in the way”.

4.     Monitoring Tool Knowledge

If what you’re monitoring is complicated, then your monitoring will be complicated – fact! There are often multiple ways a monitoring tool can meet a requirement. We often see the wrong choice implemented because “I didn’t know that made a difference”. This usually gets discovered after an issue. 

Our research shows poor monitoring solutions and operations increase yearly monitoring costs by 43%. This does not include business outage or reputational cost. Surely that’s a business case to address the pitfalls.