What's the point of monitoring your monitoring services?
Having lots of IT often means having lots of monitoring tools. If one of these tools stops working and cannot relay information from a crucial transaction application, how would you know?
The transaction could fail, and money could be lost, before you even knew something was wrong. This is why you need a monitor-of-monitors.
Financial services institutions have lots of monitoring tools for their sprawling IT environments to ensure availability of their business services. This is done by checking the underlying technical services enabling the business at regular intervals.
Most mission-critical financial services firms, from brokers to banks and clearing firms, deploy tools to monitor their complete IT estate. This can include infrastructure (hardware, networks, storage), middleware, databases and applications, as well as end-user monitoring.
The availability and performance of their business services play a key role in giving customers confidence that they can transact effectively on these platforms. As IT services grow more complex, spanning on-premises and cloud environments, the potential for IT service disruption, and the associated costs, increase.
Any IT service disruption or outage may have stark implications not just for their revenues, but also for their reputation. If an incident disrupts service, they face an uphill task in regaining investor trust, and may also face regulatory inquiries and fines.
What can go wrong?
IT service monitoring plays a very important role in avoiding an outage. Whenever an outage does happen, it is likely due to one or more of the following:
- Service was not being monitored (not configured/outdated monitoring)
- It was configured but not running (script error/config error)
- No alerts were configured even though monitoring was being done
- Alerts didn’t catch the operator’s attention, or were lost in a "sea of red" among too many other alerts
Thus, you can see why it is critical to monitor the health of the monitoring system itself, so that it does not become one of the root causes of an outage.
How to monitor your monitoring tools
We recommend the following five basic checks to monitor your monitoring.
1. Are all your monitoring systems working?
Check the availability of monitoring for every service to make sure it is working at all times. One way to do this is to apply a simple severity rule to the sampling status of each monitored service. If a service stops sampling, the rule raises an alert, so the sampling status itself confirms that the service is indeed being monitored.
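As a rough illustration of such a severity rule, the sketch below flags any service whose last sample is older than expected. The function names, thresholds and service names are hypothetical and are not taken from any particular monitoring product.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds: how stale the last sample may be before escalating.
WARNING_AFTER = timedelta(minutes=2)
CRITICAL_AFTER = timedelta(minutes=5)


def sampling_severity(last_sample, now):
    """Return a severity for one monitored service based on the age of its last sample."""
    if last_sample is None:
        return "CRITICAL"  # never sampled: configured but not running
    age = now - last_sample
    if age >= CRITICAL_AFTER:
        return "CRITICAL"
    if age >= WARNING_AFTER:
        return "WARNING"
    return "OK"


def unhealthy_monitors(last_samples, now):
    """Map each service whose sampling is stale or missing to its severity."""
    result = {}
    for service, last_sample in last_samples.items():
        severity = sampling_severity(last_sample, now)
        if severity != "OK":
            result[service] = severity
    return result
```

In practice the thresholds would be tuned to each service's sampling interval; a service sampled once a minute needs a different rule from one sampled once an hour.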
2. Check if all applications and physical/virtual servers are monitored
Check that all the configured application services are covered by monitoring. Note that there may be more than one application service on a single server. Also check that all physical and virtual servers are covered.
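A coverage check of this kind comes down to comparing an inventory of what should be monitored with what actually is. The sketch below assumes both are available as simple mappings from server name to application services; the `coverage_gaps` function and the data shapes are illustrative assumptions, not a product API.

```python
def coverage_gaps(inventory, monitored):
    """Compare what should be monitored with what is monitored.

    Both arguments map a server name to the set of application services on it.
    Returns (servers with no monitoring at all, services missing per monitored server).
    """
    missing_servers = sorted(set(inventory) - set(monitored))
    missing_services = {}
    for server, services in inventory.items():
        if server in monitored:
            gap = set(services) - set(monitored[server])
            if gap:
                missing_services[server] = sorted(gap)
    return missing_servers, missing_services
```

Because a single server can host several application services, the check reports gaps at both levels: servers that are absent entirely, and services that are absent on servers that are otherwise covered.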
3. Check certificate validity
Digital certificates verify the identity of the sender/receiver of an electronic message to protect your website, network or devices. Every certificate has an expiry date written into it. But once a certificate has expired, there is often no warning until the damage is done. There needs to be a way to find, and fix, digital certificates that are about to expire. Monitoring can help.
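As a minimal sketch of an expiry check, the code below uses Python's standard `ssl.cert_time_to_seconds` to work out how many days remain on a certificate. The 30-day warning window and the helper names are assumptions chosen for illustration.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter date (negative if already expired).

    `not_after` uses the format found in certificate details,
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def expiring_soon(certificates, warn_days=30, now=None):
    """Return names of certificates expiring within `warn_days`, soonest first."""
    remaining = {
        name: days_until_expiry(not_after, now)
        for name, not_after in certificates.items()
    }
    due = [name for name, days in remaining.items() if days <= warn_days]
    return sorted(due, key=remaining.get)
```

For a live endpoint, the `notAfter` string can be read from the dictionary returned by `getpeercert()` on a connected `ssl.SSLSocket`; the sketch leaves the network step out so that the date logic can be tested on its own.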
4. Visibility of monitoring health
During an incident, troubleshooting decisions such as restarting a process, restarting a module or failing over to a backup are taken on the basis of monitoring alerts. So it is important that the health status of the monitoring estate is visible to everyone who takes these decisions.
This can be done by placing an indicator of the underlying monitoring health on the mission-critical dashboards themselves. The decision maker then knows whether they are relying on correct monitoring data, or whether a break in monitoring services may be causing the alert.
Additionally, a date and time display that ticks every second assures viewers that the dashboard is current and that the screen has not frozen because of a local workstation issue.
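The two ideas above, a monitoring-health indicator and a ticking clock, can be sketched as a single check that tells the operator how far to trust an alert. Everything here (the `alert_trust` function, the five-second lag limit, the status labels) is a hypothetical illustration rather than a feature of any specific dashboard.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical limit: a dashboard that ticks every second should not lag by more than this.
MAX_DASHBOARD_LAG = timedelta(seconds=5)


def alert_trust(alert_source, monitor_health, last_tick, now):
    """Say whether an alert shown on the dashboard can be acted on as-is.

    `monitor_health` maps each monitoring source to True (healthy) or False.
    `last_tick` is the last time the dashboard clock updated.
    """
    if now - last_tick > MAX_DASHBOARD_LAG:
        return "STALE_DASHBOARD"    # the screen may be frozen; refresh before deciding
    if not monitor_health.get(alert_source, False):
        return "MONITORING_BROKEN"  # the alert may reflect a monitoring gap, not a real fault
    return "TRUSTED"
```

Treating an unknown source as unhealthy is a deliberate choice in this sketch: a source with no reported health is handled the same way as one that is known to be broken.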
5. Monitoring reporting and audit
It is important that the monitoring team publishes daily, weekly and monthly reports to all stakeholders on:
• Lists of servers covered in monitoring
• Lists of applications covered in monitoring
• Lists of existing issues in monitoring
• Lists of critical, warning alerts per application, per server
• Lists of alerts disabled or snoozed
• Lists of alert recipients configured (email & mobile)
Stakeholders are then expected to pinpoint any gaps in the configured monitoring.
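Part of such a report can be produced with a simple aggregation. The sketch below counts active critical and warning alerts per application and server, and lists snoozed alerts separately so that they are not silently dropped. The alert record fields (`id`, `application`, `server`, `severity`, `snoozed`) are assumed for illustration.

```python
from collections import Counter


def alert_summary(alerts):
    """Count active alerts per (application, server, severity); list snoozed ones separately."""
    counts = Counter()
    snoozed = []
    for alert in alerts:
        if alert.get("snoozed"):
            snoozed.append(alert["id"])
        elif alert["severity"] in ("critical", "warning"):
            counts[(alert["application"], alert["server"], alert["severity"])] += 1
    return dict(counts), snoozed
```

Keeping snoozed alerts in their own list matters because a snoozed alert that nobody reviews is exactly the kind of gap stakeholders need to see.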
At ITRS, we have been partnering with mission critical financial enterprises to continuously mature the monitoring templates for the ongoing transition of enterprise datacenters to hybrid IT. Despite the rapid changes, the core principles of effective monitoring and observability have stood the test of time.
With ITRS Geneos you can monitor and contextualize everything in one single tool, from legacy systems to cutting-edge new technology, from applications, servers, VMs, databases, middleware and cloud services to containers.