How a greater focus on operational resilience could have reduced the number of outages across financial services in 2020

Throughout 2020, outage numbers increased dramatically across the financial services sector, affecting retail banking, trading platforms, exchanges and more, impacting even organisations as institutionalised and established as the Federal Reserve. But how can financial services firms avoid such failures? Guy Warren, CEO at ITRS Group explores the importance of operational resilience in preventing future outages, and how complete system oversight is not just a nice-to-have, but a non-negotiable requirement for firms who wish to stay in the game in the post-pandemic landscape.

In recent years, operational resilience – that is, the ability of a firm to absorb and adapt to market volume spikes, failures, incidents or degraded performance – has been climbing the list of priorities for CIOs of financial institutions and regulators alike. And as the COVID-19 pandemic took global hold and undermined the stability of countries and sectors around the world, the urgency of this process was, unsurprisingly, further amplified. But despite significant strides being made amongst corporations in the race to operational resilience, the market volatility that has defined the past year has shone a light on just how far firms still have to go to become truly operationally resilient. 

Outage acceleration

Throughout 2020, firms across the financial services sector suffered outages. Nobody was immune: exchanges, asset managers and retail trading platforms, among others, were all impacted. The most recent, however, was also the one that raised arguably the most questions and concerns on Wall Street and, indeed, across the globe: the US Federal Reserve (Fed) outage.

In February 2021, for four hours, the Federal Reserve systems that execute millions of transactions a day, encompassing everything from payroll to tax refunds to interbank transfers, were disrupted by what appeared to be some sort of internal glitch. Given that this outage came in the wake of two significant disruptions to the Fed’s payment services in 2019, the failure raised significant questions about the operational resilience of the infrastructure upon which all of the U.S. relies to process payments. 

When it comes to the root cause of the problem, the Fed has remained tight-lipped. But given the scale of the outage – it affected both the automated clearinghouse system, FedACH, and the Fedwire Funds interbank transfer service – we can assume that this wasn’t simply a failure within an isolated system, but, rather, an issue with a core part of the Fed’s IT infrastructure.

Getting it right

However, in order to understand how the Fed could have avoided this failure – and, importantly, how it can avoid similar failures in future – we don’t need to know exactly what went wrong. Comprehensive IT monitoring must be central to any strategy that aims to minimise outages. By affording oversight of their entire IT estate, IT monitoring allows firms to quickly identify and resolve potential issues before they cause outages.

But as companies further enhance their digital services, their estates grow ever more complex. Because most monitoring tools and solutions are tailored to particular systems or processes, companies end up requiring an increasing number of them. A possible consequence is that firms lack a total overview of their system: they have visibility of the individual parts, but no single pane of glass across the whole estate – applications, infrastructure and data. This means that if a problem occurs in one system, they will be unable to track its effect across their entire estate.

A simple solution: complete system oversight 

The solution to this is surprisingly simple: a single monitoring tool that consolidates the output of all the different tools into one unified view. If a problem then occurs, the IT manager can identify the source, the underlying cause and the affected areas, allowing a solution to be found faster and more accurately.
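To make the idea concrete, the pattern described here can be sketched in a few lines of Python. This is an illustrative sketch only – the data structures and names (`Check`, `unified_view`) are hypothetical, not part of any particular monitoring product – but it shows how per-tool health checks from applications, infrastructure and data feeds can be collated into one estate-wide view.

```python
from dataclasses import dataclass

@dataclass
class Check:
    source: str      # which monitoring tool reported it: "app", "infra", "data"
    component: str   # the system being monitored
    status: str      # "ok" or "failing"

def unified_view(checks):
    """Collate checks from many tools into one estate-wide health summary."""
    failing = [c for c in checks if c.status != "ok"]
    return {
        "healthy": not failing,
        "failing_components": [(c.source, c.component) for c in failing],
    }

# Illustrative checks gathered from three separate monitoring tools
checks = [
    Check("infra", "payments-db", "ok"),
    Check("app", "transfer-gateway", "failing"),
    Check("data", "ach-feed", "ok"),
]
view = unified_view(checks)
```

With a single summary like `view`, an IT manager sees immediately that the estate is unhealthy and which component, reported by which tool, is the source – rather than checking each tool's dashboard in turn.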

In addition to complete system oversight, capacity planning must also be a priority for CIOs. Many of the outages that have occurred over the last 12 months have resulted from companies offering new digital services without knowing how much traffic those services can handle in a given timeframe. At a basic level, capacity planning allows firms to identify what a system can handle. At a more advanced level, it can identify specific pinch points and model future scenarios, giving CIOs crucial insight into how their systems would cope with them.
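The scenario-modelling step can be illustrated with a minimal sketch. The figures and component names below are invented for illustration; the assumption is simply that each system has a measured peak load and a known maximum throughput, and that a "pinch point" is any component that would saturate under a modelled traffic multiplier (say, a market-volatility spike doubling volumes).

```python
def headroom(peak_tps: float, capacity_tps: float) -> float:
    """Fraction of capacity still available at the observed peak load."""
    return 1.0 - peak_tps / capacity_tps

def pinch_points(systems: dict, multiplier: float = 1.0) -> list:
    """Return components that would exceed capacity under a modelled
    traffic scenario, e.g. multiplier=2.0 for a 2x volume spike."""
    return [
        name
        for name, (peak_tps, capacity_tps) in systems.items()
        if peak_tps * multiplier > capacity_tps
    ]

# (observed peak tps, maximum tps) per component - illustrative numbers
estate = {
    "payments-gateway": (400, 1000),
    "auth-service":     (800, 1000),
}
# Under normal load nothing saturates; under a 2x spike, auth-service does.
```

Even this crude model answers the question many firms could not in 2020: not just "can we handle today's traffic?" but "which component fails first when traffic doubles?"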

Firms are no longer able to simply apologise for an outage and move on. Not only are regulators cracking down, but customers are more willing than ever to switch. The current landscape is pivotal, and firms must get ahead of the curve now, or risk falling behind permanently.