A new era of resilience

The implications of changing regulations in Europe – and how financial institutions can best respond

Originally published by Thomson Reuters © Thomson Reuters.

At the end of October, WhatsApp faced its longest and furthest-reaching outage. Without their favourite group chat, millions of people worldwide suddenly found themselves cut off from the latest news from friends, family, and colleagues.

While the outage was being widely reported – and tweeted – even irregular or non-users were aware that there was a problem. At the same time, Apple also experienced a brief outage in its own iMessage and FaceTime services.

These outages are a stark reminder to businesses – if one was needed – that any unexpected system downtime can have far-reaching consequences. Reputation is the immediate and obvious casualty. But operational, productivity and financial losses are not far behind.

In its 2022 Outage Analysis Report, the Uptime Institute found that more than 60 percent of outages now cost more than $100,000 each – compared to 39 percent in 2019. Among them, 15 percent cost more than $1 million (up from 11 percent in 2019). For financial services businesses, that cost can reach millions of dollars a second.

Regulators respond
Given the deep impact that outages can have, it’s not surprising that both the UK’s Financial Conduct Authority (FCA) and the EU are strengthening their approaches to operational resilience. The FCA’s regulatory framework for operational resilience for financial institutions has been in force since March of this year.

Firms must identify any vulnerabilities in their operational resilience, establish the maximum disruption they are willing to endure (and set impact tolerance limits), and carry out any mapping and testing necessary to achieve all this. And although a three-year transition period gives firms until 31st March 2025 before they have to operate consistently within those impact tolerances, they are expected to be sufficiently prepared for disruptions such that they can respond effectively and restore services efficiently before then.
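To make the idea of an impact tolerance concrete, the short Python sketch below compares the outage durations recorded in a resilience test against the maximum disruption a firm has set for each service. It is an illustration only: the class and function names, the service names and the durations are all hypothetical, and neither regulator prescribes this data model.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ImpactTolerance:
    """Maximum disruption a firm will tolerate for one business service (hypothetical model)."""
    service: str
    max_disruption: timedelta


def find_breaches(tolerances, test_results):
    """Return, in alphabetical order, the services whose tested outage exceeded tolerance.

    test_results maps a service name to the outage duration seen in testing.
    Services with no tolerance defined are skipped rather than guessed at.
    """
    limits = {t.service: t.max_disruption for t in tolerances}
    return sorted(
        service
        for service, outage in test_results.items()
        if service in limits and outage > limits[service]
    )


# Illustrative figures only, not taken from any regulation.
tolerances = [
    ImpactTolerance("payments", timedelta(hours=2)),
    ImpactTolerance("onboarding", timedelta(hours=24)),
]
test_results = {
    "payments": timedelta(hours=3),
    "onboarding": timedelta(hours=5),
}

print(find_breaches(tolerances, test_results))
```

In this made-up scenario only the payments service would be flagged, because its three-hour test outage exceeds the two-hour tolerance set for it.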

International standardisation
Meanwhile, the EU’s regulatory bodies have just implemented their own response: the Digital Operational Resilience Act (DORA). In this case, firms have two years to comply but, in that time, they will need to identify any compliance gaps in their ICT systems, determine which of their third-party providers will be considered critical vendors, and map their level of risk. They will also need to implement a testing framework for digital resilience, determine whether their current recovery strategies are in line with new standards – and put plans in place to improve them where needed.

DORA will bring EU institutions closer to UK requirements for bolstering operational resilience. Other regulatory bodies are likely to follow their lead, and international firms operating in the EU that get their compliance in order now will be ahead of the game as similar standards are set elsewhere.

Implicit in both sets of regulations is the fact that a company cannot be operationally resilient if it does not have robust cybersecurity. Indeed, the two are symbiotic: the right systems to ensure operational resilience will, in turn, boost cybersecurity.

Cybersecurity and monitoring
With the regulatory impetus now in place, the race is on for firms to develop robust policies and procedures to ensure maximum preparedness for disruptions – and so to prevent themselves from becoming the next high-profile case of unplanned system downtime. Steps to consider include:

Nominate a Chief Resilience Officer, so that there is personal liability for poor operational resilience (as per senior management requirements like SMF24). Empower the CRO to take action where necessary.
Define goals for performance and availability, identify transaction flows to target, and remove any existing points of weakness.
Understand the relationship between performance and uptime, so that degrading performance levels can be quickly identified and fixed.
Consider machine learning to analyse the entire IT estate, and gain an accurate picture of the load different business transactions put on applications and infrastructure.
Optimise cloud use by analysing workload behaviour and demand profiles to ensure cloud is employed in a cost-effective and efficient way.
Pre-test limits to establish both the overall capacity limit of the IT estate and the bottlenecks and pinch points that affect performance.
Integrate security into operations for top-down and bottom-up operational resilience. Make cybersecurity training a key component of any onboarding and professional development to avoid staff creating or inviting new vulnerabilities – especially in hybrid working environments.
Adopt a Zero Trust approach to cybersecurity: no software can be fully trusted, and neither can any user. Make sure security measures are not just about protecting the perimeter of the estate, but limit authorised users’ activities.
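
As one illustration of the steps on performance goals and degradation, the Python sketch below tracks a rolling average of response times against a defined target and flags a service as degrading before the target is actually breached. The class name, the 200 ms target, the window size and the 80 percent warning ratio are assumptions chosen for the example, not recommended values.

```python
from collections import deque
from statistics import mean


class LatencyWatch:
    """Rolling-average check of response times against a performance goal (hypothetical)."""

    def __init__(self, target_ms, window=3, warn_ratio=0.8):
        self.target_ms = target_ms
        self.warn_ms = target_ms * warn_ratio  # early-warning threshold
        self.samples = deque(maxlen=window)    # keeps only the most recent readings

    def observe(self, latency_ms):
        """Record one reading and return 'ok', 'degrading' or 'breach'."""
        self.samples.append(latency_ms)
        average = mean(self.samples)
        if average > self.target_ms:
            return "breach"
        if average > self.warn_ms:
            return "degrading"
        return "ok"


# Illustrative readings in milliseconds for a single transaction flow.
watch = LatencyWatch(target_ms=200)
statuses = [watch.observe(reading) for reading in (100, 150, 250, 300)]
print(statuses)
```

With these made-up readings the status moves from "ok" to "degrading" on the third reading and to "breach" on the fourth, which is the early signal that lets a team act before availability is affected.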

Most of all, firms should build a robust, real-time monitoring function that works across all the various platforms, systems and technologies contained within their infrastructure. Ensuring and reporting on the resilience of today’s diverse and complex IT estates is no easy task – and not one for anything but the most sophisticated tools.

That’s because firms need a complete view of all critical business services, plus those of third parties, to identify and mitigate problems before they occur – and then to track and immediately resolve any issues that do slip through. Without effective monitoring, all of the above actions will have limited long-term effect. And in fact, compliance with DORA in particular requires firms to implement management systems that can monitor, describe, and report any major ICT-based incidents.

Bigger pictures
With all this in mind, financial firms would be forgiven for focusing exclusively on how to achieve operational resilience, while losing sight of why it is so necessary. As with any regulatory response, there is a real risk that it becomes a box-ticking exercise, a one-off response to a new regulation to be marked ‘DONE’ once completed and ignored thereafter.

But this would be a mistake. The new regulations are intended not just to protect individual institutions and their customers – essential though that is. They are also there to protect the financial system as a whole and the wider economies in which those institutions operate. Indeed, there is a strong argument for going further than legal mandates and adopting a ‘best practice’ rather than ‘minimal practice’ approach. In an interconnected marketplace, where multi-directional and integrated supply ecosystems are increasingly taking the place of linear supply chains, no business can afford a system outage. Even if it happens to someone else.