Using log file analytics to anticipate & investigate issues in an FX dealing platform
By John Cancio, MS Consultant, ITRS Group
Log file analytics has numerous use cases in financial services. An interesting one, which we have recently been working on with a global investment bank, is showing how order volume affects the performance of their FX dealing platform, in other words the latency of components within the platform such as the trade engine. At the same client, we have also been able to show the “normal” (or expected) behaviour of the system and streamline the incident investigation process when an issue occurs.
To do so, we’ve been using ITRS Insights, a big data analytics tool that uses the data in log files to extract actionable information on the performance of an IT estate and the business processes that depend on it.
How does order volume impact system performance (latency)?
Here, the objective is to show the relationship between two different data streams. We joined the two data streams through a common characteristic: order ID. We then compared order volume against the performance of the transport, using two specific metrics: the time-of-service and throughput rate.
In the screenshot immediately below, you can see order volumes (solid line) plotted against time-of-service (dotted line) for a single day. The time-of-service is given by two values: the mean and the standard deviation. No surprises here: as order volumes rise, so does the time-of-service.
Figure 1: Order volume impact on time-of-service
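The join described above can be illustrated outside the tool with a few lines of Python. This is only a minimal sketch, not how ITRS Insights does it: the record layout, the field names (order_id, minute, time_of_service_ms) and the sample values are all made up for illustration, and the one-minute bucket size is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical records parsed from two separate log streams.
orders = [
    {"order_id": "A1", "minute": "09:00"},
    {"order_id": "A2", "minute": "09:00"},
    {"order_id": "A3", "minute": "09:01"},
    {"order_id": "A4", "minute": "09:01"},
    {"order_id": "A5", "minute": "09:01"},
    {"order_id": "A6", "minute": "09:02"},  # no matching timing record
]
timings = [
    {"order_id": "A1", "time_of_service_ms": 4.0},
    {"order_id": "A2", "time_of_service_ms": 6.0},
    {"order_id": "A3", "time_of_service_ms": 8.0},
    {"order_id": "A4", "time_of_service_ms": 10.0},
    {"order_id": "A5", "time_of_service_ms": 12.0},
]


def join_and_summarise(orders, timings):
    """Join the two streams on order_id, then summarise each minute."""
    tos_by_id = {t["order_id"]: t["time_of_service_ms"] for t in timings}
    buckets = defaultdict(list)
    for order in orders:
        tos = tos_by_id.get(order["order_id"])
        if tos is not None:  # inner join: skip orders with no timing
            buckets[order["minute"]].append(tos)
    return {
        minute: {
            "volume": len(values),
            "mean": mean(values),
            "std": pstdev(values),  # population standard deviation
        }
        for minute, values in sorted(buckets.items())
    }


summary = join_and_summarise(orders, timings)
for minute, stats in summary.items():
    print(minute, stats)
```

Each row of the result holds the three values plotted in Figure 1: order volume, plus the mean and standard deviation of time-of-service for that interval.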
Although the relationship between order volume and system performance is as expected, i.e. positively correlated, visualising business data alongside IT data has never been this accessible. Previously, DevOps would have to jump between different systems to extract chunks of data and pull it all together into a spreadsheet, a very time-consuming exercise.
What does normal system behaviour look like?
Given our data set, we have attempted to show the normal, or expected, behaviour of the system in the following screenshots using order volume vs. time-of-service.
In Figure 2, we can see what happens in the dealing system during the trading day. There are occasional spikes in time-of-service (dotted lines) but it remains mostly stable. The points of interest, which would prompt further investigation, are those where time-of-service spikes even though order volumes (solid line) remain constant, as happens at 3am.
Figure 2: Dealing system normal behaviour
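One way to express "time-of-service spikes while volume stays flat" in code is to compare each interval against a baseline for the day. The sketch below is an assumption about how that check could be written, not the method used in Insights: the median baseline, the tos_factor and volume_tolerance parameters and the sample figures are all hypothetical.

```python
from statistics import median

# Hypothetical per-interval summary for one trading day.
series = [
    {"time": "01:00", "volume": 100, "mean_tos_ms": 5.0},
    {"time": "02:00", "volume": 105, "mean_tos_ms": 5.5},
    {"time": "03:00", "volume": 102, "mean_tos_ms": 14.0},
    {"time": "04:00", "volume": 98, "mean_tos_ms": 5.2},
    {"time": "09:00", "volume": 400, "mean_tos_ms": 15.0},
]


def flag_unexplained_spikes(series, tos_factor=2.0, volume_tolerance=0.2):
    """Return the times where time-of-service spikes but volume does not.

    An interval is flagged when its mean time-of-service exceeds
    tos_factor times the median, while its volume stays within
    volume_tolerance (as a fraction) of the median volume.
    """
    base_volume = median(p["volume"] for p in series)
    base_tos = median(p["mean_tos_ms"] for p in series)
    flagged = []
    for point in series:
        tos_spiked = point["mean_tos_ms"] > tos_factor * base_tos
        volume_flat = (
            abs(point["volume"] - base_volume) <= volume_tolerance * base_volume
        )
        if tos_spiked and volume_flat:
            flagged.append(point["time"])
    return flagged


unexplained = flag_unexplained_spikes(series)
print(unexplained)
```

With this sample data, the 09:00 interval is not flagged because its higher time-of-service coincides with a jump in volume, whereas the 03:00 interval is flagged because volume there is unremarkable.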
By establishing normal system behaviour, we are able to see potential breakpoints and, in turn, better mitigate operational risk.
Fast investigation of potential issues
To explain such an event, a spike in time-of-service, it is important to investigate the behaviour of subscribers (clients). Figure 3, a bubble graph, shows the mean (X axis) against the standard deviation (Y axis) of the time-of-service, while the size of each bubble represents the subscriber’s order volume. Interestingly, if you look in the upper right-hand quadrant, two subscribers’ values spike even though their volumes remain similar to those of other subscribers.
Figure 3: Time-of-service for each subscriber
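The three values behind each bubble (mean, standard deviation and volume) can be computed per subscriber as in the sketch below. The subscriber names and timings are invented sample data, and defining the "upper right-hand quadrant" as above the midpoint of both axis ranges is an assumption made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (subscriber, time-of-service in ms) pairs, one per order.
records = [
    ("sub_1", 4.0), ("sub_1", 5.0), ("sub_1", 6.0),
    ("sub_2", 5.0), ("sub_2", 5.0), ("sub_2", 5.0),
    ("sub_3", 10.0), ("sub_3", 30.0), ("sub_3", 20.0),
    ("sub_4", 4.0), ("sub_4", 6.0),
    ("sub_9", 15.0), ("sub_9", 25.0), ("sub_9", 35.0),
]


def subscriber_stats(records):
    """Compute bubble-chart values: mean (x), std (y) and volume (size)."""
    by_subscriber = defaultdict(list)
    for subscriber, tos in records:
        by_subscriber[subscriber].append(tos)
    return {
        subscriber: {
            "mean": mean(values),
            "std": pstdev(values),
            "volume": len(values),
        }
        for subscriber, values in by_subscriber.items()
    }


def upper_right_quadrant(stats):
    """Return subscribers above the midpoint of both axis ranges."""
    means = [s["mean"] for s in stats.values()]
    stds = [s["std"] for s in stats.values()]
    mid_mean = (min(means) + max(means)) / 2
    mid_std = (min(stds) + max(stds)) / 2
    return sorted(
        name
        for name, s in stats.items()
        if s["mean"] > mid_mean and s["std"] > mid_std
    )


stats = subscriber_stats(records)
outliers = upper_right_quadrant(stats)
print(outliers)
```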
Next, we compare the behaviour of both subscribers, defined here by order volume, with that of other subscribers (Figure 4).
Figure 4: Orders by subscriber
This reveals that both subscribers, sub 3 & sub 9, place a high number of orders during the window in question. Their behaviour, therefore, looks to play a major role in determining time-of-service.
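Counting orders per subscriber within the window of interest is a simple aggregation. The sketch below assumes zero-padded HH:MM timestamps (so that string comparison orders them correctly) and uses invented sample orders and an arbitrary window around 3am. It is not taken from the client's data.

```python
from collections import Counter

# Hypothetical (time, subscriber) pairs, one per order.
order_log = [
    ("02:50", "sub_3"), ("02:55", "sub_3"), ("03:00", "sub_3"),
    ("03:05", "sub_9"), ("03:10", "sub_9"), ("03:02", "sub_1"),
    ("04:00", "sub_1"), ("03:01", "sub_9"), ("03:12", "sub_3"),
]


def orders_by_subscriber(order_log, start, end):
    """Count orders per subscriber with start <= time <= end."""
    return Counter(
        subscriber
        for time, subscriber in order_log
        if start <= time <= end
    )


window_counts = orders_by_subscriber(order_log, "02:45", "03:15")
# most_common() lists the busiest subscribers first.
for subscriber, count in window_counts.most_common():
    print(subscriber, count)
```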
These findings aren't definitive, but we have managed to create a robust workflow that aids the investigation of potential breakpoints in a system using a single tool: ITRS Insights. The journey comes full circle when we feed the data from Insights to Geneos for real-time monitoring, where alerts are raised when latency crosses a certain threshold.
Figure 5: Orders crossing a latency threshold
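The threshold rule itself is straightforward. The generic sketch below shows the logic only. It does not use the Insights or Geneos APIs, and the function name, message format, sample latencies and 10 ms threshold are all hypothetical.

```python
# Hypothetical (order_id, latency in ms) pairs.
order_latencies_ms = [("A1", 4.0), ("A2", 12.5), ("A3", 9.9), ("A4", 30.0)]


def latency_alerts(order_latencies_ms, threshold_ms):
    """Yield an alert message for each order over the latency threshold."""
    for order_id, latency_ms in order_latencies_ms:
        if latency_ms > threshold_ms:
            yield (
                f"ALERT order {order_id}: latency {latency_ms} ms "
                f"exceeds {threshold_ms} ms"
            )


alerts = list(latency_alerts(order_latencies_ms, threshold_ms=10.0))
for alert in alerts:
    print(alert)
```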
To arrange a demonstration of ITRS Insights, our real-time analytics & big data storage platform, click here.