How to manage the cost of data storage
The cost of storing the data generated by and around your IT systems can be punishing, but with intelligent data management you can control and reduce it.
Excellent performance of your IT systems is essential for staying competitive. Achieving it requires data, often in vast amounts and huge variety, from and around your servers and applications. That data has to be captured and stored, which inevitably leads to higher storage costs that keep rising as data volumes grow.
Typically, monitoring and observability platforms store data from monitoring tools (metrics, events, alerts), distributed traces from APM tools or OpenTelemetry and log files from applications, containers, platforms and operating systems. Traces and logs generate very large amounts of data quickly. When combined with the other types of data needed, they require very large storage capacity, even with compression.
If your observability tooling runs in the public cloud or is delivered as a SaaS-based tool, the cost of storing this data can quickly become prohibitive. Most SaaS vendors now charge for data storage as a separate cost item to encourage customers to actively manage their data (or pay for it!). This requires intelligent data management.
What to store and for how long
Intelligent data management starts with “triage” of your data.
There are three types of data to think about as you plan your data storage:
1. Data which is being captured but isn’t relevant for an observability platform and its intended uses. This data should be filtered out as it is received and not stored.
2. Data which is used in machine learning applications such as capacity planning or noise reduction. This data should be held long enough for the machine learning algorithms to learn its patterns, which typically means weeks or months. (The schema is generally known in advance.)
3. Data which is part of an observability model (such as the OpenTelemetry data model, or data stored in the Parquet format) but is used for short-term tasks. This includes traces and spans that site reliability engineering (SRE) teams use to understand application behavior, and data stored to support incident investigation and root cause analysis. This data is generally held for a shorter period of time - days or weeks.
This separation of data by its future use is vital to reduce the waste of “stored but never used” data.
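As a rough illustration of the triage above, the three types can be mapped to retention periods in a few lines of Python. This is only a sketch: the `Record` fields, the `triage` function and the specific periods (90 and 14 days) are assumptions chosen for the example, since the article gives ranges rather than exact values.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class DataClass(Enum):
    IRRELEVANT = "irrelevant"    # type 1: filter out, never store
    ML = "ml"                    # type 2: hold for weeks or months
    SHORT_TERM = "short_term"    # type 3: hold for days or weeks


# Hypothetical retention periods; pick values that suit your own policies.
RETENTION = {
    DataClass.IRRELEVANT: None,               # not stored at all
    DataClass.ML: timedelta(days=90),
    DataClass.SHORT_TERM: timedelta(days=14),
}


@dataclass
class Record:
    source: str                # e.g. "checkout-service"
    kind: str                  # e.g. "metric", "trace", "log"
    relevant: bool = True      # False: of no use to the observability platform
    used_by_ml: bool = False   # True: feeds an ML application


def triage(record: Record) -> DataClass:
    """Assign a record to one of the three types described above."""
    if not record.relevant:
        return DataClass.IRRELEVANT
    if record.used_by_ml:
        return DataClass.ML
    return DataClass.SHORT_TERM


def retention_for(record: Record) -> Optional[timedelta]:
    """Return how long to keep the record, or None if it should be dropped."""
    return RETENTION[triage(record)]
```

For example, `retention_for(Record("checkout-service", "trace"))` falls through to the short-term type and returns the 14-day period assumed here.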
How to separate the data
The filtering can be done at the source, at the edge or at the observability platform. It is probably easier to do it at the platform level, since the platform knows what data it can and can’t sensibly store. This requires a powerful data-processing engine to be part of the observability platform. The engine needs to be easy to use and give you control over what is stored.
When data is used in an ML-based application, it should be flagged so that the appropriate storage policy can be set up. If it is not flagged as part of an ML application, it is general observability data and can have a different, shorter-term data management policy.
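Filtering as data is received might look something like the following sketch. The event fields (`level`, `source`) and the two drop rules are invented for illustration; a real platform would expose its own rule format.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, str]
DropRule = Callable[[Event], bool]

# Hypothetical drop rules: each returns True for events that should NOT be stored.
DROP_RULES: List[DropRule] = [
    lambda e: e.get("level") == "DEBUG",         # verbose debug logging
    lambda e: e.get("source") == "healthcheck",  # routine probe traffic
]


def filter_events(events: Iterable[Event],
                  rules: List[DropRule] = DROP_RULES) -> Iterator[Event]:
    """Yield only the events that no drop rule matches."""
    for event in events:
        if not any(rule(event) for rule in rules):
            yield event


incoming = [
    {"source": "api", "level": "ERROR", "msg": "timeout"},
    {"source": "api", "level": "DEBUG", "msg": "cache hit"},
    {"source": "healthcheck", "level": "INFO", "msg": "ok"},
]
to_store = list(filter_events(incoming))  # only the ERROR event survives
```

Keeping the rules as a plain list makes it easy to add or remove a rule without touching the ingestion loop.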
Data should be captured and initially stored at full fidelity (sometimes called raw data). While averages, aggregates and extremes should be calculated, the original raw data should be retained for a reasonable period, or until a policy states that it is ready for archival or deletion. (Logs often have to be kept for many years, but an observability platform is not the right place for that.)
Many observability platforms don’t have good compression or data management capabilities, so they aggregate very quickly and delete the raw data within minutes, hours or days. This approach, known as aggregating into time buckets, loses the detailed information very quickly.
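A policy of this kind can be expressed as a simple age check. The sketch below is an assumption about how such a policy might be modelled; `LifecyclePolicy` and `next_action` are hypothetical names, not part of any particular platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class LifecyclePolicy:
    keep_raw_for: timedelta   # how long raw data stays in the platform
    archive: bool             # True: archive afterwards; False: delete


def next_action(created_at: datetime, now: datetime,
                policy: LifecyclePolicy) -> str:
    """Return "keep", "archive" or "delete" for a piece of raw data."""
    if now - created_at < policy.keep_raw_for:
        return "keep"
    return "archive" if policy.archive else "delete"


# Hypothetical policies: logs are archived elsewhere, traces are deleted.
log_policy = LifecyclePolicy(keep_raw_for=timedelta(days=30), archive=True)
trace_policy = LifecyclePolicy(keep_raw_for=timedelta(days=7), archive=False)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
old = now - timedelta(days=45)
```

With these example values, `next_action(old, now, log_policy)` returns `"archive"` and `next_action(old, now, trace_policy)` returns `"delete"`.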
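To see what is lost, consider a minimal sketch of bucketing. The sample values and the 60-second bucket width are made up for the example, and `bucket` is a hypothetical helper rather than any product's implementation.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def bucket(samples: List[Tuple[int, float]],
           width_s: int) -> Dict[int, Dict[str, float]]:
    """Summarise (timestamp_seconds, value) samples into fixed-width buckets."""
    groups = defaultdict(list)
    for ts, value in samples:
        groups[ts - ts % width_s].append(value)  # key is the bucket start time
    return {
        start: {
            "avg": mean(values),
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
        for start, values in sorted(groups.items())
    }


# Six raw samples, 15 seconds apart, with a short spike at t=30.
raw = [(0, 10.0), (15, 12.0), (30, 95.0), (45, 11.0), (60, 10.0), (75, 12.0)]
summary = bucket(raw, width_s=60)
```

Here the first bucket keeps an average of 32.0 and a maximum of 95.0, but once the raw samples are deleted there is no way to tell when within the minute the spike occurred or how long it lasted.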
Please feel free to contact us to discuss your observability storage policies and costs.