How to save on cloud costs with workload management – Part 1: Idle time management
Up to 35% of your cloud bill is waste. You're not alone, though. Almost every single user of the cloud is spending significantly more than they need to. Infrastructure as a service (IaaS) remains the fastest growing component of the cloud, and IaaS alone is set to pass 50 billion dollars per year in a market worth over 260 billion dollars. At a 35% waste rate, that's as much as 17.5 billion dollars wasted globally in IaaS alone.
The most common source of this waste is having machines run when they don't need to. Every hour, minute or even second a machine runs unnecessarily will be money down the drain.
Cloud providers offer discounts if you commit to running your workloads for a long period of time. This may seem the quickest and easiest way to reduce your bill, but a discount on server time that wasn't required is a false saving and can still lead to you paying more than you need to in the long run.
If you buy Reserved Instances (RIs) for idle machines, you'll see an immediate discount on your bill, but you will have bought RIs for machines you shouldn't have. Long term, this is going to increase your costs as you find yourself committed to paying for capacity you clearly didn't need. The cloud is about elasticity and capacity on demand - the capacity you need when you need it. So, a better long-term approach is to first understand your demand profile across all your workloads and switch off machines when they don't need to run. Then, once you've done that, analyse the remaining workloads and determine whether they are optimally configured, or 'rightsized'.
Only after completing this good housekeeping should you consider reserving instances and buying into savings plans.
Get the measure of it
Telling someone to turn a machine off is easy but you need to be sure the evidence is there to support the change.
Some tools will use average CPU over a time period and, if that average is low, then they will suggest that machine can be switched off. However, a machine could be highly active for 5 minutes every hour and using an average in this way will miss these important peaks.
Some tools might let you analyse peak times as well. However, a peak is extremely sensitive to one-off events. Something as simple as a weekly virus scan on an otherwise idle machine would mean this machine would not be identified as idle. This means you need access to a full set of statistical measures of demand for every instance.
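To make the two failure modes concrete, here is a minimal sketch using made-up per-minute CPU samples. The two machines, their load patterns and the variable names are all hypothetical, chosen only to mirror the examples in the text: one machine that is busy for 5 minutes every hour, and one that is idle apart from a single virus scan.

```python
from statistics import mean

# Hypothetical per-minute CPU samples (percent) covering one week.
MINUTES_PER_WEEK = 7 * 24 * 60

# Machine A: 90% CPU for the first 5 minutes of every hour, otherwise 1%.
bursty = [90.0 if minute % 60 < 5 else 1.0 for minute in range(MINUTES_PER_WEEK)]

# Machine B: 1% CPU all week, apart from one 30-minute virus scan at 80%.
scanned = [80.0 if minute < 30 else 1.0 for minute in range(MINUTES_PER_WEEK)]

for name, samples in (("bursty", bursty), ("scanned", scanned)):
    print(f"{name}: mean={mean(samples):.1f}% peak={max(samples):.1f}%")
```

On this invented data the bursty machine averages about 8.4% CPU, so an average-only rule might flag it as a switch-off candidate despite its hourly work. The scanned machine peaks at 80%, so a peak-only rule would never flag it even though it does almost nothing all week.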
Looking at average and peak isn't enough. You need to know what the machine was doing 99%, 95%, 75% of the time and you need to be able to base thresholds on that. At ITRS, we build a detailed understanding of server activity at very fine-grained data levels.
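As an illustration of what such a percentile summary might look like, the sketch below uses Python's standard `statistics.quantiles`. It is not the ITRS implementation; the function name `demand_profile`, the sample data and the choice of the "inclusive" interpolation method are assumptions made for the example.

```python
from statistics import quantiles

def demand_profile(samples):
    """Summarise CPU samples (percent) as percentile levels plus the peak."""
    # quantiles(..., n=100) returns the 99 cut points p1..p99,
    # so the p-th percentile sits at index p - 1.
    cuts = quantiles(samples, n=100, method="inclusive")
    return {
        "p75": cuts[74],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples),
    }

# Hypothetical data: 1,000 samples, mostly idle, with a few busy spells.
samples = [2.0] * 960 + [20.0] * 25 + [60.0] * 14 + [100.0]
profile = demand_profile(samples)
print(profile)
```

With this data the 75th and 95th percentiles are both 2% CPU, the 99th percentile is 60% and the peak is 100%. Which of those levels should drive a threshold is a per-application decision.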
The first chart below shows a time series for an instance running for six months. The bottom chart shows a six-month statistical summary expressed as a 'boxplot', which shows the variance of this metric over the modelling time period. You can quickly see from this that for 95% of the past six months this machine has been at 25% utilisation or lower.
On the bottom right, you can also see the daily boxplot statistical summaries. These levels of summary are available for configurable periods of time, giving you the control to base idle time analysis on business-critical time frames, ignoring overnight quiet times or weekends.
You need full configurability over the definition of 'idle'. Only with the full picture of utilisation across multiple metrics and enough configuration levers to pull to categorise thresholds will you be able to get a complete picture of what machines can be classed as 'idle'.
You know your applications better than anyone, so you know which of those should have their idle times based on peak, which can be based on the 99th or 95th percentiles, and which could be safely based on average. That means that when the system determines a machine is idle, low or high activity, you can be confident that’s the case.
At ITRS, we provide this detailed 'demand profile' of activity for every machine that has been running, even if it is now terminated.
See the big picture
We present this in a number of ways, but one of the more visual approaches is the use of what we call 'the Timeburst'. This gives you a unique visualisation showing every instance over a three-month window, their levels of activity and the costs associated with running them at that point in time.
The left-hand side categorises each of your instances in a configurable way. In this example, we've categorised them by 'State' (i.e., whether the machine is running or has been terminated or stopped) and 'Activity' (that is, how active that machine is during its run time).
The main part of the visualisation is a heatmap showing all instance activity over a period of time, in this case, four weeks. There are around 10,000 instances shown here. It even shows you those machines that have been terminated or stopped. Using the sliding colour scale below the heatmap, you can see that the areas of blue are times of very low levels of CPU activity. The areas that are shown in black indicate those times when instances were not running and, as a result, not incurring any on-demand costs.
The chart along the top shows the count of machines running at any point in time and the costs of those running machines. From this picture, you can quickly see a relatively reliable weekly pattern of activity. You can see half of the machines shown being turned off at the weekend. In the heatmap, you can also see peaks of activity for short periods at the weekend and lots of low levels of activity. Lots of idle time.
You can focus purely on the machines that are currently running but are categorised as 'idle' based on what has been configured as your definition of idle. You can see below that the ‘Timeburst’ visualisation automatically groups all 'idle' machines together into a single category.
There are a few interesting things to see here. Firstly, some machines are marked as 'idle' even though they have regular short periods of high activity. The reason for this is that the definition of idle has been set to include any machine that spends at least 95% of its run-time at less than 5% CPU. I could make this more sensitive by using a 99th percentile. In this example, there are 509 machines that meet this classification of 'idle'. They are all long running machines, some of which are stopped over the weekend but spend the rest of the week idle. It's estimated that switching these machines off could yield savings of up to $83,000 on on-demand costs.
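The rule described here (at least 95% of run time at less than 5% CPU) can be sketched in a few lines. The function names and sample data are hypothetical, and the two thresholds are parameters precisely because the definition of idle is meant to be configurable.

```python
def fraction_below(samples, cpu_threshold):
    """Fraction of CPU samples (percent) strictly below cpu_threshold."""
    return sum(1 for s in samples if s < cpu_threshold) / len(samples)

def is_idle(samples, cpu_threshold=5.0, min_fraction=0.95):
    """True if at least min_fraction of run-time samples are under cpu_threshold."""
    if not samples:
        return False  # no run-time data, so nothing to classify
    return fraction_below(samples, cpu_threshold) >= min_fraction

# Hypothetical machine: 97% of samples at 1% CPU, 3% in short bursts at 85%.
mostly_idle = [1.0] * 97 + [85.0] * 3

print(is_idle(mostly_idle))                     # True with the 95% rule
print(is_idle(mostly_idle, min_fraction=0.99))  # False with a 99% rule
```

This shows why a machine with regular short bursts can still be classed as idle under the 95% rule, and why moving to the 99th percentile excludes it.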
There is more valuable idle time analysis that can be carried out. At ITRS, we also consider periodicity, which is the analysis of each workload to determine if that workload demonstrates reliable periods of idle time over several weeks.
If you understand this, then you can save money by stopping these machines during those times and then re-activating them when required.
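One simple way to think about periodicity is to bucket observations by hour of the week and keep only the slots that were idle in every week observed. The sketch below does that; it is an assumption about how such an analysis could work rather than a description of the ITRS method, and the function name, thresholds and generated data are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def reliable_idle_slots(hourly_cpu, cpu_threshold=5.0, min_weeks=4):
    """Return (weekday, hour) slots that were idle in every observed week.

    hourly_cpu maps the datetime at the start of each hour to the peak CPU
    percentage seen in that hour. weekday follows datetime.weekday(),
    so 0 is Monday and 6 is Sunday.
    """
    observations = defaultdict(list)
    for start, cpu in hourly_cpu.items():
        observations[(start.weekday(), start.hour)].append(cpu < cpu_threshold)
    return sorted(
        slot
        for slot, flags in observations.items()
        if len(flags) >= min_weeks and all(flags)
    )

# Hypothetical workload: busy 08:00-18:00 Monday to Friday, idle otherwise.
first_hour = datetime(2024, 1, 1)  # a Monday
hourly_cpu = {}
for offset in range(4 * 7 * 24):  # four weeks of hourly data
    t = first_hour + timedelta(hours=offset)
    busy = t.weekday() < 5 and 8 <= t.hour < 18
    hourly_cpu[t] = 60.0 if busy else 1.0

slots = reliable_idle_slots(hourly_cpu)
print(len(slots), "of 168 weekly hour slots are reliably idle")
```

Requiring a minimum number of weeks guards against scheduling a shutdown on the strength of one quiet week. A real analysis would likely tolerate the occasional exception rather than demand that every single week be idle.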
An example of this can be seen below:
This view is highlighting those machines that are currently running but could be reliably switched off during particular time periods and the on-demand costs that could be saved as a result.
For example, the largest potential savings could be achieved by switching off machines that are idle for entire days at a time. In the first entry of the above example, on the 'daily' row for Sunday, there are 176 instances that are idle for the entirety of every Sunday. Switching them off would save $5,252 every Sunday. This would add up to a saving of up to $273,104 per year.
Looking at machines that are idle for only parts of a day, the next biggest savings are highlighted in red above. Every Saturday between midnight and 8pm, 145 servers are reliably idle. Switching those machines off for that time alone could save $4,222 every Saturday, which adds up to almost $220,000 per year.
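The annual figures quoted above follow from multiplying the saving per weekly window by the number of weeks in a year. A quick check, assuming 52 weeks per year (which is what the quoted $273,104 implies):

```python
WEEKS_PER_YEAR = 52  # assumption: one idle window per week, 52 weeks a year

def annual_saving(saving_per_week):
    """Annualise a saving that recurs once a week (same currency units)."""
    return saving_per_week * WEEKS_PER_YEAR

print(annual_saving(5252))  # all-day Sunday window: 273104
print(annual_saving(4222))  # Saturday midnight to 8pm window: 219544
```

These are upper bounds on on-demand costs only. They assume the machines stay reliably idle in those windows for the whole year.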
What if I've already paid for this stuff?
If you reserve instances before managing idle times and rightsizing, then switching these machines off may not have a direct impact on your bill. That doesn't mean you shouldn't do it.
Switching off idle machines will allow RIs to be used by other instances that match the RI instance family/region and operating system combination. Switching off a very large unused instance of a particular family could free up space to run a much larger number of small instances without incurring on-demand costs.
This shouldn't be seen as a one-time job prior to committing to some kind of savings mechanism. Staying on top of your cloud workloads continuously will ensure you get the most out of your committed cloud investment in the long run.
Making the change
Once you've identified the machines and savings, the hard part is convincing their owners to switch off the machines.
With visuals and reporting such as the above, you can clearly demonstrate to application owners how little they are making use of some components of their estate.
We typically base our recommendations on at least three months of data, providing a full picture of the large components of the estate to summarise long periods of low activity. If terminating the instances isn't possible, then perhaps rightsizing should be recommended. If the machine can't be switched off, perhaps it could be run in a cheaper machine configuration.
Even if a machine is covered by a group of RIs, if it is genuinely idle, switching it off should still be considered. That would free up capacity in your group of reservations that could be used by other machines, potentially allowing you to grow in some areas of the business without incurring further on-demand costs or lower savings rates from other mechanisms such as savings plans.
After tagging your instances consistently, the first step every consultant, expert or tool provider will tell you to take to save money in the cloud will be to deal with your idle machines.
However, you need to be doing that in a way that meets your business needs, provides the configurability you need to deal with different workload management requirements and backs up all recommendations with detailed analysis and the evidence you need to make changes to your organisation.