You’ve done it: You’ve managed to migrate your apps to the cloud. Or maybe you finally set up your DevOps team on their own cloud so they can do all their testing without disrupting production machines. Or you’ve finally gotten your arms around the whole “continuous delivery” idea. Now, how do you know when there is something wrong with your apps? Or your infrastructure? Is it when an employee calls your department in a panic? Or when you get a dreaded text at 3 a.m. saying something cryptic like “APPLICATION DOWN”?
Any downtime or deviation from the usual workflow is enough to send your employees—or even your customers—hunting you down to find out what happened. Unfortunately, many problems go undetected for quite some time, simply because IT departments have not come to grips with what it means to monitor their assets in the cloud.
Cloud Monitoring is Different from Server Monitoring
Cloud monitoring is the use of special tools to monitor and manage cloud computing architecture, infrastructure, and services. This definition can be deceptive, however. After all, when dealing with the cloud, aren’t we still concerned with things like uptime and compute resources? What really makes cloud monitoring so different from traditional server monitoring?
For one thing, enterprise organizations these days rarely rely on a handful of apps talking to a single database. They are increasingly deploying a host of containers and microservices, with dynamic instances spun up on demand. This means that many of the monitoring services designed for on-prem just won’t cut it in a cloud environment (especially a multi-cloud environment). Cloud monitoring solutions need to take into account things like:
Autoscaling. Autoscaling is not an issue for on-prem servers, but it is one of the core features organizations want when moving to the cloud. Application instances should be scaled down during non-peak times, for example, and scaled back up as demand increases. Failing to do so could mean unnecessary costs.
Dependencies. The web of dependencies between applications, databases, and so on did not matter much when all hardware was on-prem, and applications were largely built and deployed independently of each other. In the cloud, keeping track of dependencies is much more important—especially if you are working with multiple clouds and often shift workloads.
Compliance. Monitoring for compliance in a cloud environment needs to be an ongoing activity, especially because deployment tends to be continuous. Doing compliance checks in batches every quarter just won’t do.
Performance. Monitoring performance in the cloud is much more than monitoring uptime. Measuring an application’s response time, as well as the response times of every step or function being called, is vital to understanding what’s hindering performance.
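To make the last point concrete, here is a minimal sketch of measuring per-function response times, not just uptime. The `timed` decorator, the in-memory `TIMINGS` store, and `fetch_user` are all hypothetical stand-ins; a real system would ship these measurements to a monitoring backend rather than keep them in a dictionary.

```python
import time
from functools import wraps

# In-memory store of observed latencies, keyed by function name.
# (Hypothetical; a real monitoring agent would export these metrics.)
TIMINGS = {}

def timed(fn):
    """Record how long each call takes, so per-step latency can be tracked."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            TIMINGS.setdefault(fn.__name__, []).append(elapsed_ms)
    return wrapper

@timed
def fetch_user(user_id):
    time.sleep(0.01)  # stand-in for a database or downstream service call
    return {"id": user_id}

fetch_user(42)
print(TIMINGS["fetch_user"])  # e.g. one measurement of roughly 10 ms
```

Instrumenting each step this way is what lets you see *which* call in a request chain is slow, instead of only knowing that the application as a whole is responding slowly.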
Cloud Monitoring Needs
If monitoring cloud resources is so different from on-prem monitoring, it’s obvious that new and different tools are needed. But that’s where the trouble starts: There are dozens of different monitoring tools on the market, both from public cloud platform providers and from third parties. (Full disclosure: We have one such tool, TRiA, designed to help with monitoring, security, and compliance in multi-cloud environments.)
Let’s take a step back and, instead of comparing the various cloud monitoring tools out there, discuss what sorts of cloud monitoring needs an enterprise organization might have, and how these could affect the search.
There are some fairly standard KPIs that any tool should track:
- Uptime/application availability (in real time). This should cover all applications and infrastructure.
- Length of downtime. If something does cease working, the system should keep statistics on how long it takes to rectify the situation.
- Usage. This can include anything and everything from CPU and memory usage to network latency and load balancing.
- Error rates. How often do failed requests happen, as a percentage of overall requests?
- Latency. On average, how long does it take an app to process a request?
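The last two KPIs are simple to compute once request data is collected. The sketch below assumes a hypothetical log of `(status_code, latency_ms)` pairs and derives the error rate (failed requests as a percentage of the total) and average latency from it.

```python
# Hypothetical request log: (HTTP status code, latency in ms) per request.
requests = [(200, 120), (200, 95), (500, 310), (200, 101), (503, 450)]

# Error rate: failed requests (5xx) as a percentage of all requests.
failed = sum(1 for status, _ in requests if status >= 500)
error_rate = failed / len(requests) * 100

# Latency: mean time taken to process a request.
avg_latency = sum(latency for _, latency in requests) / len(requests)

print(f"error rate: {error_rate:.1f}%")    # 2 of 5 requests failed -> 40.0%
print(f"avg latency: {avg_latency:.1f} ms")  # 215.2 ms
```

A real monitoring tool tracks these continuously over sliding windows and per endpoint, but the arithmetic is the same.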
All monitoring tools should be able to alert the appropriate parties when something is amiss. But how alerts are handled is as important as what they are alerting you to.
- Proper frequency. If alerts happen all the time or are excessively intrusive, people will begin to ignore them (the “false alarm” effect). If alerts do not occur when something serious happens, though, then what’s the point of the monitoring tool?
- Specificity. The alert itself should be routed to the person best able to take action and should contain enough information so that person knows what happened, how serious it was, and how to fix it.
- Forewarning. Alerts should not only inform people when something has gone wrong but also look for trends and alert the appropriate parties when there is an impending issue.
- Outlier alerting. This method flags unusual patterns—for example, unexpected spikes in traffic or requests—which can indicate that something has happened that needs addressing.
Managing cloud costs is one of the top challenges for DevOps and cloud excellence teams. Good monitoring tools are an essential part of containing these costs by helping you spot waste, manage resources efficiently, and automate certain cost-saving measures.
- Resource costs. The cost of storage, CPU use, and so on. This might also include things like egress charges, especially in a multi-cloud environment.
- Idle resources. These should be spun down or de-instanced to prevent cloud sprawl…and the associated waste.
- Reserved instances. When possible, your tool should identify opportunities for using reserved instances at lower costs.
- Open and closed tickets. Don’t forget the “human” side of the cost equation. Excessive tickets could indicate structural problems that aren’t being addressed, not to mention the staff tied up with troubleshooting instead of doing more important work. Tracking closed tickets is important, too, to make sure issues are being resolved in a timely fashion.
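Spotting idle resources, the second item above, amounts to scanning utilization data for instances that sit below some activity threshold. This is a hedged sketch: the `Instance` record, the 5% CPU threshold, and the example fleet are all assumptions for illustration, not a real provider API.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    avg_cpu_pct: float   # average CPU over the lookback window
    hourly_cost: float   # what the instance costs per hour to run

def find_idle(instances, cpu_threshold=5.0):
    """Return instances whose average CPU sits below the threshold --
    candidates for spinning down to curb cloud sprawl."""
    return [i for i in instances if i.avg_cpu_pct < cpu_threshold]

fleet = [
    Instance("web-1", 62.0, 0.10),
    Instance("batch-old", 1.2, 0.40),    # forgotten test box
    Instance("db-replica", 3.8, 0.25),
]

idle = find_idle(fleet)
monthly_waste = sum(i.hourly_cost for i in idle) * 24 * 30

print([i.name for i in idle])          # ['batch-old', 'db-replica']
print(f"${monthly_waste:.2f}/month")   # $468.00/month
```

A monitoring tool would run this kind of scan continuously and either alert on the findings or feed them into automated shutdown policies.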