In August, as part of my ongoing discussion about monitoring maturity and how to make your monitors, reports and alerts more effective, I offered to give some real-life examples. In September, I provided a range of tweaks for monitor/alert combinations to make them more effective, actionable and relevant.
But sharp-eyed readers noted that these examples—although helpful—weren’t what I originally promised.
Back in August I said,
This calculation will give you the total cost for that issue in that period of time. If you do it for both the before and after phases, you know how much you’ve saved the company just by adding or improving monitoring and alerting. To help you out, I’ll share real-life examples.
Clearly, I wasn’t talking about alert examples in that context. What I meant was that I would share some real-life examples of the before-and-after math used not only to justify the effort and cost that good monitoring requires, but also to promote it as the value-recovery tool it is.
Before I could dig into that type of data, however, I needed to be sure that readers had a chance to see the changes I was proposing: how relatively simple tweaks could turn an alert from redundant noise into a notification that is a meaningful call to action.
Now I’m ready to fulfill my original promise. Here are some examples of the impact—in dollars, cents, hours and even staff—that good monitoring has had in various data centers. I learned of them from a variety of folks I’ve met in my journey as a monitoring engineer—people I’m proud to call both colleagues and friends.
As you read through these examples, focus less on the particulars (“Well, we don’t do it that way in our shop, so this is worthless”) and more on the process. Think about questions such as the following:
- Which metrics does each IT professional highlight as proof of their success?
- What is the net result?
- How do they phrase it?
Those are the gems you can mine and then use as your own the next time you have to answer the question, “Why do we bother with all this monitoring stuff, anyway?”
Also note that by its nature, a blog post won’t have all the details you might want. So, I’d humbly suggest you also check out my (completely free) e-book on the topic.
The Plural of Anecdote
Shared by Josh Biggley, monitoring engineer at Cardinal Health
“Although we developed some sophisticated alerts for ‘disk full’ monitoring, we were still creating over 700 tickets per month for this one event type, the largest volume of tickets for a single event in the enterprise. As we discussed it with the server teams, they pointed out that in the majority of cases, clearing the temp directory was all they had to do to resolve the issue.
So, we tested and rolled out a solution to do that task automatically. We always open a ticket, but if the disk-clearing script succeeds, we update the ticket as ‘deferred’ rather than ‘open’ and the support team isn’t paged out.
To give you a sense of how successful that is, we deferred 408 tickets between August 8 and August 31. We presume that responding to a ‘disk full’ event takes staff an average of 15 minutes. But even at that minimal rate, we’re talking about 17.7 events per day, or 4.4 hours of staff time every day.
Automation is saving us half of one staff member each and every day for just this one event type alone!”
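The before-and-after math in that story is simple enough to reproduce yourself. Here is a minimal sketch of it in Python; the ticket count and 15-minute handling time come from Josh's quote, while the 23-day window is my assumption (it's what makes the numbers line up with his 17.7-events-per-day figure):

```python
# Rough check of the "disk full" deferral math from the anecdote above.
# Ticket count and handling time come from the quote; the 23-day window is an assumption.
deferred_tickets = 408        # tickets auto-remediated and deferred, Aug 8-31
window_days = 23              # assumed length of that reporting window
minutes_per_ticket = 15       # average staff time to handle one such event

events_per_day = deferred_tickets / window_days                    # ~17.7 events/day
hours_saved_per_day = events_per_day * minutes_per_ticket / 60     # ~4.4 hours/day

print(f"{events_per_day:.1f} events/day, {hours_saved_per_day:.1f} hours/day of staff time")
# 4.4 hours is a little over half of an 8-hour shift: the "half of one staff member" claim.
```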
Flagging the Problem Children, Then Expelling Them From School
Shared by Peter Monaghan, CBCP, SCP, ITIL ver.3
“We’ve set up similar scripting actions for disk space alerts and unexpected Windows service stops. Since we don’t have a 24x7 NOC, we have introduced ‘repeated’ alerts for Tier 1 IT services, so that if one of those services fails after hours, scripts are initiated on alert generation to restore service automatically. If services aren’t restored, alerts are generated every 30–45 minutes (depending on the service) until either the script or manual intervention restores them.
Once I set up monitoring for these services, I was able to focus on the repeat offenders and ultimately reduce chronic alerts and outages by over 70 percent across all Tier 1–3 IT services.”
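For readers who have never built this kind of self-healing-plus-repeated-alert loop, here is a minimal sketch of the pattern Peter describes. To be clear, this is not his actual tooling: the service name and the notify() helper are placeholders of my own, and most alerting platforms implement the re-alert loop for you.

```python
import subprocess
import time

# Sketch of the "automatic restart + repeated alert" pattern described above.
# Not Peter's actual implementation; the service name and notify() helper are
# placeholders, and most alerting platforms provide the re-alert loop natively.

SERVICE = "ExampleTier1Service"   # hypothetical Windows service name
REALERT_MINUTES = 30              # re-alert interval (30-45 minutes in the anecdote)

def service_running(name: str) -> bool:
    """Return True if the Windows service reports RUNNING."""
    result = subprocess.run(["sc", "query", name], capture_output=True, text=True)
    return "RUNNING" in result.stdout

def try_restart(name: str) -> None:
    """Attempt to start the stopped service."""
    subprocess.run(["sc", "start", name], capture_output=True, text=True)

def notify(message: str) -> None:
    """Placeholder for the paging/ticketing integration."""
    print(message)

def handle_alert(name: str = SERVICE) -> None:
    # On the initial alert, run the self-healing script first.
    try_restart(name)
    while not service_running(name):
        # Still down: keep alerting until the script or a human restores it.
        notify(f"{name} is still down; retrying and paging again")
        time.sleep(REALERT_MINUTES * 60)
        try_restart(name)
    notify(f"{name} restored")
```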
Signal to Noise
Shared by Rick Schroeder, network administrator
“We experience approximately four unscheduled WAN outages per month. Our affected sites range from 5 to 300 employees.
Using monitoring automation, we reduced downtime by 15 to 60 minutes per site, per incident, simply through increased visibility: getting the right information to the right teams faster so they could respond sooner. That included providing near real-time outage information to the WAN vendor so the vendor could begin verifying and testing, perhaps bringing in last-mile and intermediate providers, or rolling a truck and technician to the site.
Some of the outages were six hours or more, if the issue was caused by backhoe fade or trees falling on aerial WAN connections.
Depending on the number of users affected, we calculated our savings at $86,400 per year for our larger sites and $375,000 for an outage at our data center, which only happens once every three years or so but, in addition to being costly, is also very visible.
And all of that doesn’t consider the intangible costs for customers lost and impressions left.”
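Rick doesn't show the worksheet behind those dollar figures, but you can back out a per-incident number from what he does give, which helps when you build your own version of this argument. A quick sketch, assuming the roughly-four-outages-per-month rate applies to the larger sites:

```python
# Back-of-envelope check using only the figures Rick gives above.
# Assumes the ~4 outages/month rate applies to the larger sites.
outages_per_year = 4 * 12              # "approximately four unscheduled WAN outages per month"
annual_savings_large_site = 86_400     # Rick's figure for the larger sites

savings_per_incident = annual_savings_large_site / outages_per_year
print(f"~${savings_per_incident:,.0f} recovered per incident")   # ~$1,800

# The underlying structure is always people affected x loaded hourly cost x hours
# avoided; with up to 300 employees at a large site and 15-60 minutes of downtime
# avoided per incident, ~$1,800 per incident is a plausible order of magnitude.
```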
Nobody Wants to Hear “Oops” in the Operating Room
Shared by C. A. Hunt
“At one point, the hospital where I worked had an in-house-developed EMR system that was distributed across several servers. The system would fail if a single service on any of those servers stopped unexpectedly. That application team’s way of managing this failure was logging into each server to check the service/process status, one at a time, until they found the stopped or hung service. They would then restart the servers in a specific order that would allow the service/process to restart and the servers to reconnect with each other, bringing the EMR application back online.
Most outages averaged 1–2 hours and required a team of 4–6 developers ($50–$80 per hour) to search each server for that one failed service/process. Implementing monitoring alone saved the company no less than $200 per outage, just for the development team’s time, not to mention the cost of the entire hospital staff whose work was held up.
Another direct benefit was that by reducing the downtime from one hour to 30 minutes, hospital staff would only miss or delay one appointment instead of two or three. The impact on delayed surgeries was even more noticeable.”
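The quote doesn't show the arithmetic behind that "$200 per outage" floor, but one plausible reading is the smallest team at the lowest rate losing about an hour to the manual server-by-server hunt. A hedged sketch of that reading:

```python
# One way to reach the "$200 per outage" floor from the figures in the quote.
# The quote doesn't show this arithmetic, so pairing the minimums is my assumption.
min_team_size = 4           # smallest team cited (4-6 developers)
min_hourly_rate = 50        # lowest rate cited ($50-$80 per hour)
search_hours_avoided = 1.0  # roughly the manual server-by-server hunt monitoring removes

savings_floor = min_team_size * min_hourly_rate * search_hours_avoided
print(f"${savings_floor:.0f} per outage, before counting the idled hospital staff")  # $200
```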
The Right Information at the Right Time
Shared by Paul Guido, Systems Team, Regional Bank in south Texas
“We monitor approximately 100 data circuits. According to our records, a circuit goes out about 50 times a year owing to carrier issues, floods, dry spells, drunk drivers, ice storms and—everybody’s favorite—backhoes.
Before monitoring, we would have to wait anywhere from 30 to 120 minutes before IT knew about the issue. And just to see whether the issue was on our side, a person would be dispatched to the site; the typical drive time was an hour. If the smart jack showed an error on the carrier side, a ticket would then be opened manually. On top of that, circuit outages cost a branch operation six to eight additional hours per incident.
Now that we’ve implemented monitoring automation, a ticket is opened with the proper carrier within 10 minutes. Our company and the carrier jointly investigate the outage to see what is at fault (us or them). And because all the required information is gathered and included in the ticket, every hour of downtime now costs the branch only three to four hours of lost productivity.
In a single year, circuit monitoring saves 250 to 400 hours of productivity.”
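Again, the bottom line is given but not the worksheet. Backing the per-incident figure out of Paul's numbers is straightforward and makes the claim easier to defend:

```python
# Back-of-envelope check using only the figures Paul gives above.
circuit_outages_per_year = 50        # "a circuit goes out about 50 times a year"
annual_hours_saved = (250, 400)      # Paul's stated range of productivity recovered

low = annual_hours_saved[0] / circuit_outages_per_year    # 5 branch-hours per incident
high = annual_hours_saved[1] / circuit_outages_per_year   # 8 branch-hours per incident
print(f"{low:.0f}-{high:.0f} branch-hours recovered per incident")

# That range is consistent with dropping from "six to eight additional hours per
# incident" before monitoring to a fraction of that after.
```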
The Price of a Cup of Coffee
Shared by Kimberly, Sysadmin
“We have an application that approximately 470 developers use to drive their development cycle. The application would occasionally hang during an overnight re-indexing job and would be unavailable when the majority of the devs arrived at work between 7 and 8 a.m.
I’m more of a 9 o’clocker myself, so either I’d get a call during my not-so-awake moments or they would have to wait for me to arrive at the office. Our devs make about $33.50 per hour. Before automation, the math looked like this: 470 developers down for two hours at roughly $15,745 per hour, or approximately $31,490 in lost employee productivity.
I was able to set up a monitor for that re-indexing job, and when it hung, the monitor executed a script that restarted the service and sent a message to our operations center to look at the app and confirm that it was accessible after the restart.
From that point on, the devs could start to work right away, the company didn’t lose employee productivity and I got to drink my first cup of coffee uninterrupted. Truly priceless!”
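Kimberly's math is the most self-contained of the bunch, and it's a template worth stealing: headcount times loaded hourly rate times hours of unavailability. Restated as a quick check using the figures from her quote:

```python
# Kimberly's lost-productivity math, restated with the figures from the quote.
developers = 470
loaded_hourly_rate = 33.50    # "about $33.50 per hour"
outage_hours = 2              # unavailable from roughly 7-8 a.m. until she could respond

cost_per_hour = developers * loaded_hourly_rate      # ~$15,745
cost_per_outage = cost_per_hour * outage_hours       # ~$31,490
print(f"${cost_per_hour:,.0f}/hour, ${cost_per_outage:,.0f} per two-hour outage")
```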
Wrapping It All Up
Hopefully, over the last few posts I’ve shared a few tidbits that you can take with you: the idea that the time has come for monitoring to take its place as a separate discipline in IT; that monitoring, while not necessarily always easy, is simple; that alerts should (and can) always be actionable; and that—when done correctly—monitoring provides demonstrable value to the business in the form of cost savings and avoided risks. Finally, when presented in a compelling way, this all should speak to any executive who is paying attention and help you make the case for even more (awesome) monitoring to come.
About the Author
Leon Adato, SolarWinds Head Geek and longtime IT systems management and monitoring expert, discusses all things data center in this ongoing series.