What is your server’s mean time between failures?
Though we’d all love it if our equipment worked perfectly all of the time, sadly this is not the case. Sometimes you’re going to experience issues with your server, and it’s up to you to resolve them. The mean time between failures, known as MTBF, is the average amount of time your organization has between server failures, and that number is a great way to gauge the health of your server. The more time you have between failures, the better shape your server is in. If you find yourself with a very short MTBF, there may be something deeper occurring that needs to be investigated and remedied, or it may be time to replace your server.
With the concept of MTBF, it is assumed that
- the root cause of the failures is repairable, and
- you actually repair the underlying issue.
If you want to get super technical and calculate your server’s MTBF, you’ll have to do some record keeping. Every time your server goes down, record the time it went down and the time it came back up. Once you’ve recorded three or more failures, you can calculate your MTBF.
For the first failure, set the downtime as day 1. Then, for each subsequent failure, subtract the uptime of the previous failure from the downtime of the current failure.
Let’s say your first failure occurs on day 1, and the system is back up on day 3. Then the second failure occurs on day 40, and the system is back up on day 42. The third failure occurs on day 75, and the system is back up on day 76. To calculate your MTBF, you find the time between failures 1 and 2 and then between failures 2 and 3. Downtime minus uptime for failure 2 is 40 − 3 = 37. Downtime minus uptime for failure 3 is 75 − 42 = 33. Add 37 + 33 to get 70, then divide by the number of intervals (2). Your MTBF is 35 days.
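The calculation above is easy to automate. Here’s a minimal sketch in Python (the `mtbf` function name and the tuple-of-day-numbers format are illustrative choices, not part of any standard tool):

```python
def mtbf(events):
    """Compute mean time between failures.

    events: chronological list of (down_day, up_day) pairs,
    one pair per failure.
    """
    if len(events) < 2:
        raise ValueError("Need at least two failures to compute an interval")
    # Time between failures: current downtime minus previous uptime.
    gaps = [events[i][0] - events[i - 1][1] for i in range(1, len(events))]
    return sum(gaps) / len(gaps)

# The worked example from this post:
print(mtbf([(1, 3), (40, 42), (75, 76)]))  # prints 35.0
```

The same approach works with real timestamps (e.g. `datetime` objects) instead of day numbers, since subtraction gives you a duration either way.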
One of the biggest causes of a short mean time between failures is something known as “root cause versus symptoms.” In our next blog post, check out why fixing only the symptoms will hurt, not help, your server!