Mean Time to Recovery (MTTR) Measure
A Mean Time to Recovery (MTTR) Measure is a software system operations measure based on how long it generally takes to restore service when a service incident occurs (e.g., unplanned outage, service impairment).
- Example(s):
- See: Redundant Array of Independent Disks, DORA Measure.
References
2022
- (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Mean_time_to_recovery Retrieved:2022-5-27.
- Mean time to recovery (MTTR) [1] is the average time that a device will take to recover from any failure. Examples of such devices range from self-resetting fuses (where the MTTR would be very short, probably seconds), up to whole systems which have to be repaired or replaced.
The MTTR would usually be part of a maintenance contract, where the user would pay more for a system whose MTTR was 24 hours than for one of, say, 7 days. This does not mean the supplier is guaranteeing to have the system up and running again within 24 hours (or 7 days) of being notified of the failure. It does mean the average repair time will tend towards 24 hours (or 7 days). A more useful maintenance contract measure is the maximum time to recovery, which can be easily measured and for which the supplier can be held accountable.
Note that some suppliers will interpret MTTR to mean 'mean time to respond' and others will take it to mean 'mean time to replace/repair/recover/resolve'. The former indicates that the supplier will acknowledge a problem and initiate mitigation within a certain timeframe. Some systems may have an MTTR of zero, which means that they have redundant components which can take over the instant the primary one fails, see RAID for example. However, the failed device involved in this redundant configuration still needs to be returned to service and hence the device itself has a non-zero MTTR even if the system as a whole (through redundancy) has an MTTR of zero. But, as long as service is maintained, this is a minor issue.
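The distinction drawn above between an average repair time and a maximum time to recovery can be illustrated with a minimal Python sketch. The function names and the sample incident timestamps are hypothetical and chosen only for illustration; each incident is assumed to be recorded as a (failed, restored) pair of timestamps.

```python
from datetime import datetime, timedelta

def recovery_durations(incidents):
    """Return the recovery duration (restored - failed) for each incident."""
    return [restored - failed for failed, restored in incidents]

def mean_time_to_recovery(incidents):
    """Average recovery duration across all recorded incidents."""
    durations = recovery_durations(incidents)
    return sum(durations, timedelta()) / len(durations)

def max_time_to_recovery(incidents):
    """Longest single recovery duration across all recorded incidents."""
    return max(recovery_durations(incidents))

# Hypothetical incident log: (time of failure, time service was restored).
incidents = [
    (datetime(2022, 1, 3, 9, 0), datetime(2022, 1, 3, 21, 0)),   # 12 hours
    (datetime(2022, 2, 10, 8, 0), datetime(2022, 2, 11, 8, 0)),  # 24 hours
    (datetime(2022, 3, 5, 0, 0), datetime(2022, 3, 6, 12, 0)),   # 36 hours
]

mttr = mean_time_to_recovery(incidents)
worst = max_time_to_recovery(incidents)
print(mttr, worst)
```

In this sample the mean is 24 hours even though one incident took 36 hours to resolve, which is why an average alone does not bound any individual recovery.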
2022
- https://devops-research.com/quickcheck.html
- QUOTE: ... For the primary application or service you work on, how long does it generally take to restore service when a service incident or a defect that impacts users occurs (for example, unplanned outage, service impairment)?
- More than six months
- One to six months
- One week to one month
- One day to one week
- Less than one day
- Less than one hour
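The answer options of the quick check question above can be read as buckets for a measured restore time. The following is a minimal Python sketch of such a mapping, not part of the DORA tooling: the function name `restore_time_bucket` is hypothetical, and the boundary choices (one month taken as 30 days, six months as 180 days, and a duration exactly on a boundary assigned to the longer bucket) are assumptions the source does not specify.

```python
from datetime import timedelta

# Upper bounds (exclusive) for each answer option, shortest first.
# Assumption: one month = 30 days and six months = 180 days.
_BUCKETS = [
    (timedelta(hours=1), "Less than one hour"),
    (timedelta(days=1), "Less than one day"),
    (timedelta(weeks=1), "One day to one week"),
    (timedelta(days=30), "One week to one month"),
    (timedelta(days=180), "One to six months"),
]

def restore_time_bucket(duration):
    """Map a restore duration to the matching quick check answer option."""
    for upper_bound, label in _BUCKETS:
        if duration < upper_bound:
            return label
    return "More than six months"

print(restore_time_bucket(timedelta(hours=5)))
```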
2017
- (Forsgren et al., 2017) ⇒ Nicole Forsgren, Monica Chiarini Tremblay, Debra VanderMeer, and Jez Humble. (2017). “DORA Platform: DevOps Assessment and Benchmarking.” In: International Conference on Design Science Research in Information System and Technology, Springer. doi:10.1007/978-3-319-59144-5_27
- QUOTE: ... IT performance is comprised of four measurements: lead time for changes, deploy frequency, mean time to restore (MTTR), and change fail rate. ... MTTR is how long it generally takes to restore service when a service incident occurs (e.g., unplanned outage, service impairment). ...