Alerting System

From GM-RKB
Revision as of 02:23, 6 November 2019 by Gmelli

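An [[Alerting System]] is a system that evaluates alerting rules and notifies an oncall rotation about problems with a service.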

The summary is:
```Summary

When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier:

   Pages should be urgent, important, actionable, and real.
   They should represent either ongoing or imminent problems with your service.
   Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
   You should almost always be able to classify the problem into one of: availability & basic functionality; latency; correctness (completeness, freshness and durability of data); and feature-specific problems.
   Symptoms are a better way to capture more problems more comprehensively and robustly with less effort.
   Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.
   The further up your serving stack you go, the more distinct problems you catch in a single rule. But don't go so far you can't sufficiently distinguish what's going on.
   If you want a quiet oncall rotation, it's imperative to have a system for dealing with things that need timely response, but are not imminently critical.

```
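
As a rough illustration of the checklist above, the following Python sketch routes an alert to a page, a ticket queue, a dashboard, or nowhere. It is only one possible reading of the summary: the `Alert` fields, the `Route` values, and the `route` function are hypothetical names introduced here, not part of any real alerting tool, and the ordering of the checks is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # The four problem classes named in the summary.
    AVAILABILITY = "availability & basic functionality"
    LATENCY = "latency"
    CORRECTNESS = "correctness"
    FEATURE_SPECIFIC = "feature-specific"


class Route(Enum):
    PAGE = "page"            # interrupt the oncall person now
    TICKET = "ticket"        # needs a timely response, not imminently critical
    DASHBOARD = "dashboard"  # cause-based context shown next to symptoms
    DROP = "drop"            # noisy alert: not actionable or not real


@dataclass
class Alert:
    name: str
    category: Category
    symptom_based: bool  # True if it measures user-visible behavior
    urgent: bool
    important: bool
    actionable: bool
    real: bool           # False for a known false positive


def route(alert: Alert) -> Route:
    """Hypothetical routing policy; the check order is an assumption."""
    # Err on the side of removing noisy alerts.
    if not (alert.actionable and alert.real):
        return Route.DROP
    # Avoid alerting directly on causes; surface them as context instead.
    if not alert.symptom_based:
        return Route.DASHBOARD
    # Pages should be urgent, important, actionable, and real.
    if alert.urgent and alert.important:
        return Route.PAGE
    # Everything else goes to the non-paging queue.
    return Route.TICKET


if __name__ == "__main__":
    alerts = [
        Alert("HighErrorRatio", Category.AVAILABILITY, True, True, True, True, True),
        Alert("SlowBatchExport", Category.LATENCY, True, False, True, True, True),
        Alert("DiskNearlyFull", Category.AVAILABILITY, False, True, True, True, True),
        Alert("FlakyProbe", Category.FEATURE_SPECIFIC, True, True, True, True, False),
    ]
    for a in alerts:
        print(f"{a.name}: {route(a).value}")
```

In this sketch the `TICKET` route stands in for the "system for dealing with things that need timely response, but are not imminently critical"; a real deployment would attach it to whatever bug or ticket tracker the team already uses.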