2022 ObservabilityEngineering
- (Majors et al., 2022) ⇒ C. Majors, L. Fong-Jones, and G. Miranda. (2022). “Observability Engineering.” O'Reilly Media. ISBN:9781492076414
Subject Headings: Distributed System Observability, Observability-Driven Development.
Notes
Cited By
2022
- Authors' video blog post: https://youtu.be/FZRpQOaePFU
Quotes
Book Overview
https://www.oreilly.com/library/view/observability-engineering/9781492076438/
Observability is critical for building, changing, and understanding the software that powers complex modern systems. Teams that adopt observability are much better equipped to ship code swiftly and confidently, identify outliers and aberrant behaviors, and understand the experience of each and every user. This practical book explains the value of observable systems and shows you how to practice observability-driven development.
Authors Charity Majors, Liz Fong-Jones, and George Miranda from Honeycomb explain what constitutes good observability, show you how to improve upon what you're doing today, and provide practical dos and don'ts for migrating from legacy tooling, such as metrics, monitoring, and log management. You'll also learn the impact observability has on organizational culture (and vice versa).
You'll explore:
- How the concept of observability applies to managing software at scale
- The value of practicing observability when delivering complex cloud native applications and systems
- The impact observability has across the entire software development lifecycle
- How and why different functional teams use observability with service-level objectives
- How to instrument your code to help future engineers understand the code you wrote today
- How to produce quality code for context-aware system debugging and maintenance
- How data-rich analytics can help you debug elusive issues
Table of Contents
Preface
   Who This Book Is For · Why We Wrote This Book · What You Will Learn · Conventions Used in This Book · Using Code Examples · O’Reilly Online Learning · How to Contact Us · Acknowledgments
I. The Path to Observability
1. What Is Observability?
   The Mathematical Definition of Observability · Applying Observability to Software Systems · Mischaracterizations About Observability for Software · Why Observability Matters Now · Is This Really the Best Way? · Why Are Metrics and Monitoring Not Enough? · Debugging with Metrics Versus Observability · The Role of Cardinality · The Role of Dimensionality · Debugging with Observability · Observability Is for Modern Systems · Conclusion
2. How Debugging Practices Differ Between Observability and Monitoring
   How Monitoring Data Is Used for Debugging · Troubleshooting Behaviors When Using Dashboards · The Limitations of Troubleshooting by Intuition · Traditional Monitoring Is Fundamentally Reactive · How Observability Enables Better Debugging · Conclusion
3. Lessons from Scaling Without Observability
   An Introduction to Parse · Scaling at Parse · The Evolution Toward Modern Systems · The Evolution Toward Modern Practices · Shifting Practices at Parse · Conclusion
4. How Observability Relates to DevOps, SRE, and Cloud Native
   Cloud Native, DevOps, and SRE in a Nutshell · Observability: Debugging Then Versus Now · Observability Empowers DevOps and SRE Practices · Conclusion
II. Fundamentals of Observability
5. Structured Events Are the Building Blocks of Observability
   Debugging with Structured Events · The Limitations of Metrics as a Building Block · The Limitations of Traditional Logs as a Building Block · Unstructured Logs · Structured Logs · Properties of Events That Are Useful in Debugging · Conclusion
6. Stitching Events into Traces
   Distributed Tracing and Why It Matters Now · The Components of Tracing · Instrumenting a Trace the Hard Way · Adding Custom Fields into Trace Spans · Stitching Events into Traces · Conclusion
7. Instrumentation with OpenTelemetry
   A Brief Introduction to Instrumentation · Open Instrumentation Standards · Instrumentation Using Code-Based Examples · Start with Automatic Instrumentation · Add Custom Instrumentation · Send Instrumentation Data to a Backend System · Conclusion
8. Analyzing Events to Achieve Observability
   Debugging from Known Conditions · Debugging from First Principles · Using the Core Analysis Loop · Automating the Brute-Force Portion of the Core Analysis Loop · This Misleading Promise of AIOps · Conclusion
9. How Observability and Monitoring Come Together
   Where Monitoring Fits · Where Observability Fits · System Versus Software Considerations · Assessing Your Organizational Needs · Exceptions: Infrastructure Monitoring That Can’t Be Ignored · Real-World Examples · Conclusion
III. Observability for Teams
10. Applying Observability Practices in Your Team
   Join a Community Group · Start with the Biggest Pain Points · Buy Instead of Build · Flesh Out Your Instrumentation Iteratively · Look for Opportunities to Leverage Existing Efforts · Prepare for the Hardest Last Push · Conclusion
11. Observability-Driven Development
   Test-Driven Development · Observability in the Development Cycle · Determining Where to Debug · Debugging in the Time of Microservices · How Instrumentation Drives Observability · Shifting Observability Left · Using Observability to Speed Up Software Delivery · Conclusion
12. Using Service-Level Objectives for Reliability
   Traditional Monitoring Approaches Create Dangerous Alert Fatigue · Threshold Alerting Is for Known-Unknowns Only · User Experience Is a North Star · What Is a Service-Level Objective? · Reliable Alerting with SLOs · Changing Culture Toward SLO-Based Alerts: A Case Study · Conclusion
13. Acting on and Debugging SLO-Based Alerts
   Alerting Before Your Error Budget Is Empty · Framing Time as a Sliding Window · Forecasting to Create a Predictive Burn Alert · The Lookahead Window · The Baseline Window · Acting on SLO Burn Alerts · Using Observability Data for SLOs Versus Time-Series Data · Conclusion
14. Observability and the Software Supply Chain
   Why Slack Needed Observability · Instrumentation: Shared Client Libraries and Dimensions · Case Studies: Operationalizing the Supply Chain · Understanding Context Through Tooling · Embedding Actionable Alerting · Understanding What Changed · Conclusion
IV. Observability at Scale
15. Build Versus Buy and Return on Investment
   How to Analyze the ROI of Observability · The Real Costs of Building Your Own · The Hidden Costs of Using “Free” Software · The Benefits of Building Your Own · The Risks of Building Your Own · The Real Costs of Buying Software · The Hidden Financial Costs of Commercial Software · The Hidden Nonfinancial Costs of Commercial Software · The Benefits of Buying Commercial Software · The Risks of Buying Commercial Software · Buy Versus Build Is Not a Binary Choice · Conclusion
16. Efficient Data Storage
   The Functional Requirements for Observability · Time-Series Databases Are Inadequate for Observability · Other Possible Data Stores · Data Storage Strategies · Case Study: The Implementation of Honeycomb’s Retriever · Partitioning Data by Time · Storing Data by Column Within Segments · Performing Query Workloads · Querying for Traces · Querying Data in Real Time · Making It Affordable with Tiering · Making It Fast with Parallelism · Dealing with High Cardinality · Scaling and Durability Strategies · Notes on Building Your Own Efficient Data Store · Conclusion
17. Cheap and Accurate Enough: Sampling
   Sampling to Refine Your Data Collection · Using Different Approaches to Sampling · Constant-Probability Sampling · Sampling on Recent Traffic Volume · Sampling Based on Event Content (Keys) · Combining per Key and Historical Methods · Choosing Dynamic Sampling Options · When to Make a Sampling Decision for Traces · Translating Sampling Strategies into Code · The Base Case · Fixed-Rate Sampling · Recording the Sample Rate · Consistent Sampling · Target Rate Sampling · Having More Than One Static Sample Rate · Sampling by Key and Target Rate · Sampling with Dynamic Rates on Arbitrarily Many Keys · Putting It All Together: Head and Tail per Key Target Rate Sampling · Conclusion
18. Telemetry Management with Pipelines
   Attributes of Telemetry Pipelines · Routing · Security and Compliance · Workload Isolation · Data Buffering · Capacity Management · Data Filtering and Augmentation · Data Transformation · Ensuring Data Quality and Consistency · Managing a Telemetry Pipeline: Anatomy · Challenges When Managing a Telemetry Pipeline · Performance · Correctness · Availability · Reliability · Isolation · Data Freshness · Use Case: Telemetry Management at Slack · Metrics Aggregation · Logs and Trace Events · Open Source Alternatives · Managing a Telemetry Pipeline: Build Versus Buy · Conclusion
V. Spreading Observability Culture
19. The Business Case for Observability
   The Reactive Approach to Introducing Change · The Return on Investment of Observability · The Proactive Approach to Introducing Change · Introducing Observability as a Practice · Using the Appropriate Tools · Instrumentation · Data Storage and Analytics · Rolling Out Tools to Your Teams · Knowing When You Have Enough Observability · Conclusion
20. Observability’s Stakeholders and Allies
   Recognizing Nonengineering Observability Needs · Creating Observability Allies in Practice · Customer Support Teams · Customer Success and Product Teams · Sales and Executive Teams · Using Observability Versus Business Intelligence Tools · Query Execution Time · Accuracy · Recency · Structure · Time Windows · Ephemerality · Using Observability and BI Tools Together in Practice · Conclusion
21. An Observability Maturity Model
   A Note About Maturity Models · Why Observability Needs a Maturity Model · About the Observability Maturity Model · Capabilities Referenced in the OMM · Respond to System Failure with Resilience · Deliver High-Quality Code · Manage Complexity and Technical Debt · Release on a Predictable Cadence · Understand User Behavior · Using the OMM for Your Organization · Conclusion
22. Where to Go from Here
   Observability, Then Versus Now · Additional Resources · Predictions for Where Observability Is Going
Foreword
Over the past couple of years, the term “observability” has moved from the niche fringes of the systems engineering community to the vernacular of the software engineering community. As this term gained prominence, it also suffered the (alas, inevitable) fate of being used interchangeably with another term with which it shares a certain adjacency: “monitoring.”
What then followed was every bit as inevitable as it was unfortunate: monitoring tools and vendors started co-opting and using the same language and vocabulary used by those trying to differentiate the philosophical, technical, and sociotechnical underpinnings of observability from that of monitoring. This muddying of the waters wasn’t particularly helpful, to say the least. It risked conflating “observability” and “monitoring” into a homogenous construct, thereby making it all the more difficult to have meaningful, nuanced conversations about the differences.
To treat the difference between monitoring and observability as a purely semantic one is a folly. Observability isn’t purely a technical concept that can be achieved by buying an “observability tool” (no matter what any vendor might say) or adopting the open standard du jour. To the contrary, observability is more a sociotechnical concept. Successfully implementing observability depends just as much, if not more, on having the appropriate cultural scaffolding to support the way software is developed, deployed, debugged, and maintained, as it does on having the right tool at one’s disposal.
In most (perhaps even all) scenarios, teams need to leverage both monitoring and observability to successfully build and operate their services. But any such successful implementation requires that practitioners first understand the philosophical differences between the two. What separates monitoring from observability is the state space of system behavior, and moreover, how one might wish to explore the state space and at precisely what level of detail. By “state space,” I’m referring to all the possible emergent behaviors a system might exhibit during various phases: starting from when the system is being designed, to when the system is being developed, to when the system is being tested, to when the system is being deployed, to when the system is being exposed to users, to when the system is being debugged over the course of its lifetime. The more complex the system, the more expansive and protean its state space.
Observability allows for this state space to be painstakingly mapped out and explored in granular detail with a fine-tooth comb. Such meticulous exploration is often required to better understand unpredictable, long-tail, or multimodal distributions in system behavior. Monitoring, in contrast, provides an approximation of overall system health in broad brushstrokes.
It thus follows that everything, from the data that is collected to this end, to how this data is stored, to how this data can be explored to better understand system behavior, differs between the purposes of monitoring and observability.
Over the past couple of decades, the ethos of monitoring has influenced the development of myriad tools, systems, processes, and practices, many of which have become the de facto industry standard. Because these tools, systems, processes, and practices were designed for the explicit purpose of monitoring, they do a stellar job to this end. However, they cannot—and should not—be rebranded or marketed to unsuspecting customers as “observability” tools or processes. Doing so would provide little to no discernible benefit, in addition to running the risk of being an enormous time, effort, and money sink for the customer.
Furthermore, tools are only one part of the problem. Building or adopting observability tooling and practices that have proven to be successful at other companies won’t necessarily solve all the problems faced by one’s organization, inasmuch as a finished product doesn’t tell the story behind how the tooling and concomitant processes evolved, what overarching problems it aimed to solve, what implicit assumptions were baked into the product, and more.
Building or buying the right observability tool won’t be a panacea without first instituting a conducive cultural framework within the company that sets teams up for success. A mindset and culture rooted in the shibboleths of monitoring—dashboards, alerts, static thresholds—isn’t helpful to unlock the full potential of observability. An observability tool might have access to a very large volume of very granular data, but successfully making sense of the mountain of data—which is the ultimate arbiter of the overall viability and utility of the tool, and arguably that of observability itself!—requires a hypothesis-driven, iterative debugging mindset.
Simply having access to state-of-the-art tools doesn’t automatically cultivate this mindset in practitioners. Nor does waxing eloquent about nebulous philosophical distinctions between monitoring and observability without distilling these ideas into cross-cutting practical solutions. For instance, there are chapters in this book that take a dim view of holding up logs, metrics, and traces as the “three pillars of observability.” While the criticisms aren’t without merit, the truth is that logs, metrics, and traces have long been the only concrete examples of telemetry people running real systems in the real world have had at their disposal to debug their systems, and it was thus inevitable that the narrative of the “three pillars” cropped up around them.
What resonates best with practitioners building systems in the real world isn’t abstract, airy-fairy ideas but an actionable blueprint that addresses and proposes solutions to pressing technical and cultural problems they are facing. This book manages to bridge the chasm that yawns between the philosophical tenets of observability and the praxis thereof, by providing a concrete (if opinionated) blueprint of what putting these ideas into practice might look like.
Instead of focusing on protocols or standards or even low-level representations of various telemetry signals, the book envisages the three pillars of observability as the triad of structured events (or traces without a context field, as I like to call them), iterative verification of hypotheses (or hypothesis-driven debugging, as I like to call it), and the “core analysis loop.” This holistic reframing of the building blocks of observability from first principles helps underscore that telemetry signals alone (or tools built to harness these signals) don’t make system behavior maximally observable. The book does not shy away from shedding light on the challenges one might face when bootstrapping a culture of observability in an organization, and provides valuable guidance on how to go about it in a sustainable manner that should stand observability practitioners in good stead for long-term success.
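To ground that triad, here is a minimal sketch in Go of what a single structured event might look like; it is not code from the book, and the service name and field names are invented for illustration. The point is that one wide event per unit of work carries arbitrarily many dimensions, including high-cardinality ones such as a user ID, for the core analysis loop to slice across:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

func main() {
	// One wide, structured event per unit of work: arbitrarily many
	// key-value fields, including high-cardinality ones like user_id.
	// All field names and values here are illustrative assumptions.
	event := map[string]any{
		"timestamp":   time.Now().UTC().Format(time.RFC3339),
		"service":     "checkout-service",
		"endpoint":    "/cart/checkout",
		"user_id":     "u-123",
		"build_id":    "2022-04-19.3",
		"duration_ms": 142.7,
		"status_code": 200,
	}

	// Serialize and ship to whatever backend stores your events.
	b, err := json.Marshal(event)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```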
...
I. The Path to Observability
1. What Is Observability?
2. How Debugging Practices Differ Between Observability and Monitoring
3. Lessons from Scaling Without Observability
4. How Observability Relates to DevOps, SRE, and Cloud Native
II. Fundamentals of Observability
5. Structured Events Are the Building Blocks of Observability
6. Stitching Events into Traces
7. Instrumentation with OpenTelemetry
In the previous two chapters, we described the principles of structured events and tracing. Events and traces are the building blocks of observability that you can use to understand the behavior of your software applications. You can generate those fundamental building blocks by adding instrumentation code into your application to emit telemetry data alongside each invocation. You can then route the emitted telemetry data to a backend data store, so that you can later analyze it to understand application health and help debug issues.
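As a rough sketch of that workflow (this is not the book's own example; the service name, span name, and attribute keys are assumptions made for illustration), the following Go program uses the OpenTelemetry SDK to export spans over OTLP to a backend and wraps one unit of work in a custom-instrumented span:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans over OTLP/gRPC; by default this targets a collector
	// or backend listening on localhost:4317.
	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		log.Fatalf("creating exporter: %v", err)
	}

	// Wire the exporter into a tracer provider and register it globally.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Acquire a tracer and wrap one unit of work in a span.
	tracer := otel.Tracer("checkout-service")
	ctx, span := tracer.Start(ctx, "process-order")

	// Custom fields added to the span become queryable dimensions later.
	span.SetAttributes(
		attribute.String("user.id", "u-123"),
		attribute.Int("cart.items", 3),
	)

	// ... application logic that does the actual work goes here ...

	span.End()
}
```

In practice you would start with automatic instrumentation for your frameworks and libraries, then layer custom spans and attributes like these on top, which is the progression this chapter follows.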
...
...
References
Author | Title | Note | Year
---|---|---|---
C. Majors, L. Fong-Jones, and G. Miranda | Observability Engineering | 2022 ObservabilityEngineering | 2022