Chapter 10 overview

How to Monitor Your Systems

Learn how to get visibility into your systems, including server metrics, application logs, structured events, distributed traces, alerts, on-call rotations, and more.

Key ideas you'll learn

  • Logs and log aggregation
  • Metrics, dashboards, alerts
  • Observability and tracing

Examples you'll try

  • Create a dashboard in CloudWatch
  • Do structured logging with Node.js
  • Set up Route 53 health checks and alerts

Table of contents

10.1 Logs
10.1.1 Log Levels
10.1.2 Log Formatting
10.1.3 Structured Logging
10.1.4 Log Files and Rotation
10.1.5 Log Aggregation
10.2 Metrics
10.2.1 Types of Metrics
  • Availability metrics
  • Business metrics
  • Application metrics
  • Server metrics
  • Team metrics
10.2.2 Using Metrics
  • Collect metrics (instrumentation)
  • Store metrics
  • Visualize and analyze metrics
10.2.3 Example: Metrics in CloudWatch
10.3 Events
10.3.1 Observability
10.3.2 Tracing
10.3.3 Testing in Production (TIP)
10.4 Alerts
10.4.1 Triggers
10.4.2 Notifications
10.4.3 On-Call
10.4.4 Incident Response
10.4.5 Example: Alerts in CloudWatch
10.5 Conclusion

Related Books

Practical Monitoring: Effective Strategies for the Real World

By Mike Julian (O'Reilly)

Do you have a nagging feeling that your monitoring needs improvement, but you just aren’t sure where to start or how to do it? Are you plagued by constant, meaningless alerts? Does your monitoring system routinely miss real problems? This is the book for you. Mike Julian lays out a practical approach to designing and implementing effective monitoring—from your enterprise application down to the hardware in a datacenter, and everything between. Practical Monitoring provides you with straightforward strategies and tactics for designing and implementing a strong monitoring foundation for your company. This book takes a unique vendor-neutral approach to monitoring. Rather than discuss how to implement specific tools, Mike teaches the principles and underlying mechanics behind monitoring so you can implement the lessons in any tool.

Observability Engineering: Achieving Production Excellence

By Charity Majors, Liz Fong-Jones, George Miranda (O'Reilly)

Observability is critical for engineering, managing, and improving complex business-critical systems. Through this process, any software engineering team can gain a deeper understanding of system performance, so you can perform ongoing maintenance and ship the features your customers need. This practical book explains the value of observable systems and shows you how to build an observability-driven development practice. Authors Charity Majors, Liz Fong-Jones, and George Miranda from Honeycomb explain what constitutes good observability, show you how to make improvements from what you're doing today, and provide practical dos and don'ts for migrating from legacy tooling, such as metrics monitoring and log management. You'll also learn the impact observability has on organization culture.

Effective Monitoring and Alerting: For Web Operations

By Slawek Ligus (O'Reilly)

With this practical book, you’ll discover how to catch complications in your distributed system before they develop into costly problems. Based on his extensive experience in systems ops at large technology companies, author Slawek Ligus describes an effective data-driven approach for monitoring and alerting that enables you to maintain high availability and deliver a high quality of service. Learn methods for measuring state changes and data flow in your system, and set up alerts to help you recover quickly from problems when they do arise. If you’re a system operator waging the daily battle to provide the best performance at the lowest cost, this book is for you.

The Site Reliability Workbook: Practical Ways to Implement SRE

By Betsy Beyer, Niall Murphy, David Rensin, Kent Kawahara, and Stephen Thorne (O'Reilly)

In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment. This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t. Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.

How to Measure Anything: Finding the Value of Intangibles in Business

By Douglas W. Hubbard (Wiley)

recommended

Now updated with new measurement methods and new examples, How to Measure Anything shows managers how to inform themselves in order to make less risky, more profitable business decisions. This insightful and eloquent book will show you how to measure those things in your own business, government agency or other organization that, until now, you may have considered 'immeasurable,' including customer satisfaction, organizational flexibility, technology risk, and technology ROI.

Other Related Resources

How long will your experiment take?

By Dan McKinley (Calculator)

recommended

A calculator to estimate how long you'll need to run an experiment (e.g., A/B test) to get statistically meaningful results.

Logging Tools

Log4j

Apache Log4j is a versatile, industrial-grade Java logging framework composed of an API, its implementation, and components to assist the deployment for various use cases.

Winston

recommended used-in-book

A logger for just about everything.

zap

Blazing fast, structured, leveled logging in Go.

Loguru

Python logging made (stupidly) simple.

logrotate

The logrotate utility is designed to simplify the administration of log files on a system which generates a lot of log files.

Log Aggregation Tools

Loggly

Log Analysis / Log Management by Loggly: the world's most popular log analysis & monitoring in the cloud. Free trial. See why ⅓ of the Fortune 500 use us!

Papertrail

Frustration-free log management. Seamlessly manage logs from apps, servers, and cloud services.

Sumo Logic

Sumo Logic provides best-in-class cloud monitoring, log management, Cloud SIEM tools, and real-time insights for web and SaaS based apps.

Splunk

Splunk is the key to enterprise resilience. Our platform enables organizations around the world to prevent major issues, absorb shocks and accelerate digital transformation.

Graylog

Graylog is a leading centralized log management solution for capturing, storing, and enabling real-time analysis of terabytes of machine data.

Elasticsearch

Elasticsearch is the leading distributed, RESTful, open source search and analytics engine designed for speed, horizontal scalability, reliability, and easy management. Get started for free.

Logstash

Logstash (part of the Elastic Stack) integrates data from any source, in any format with this flexible, open source collection, parsing, and enrichment pipeline. Download for free.

OpenSearch

OpenSearch is a community-driven, Apache 2.0-licensed open source search and analytics suite that makes it easy to ingest, search, visualize, and analyze data.

OpenSearch Logstash

Logstash is a real-time event processing engine. It’s part of the OpenSearch stack, which includes OpenSearch, Beats, and OpenSearch Dashboards. You can send events to Logstash from many different sources. Logstash processes the events and sends them to one or more destinations. For example, you can send access logs from a web server to Logstash. Logstash extracts useful information from each log and sends it to a destination like OpenSearch. Sending events to Logstash lets you decouple event processing from your app. Your app only needs to send events to Logstash and doesn’t need to know anything about what happens to the events afterwards.

Datadog

Datadog Log Management enables you to collect, monitor, manage, and analyze large volumes of logs as well as unify metrics and traces all in one platform.

New Relic

Deploying log management in context and at scale has never been faster, easier, or more attainable.

Amazon CloudWatch Logs

used-in-book

You can use Amazon CloudWatch Logs to monitor, store, and access your log files from Amazon Elastic Compute Cloud (Amazon EC2) instances, AWS CloudTrail, Route 53, and other sources. CloudWatch Logs enables you to centralize the logs from all of your systems, applications, and AWS services that you use, in a single, highly scalable service. You can then easily view them, search them for specific error codes or patterns, filter them based on specific fields, or archive them securely for future analysis. CloudWatch Logs enables you to see all of your logs, regardless of their source, as a single and consistent flow of events ordered by time. CloudWatch Logs also supports querying your logs with a powerful query language, auditing and masking sensitive data in logs, and generating metrics from logs using filters or an embedded log format.

Google Cloud Logging

Cloud Logging empowers customers to manage, analyze, monitor, and gain insights from log data in real time.

Azure Monitor Logs

Azure Monitor Logs is a centralized software as a service (SaaS) platform for collecting, analyzing, and acting on telemetry data generated by Azure and non-Azure resources and applications. You can collect logs, manage log data and costs, and consume different types of data in one Log Analytics workspace, the primary Azure Monitor Logs resource. This means you never have to move data or manage other storage, and you can retain different data types for as long or as little as you need.

Syslog

syslog is a standard for message logging. It allows separation of the software that generates messages, the system that stores them, and the software that reports and analyzes them. Each message is labeled with a facility code, indicating the type of system generating the message, and is assigned a severity level.
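
The facility code and severity level described above are combined into a single priority value at the start of each syslog message. The following Python sketch shows that arithmetic (priority = facility × 8 + severity). The function names and the subset of facilities listed are illustrative choices, not part of any particular library.

```python
from typing import Tuple

# Severity levels defined by the syslog standard (0 is most severe).
SEVERITIES = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}

# A subset of the standard facility codes, chosen for illustration.
FACILITIES = {
    "kern": 0, "user": 1, "mail": 2, "daemon": 3,
    "auth": 4, "local0": 16, "local4": 20,
}


def encode_priority(facility: str, severity: str) -> int:
    """Combine a facility and a severity into one priority value."""
    return FACILITIES[facility] * 8 + SEVERITIES[severity]


def decode_priority(priority: int) -> Tuple[int, int]:
    """Split a priority value back into (facility code, severity level)."""
    return divmod(priority, 8)
```

For example, `encode_priority("auth", "crit")` gives 34, which would appear as `<34>` at the start of the message, and `decode_priority(165)` gives facility 20 (local4) with severity 5 (notice).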

Fluentd

Fluentd is an open source data collector for unified logging layer. Fluentd allows you to unify data collection and consumption for a better use and understanding of data.

Fluent Bit

Fluent Bit is a super fast, lightweight, and highly scalable logging, metrics, and traces processor and forwarder. It is the preferred choice for cloud and containerized environments.

Beats

The open source platform for building shippers for log, network, infrastructure data, and more — and integrates with Elasticsearch, Logstash & Kibana.

Vector

A lightweight, ultra-fast tool for building observability pipelines.

OpenTelemetry Collector

Vendor-agnostic way to receive, process and export telemetry data.

Availability Metrics Tools

Pingdom

Simplified website monitoring. We help you deliver exceptional customer experience with real-time, actionable insights into your website uptime and performance, so you can keep your users coming back again and again.

Uptime Robot

recommended

Start monitoring in 30 seconds. Use advanced SSL, keyword and cron monitoring. Get notified by email, SMS, Slack and more. Get 50 monitors for FREE!

Better Stack

Ship higher-quality software faster. Be the hero of your engineering teams. Start for free.

Uptime.com

The website monitoring service for unmatched uptime monitoring. Start monitoring your websites, APIs, cron jobs and more. Unlimited email, SMS, Slack notifications.

Datadog Uptime Monitoring

Test your website from locations around the world, or secured private locations, to monitor your uptime from any business critical location. Try it free.

Amazon Route 53 Health Checks

used-in-book

Configure Route 53 to check the health of your resources and to respond to DNS queries using only healthy resources.

Google Cloud Uptime Checks

A public uptime check can issue requests from multiple locations throughout the world to publicly available URLs or Google Cloud resources to see whether the resource responds.

Business Metrics Tools

Google Analytics

Google Analytics gives you the tools you need to better understand your customers. You can then use those business insights to take action, such as improving your website, creating tailored audience lists, and more.

KissMetrics

Simple analytics that track human behavior to increase revenue.

MixPanel

Helping the world learn from its data with event analytics everyone can use. Let’s build.

PostHog

PostHog is the only all-in-one platform for product analytics, feature flags, session replays, experiments, and surveys that's built for developers.

Fathom Analytics

Ditch complex, intrusive analytics for Fathom - a better Google analytics alternative. Experience ease of use, forever data retention & full legal compliance.

Overtracking

Overtracking is a zero-cookies and advanced web analytic tool. No cookies and compliant with GDPR, CCPA and PECR. Google Analytics alternative privacy-free.

Matomo

Matomo's the Google Analytics alternative that protects your data and your customer's privacy. A powerful web analytics platform with 100% data ownership.

Simple Analytics

Simple Analytics is the privacy-first Google Analytics alternative that is 100% GDPR compliant. Give us a try!

Cloudflare Web Analytics

Privacy-first, accurate, essential web analytics - for free. Cloudflare Web Analytics allows you to view and track essential stats on the usage of your website.

Amplitude

Build better products by turning your user data into meaningful insights, using Amplitude's digital analytics platform and experimentation tools.

Heap

Heap is the only digital insights platform that shows everything users do on your site, revealing the 'unknown unknowns' that stay invisible with other tools.

Application Performance Monitoring (APM) Tools

Datadog

recommended

Easily monitor service health metrics, distributed traces, and code performance with cloud-scale Application Performance Monitoring (APM).

New Relic

Application monitoring is a set of tools and software used to monitor and optimize the performance of software applications. Detect and diagnose issues quickly.

Dynatrace

Innovate faster, operate more efficiently, and drive better business outcomes with observability, AI, automation, and application security in one platform.

AppDynamics

Get unified observability across any environment, any stack. Ensure resilience of digital systems, identify problems proactively, find root causes, and resolve them fast.

Amazon CloudWatch

used-in-book

Observe and monitor resources and applications on AWS, on premises, and on other clouds.

Google Cloud Monitoring

Gain visibility into the performance, uptime, and overall health of cloud-powered apps on Google Cloud and other cloud or on-premises environments.

Azure Monitor

Azure Monitor is a comprehensive monitoring solution for collecting, analyzing, and responding to monitoring data from your cloud and on-premises environments. You can use Azure Monitor to maximize the availability and performance of your applications and services. It helps you understand how your applications are performing and allows you to manually and programmatically respond to system events. Azure Monitor collects and aggregates the data from every layer and component of your system across multiple Azure and non-Azure subscriptions and tenants. It stores it in a common data platform for consumption by a common set of tools that can correlate, analyze, visualize, and/or respond to the data. You can also integrate other Microsoft and non-Microsoft tools.

Real User Monitoring (RUM) Tools

Datadog RUM

Datadog’s Real User Monitoring enables IT teams with user data and metrics to optimize frontend performance. Learn how to get started with RUM and begin enhancing performance.

New Relic Browser Monitoring

Unlock a better user experience with browser monitoring from New Relic. The world’s most deployed real user monitoring (RUM) solution.

Dynatrace RUM

Learn about how real user monitoring (RUM) can help your organization capture all relevant data of your users, providing a true picture into their experience.

AppDynamics Browser RUM

Optimize hybrid and on-prem application performance with full-stack observability linked to business performance.

Sematext Experience

Detect anomalies in real time & receive alerts when end-user experience is affected by website performance. Start now!

Akamai mPulse

Measure user experience and performance data to improve site performance.

Raygun RUM

Monitor Real User Sessions & Improve The Frontend Performance Of Your Software For Both Web & Mobile Using Raygun's Real User Monitoring Tool.

Server Metrics Tools

Nagios

Enterprise Grade Monitoring Powered By Open Source. Built on over 25 years of monitoring experience, the Nagios Core Services Platform provides insightful monitoring dashboards, time-saving monitoring wizards, and unmatched ease of use. Use it for free indefinitely.

Icinga

Tackle the monitoring challenge » Get a complete overview of all your systems and applications. Flexible, scalable and automated monitoring.

Zabbix

Zabbix is an enterprise-class, open-source monitoring solution that makes network and application monitoring simple.

Domotz

Domotz is a network monitoring software designed for IT professionals and MSPs. Gain real-time visibility on any IT infrastructure. Start Your Free Trial Today!

Sensu

The Observability Pipeline that delivers monitoring as code on any cloud.

Metrics Collection Tools

collectd

collectd is a daemon collecting system and application performance metrics periodically and provides mechanisms to store the values in a variety of ways, for example in RRD files.

Fluent Bit

Fluent Bit is a super fast, lightweight, and highly scalable logging, metrics, and traces processor and forwarder. It is the preferred choice for cloud and containerized environments.

StatsD

A network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP or TCP and sends aggregates to one or more pluggable backend services (e.g., Graphite).
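
To make the description above concrete, here is a minimal Python sketch of a client that sends a counter or timer to a StatsD daemon over UDP. It assumes the daemon listens on the default port 8125 on localhost. The metric names and the helper functions `format_metric` and `send_metric` are hypothetical, chosen only for illustration.

```python
import socket


def format_metric(name: str, value, metric_type: str) -> str:
    """Build one StatsD line: <name>:<value>|<type>.

    Common types are "c" (counter), "ms" (timer), and "g" (gauge).
    """
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric line as a single UDP datagram (fire and forget).

    The host and port are assumptions; adjust them to match your daemon.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Example usage (hypothetical metric names):
#   send_metric(format_metric("checkout.requests", 1, "c"))
#   send_metric(format_metric("checkout.latency", 320, "ms"))
```

Because UDP is connectionless, the application does not wait for the daemon to acknowledge the metric, so instrumentation adds little overhead, at the cost of occasionally losing a data point.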

Telegraf

Telegraf is an open source server agent that makes it easy to collect metrics, logs, and data. Download the latest Telegraf and get release updates free!

Beats

The open source platform for building shippers for log, network, infrastructure data, and more — and integrates with Elasticsearch, Logstash & Kibana.

OpenTelemetry

recommended

High-quality, ubiquitous, and portable telemetry to enable effective observability.

Metrics Backend Tools

Open Time Series Database (OpenTSDB)

Store and serve massive amounts of time series data without losing granularity.

Prometheus

recommended

An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.

Timescale

Engineered to handle demanding workloads, like time series, vector, events, and analytics data. Built on PostgreSQL, with expert support at no extra charge.

InfluxDB

Manage all types of time series data in a single, purpose-built database. Optimized for speed in any environment in the cloud, on-premises, or at the edge.

Graphite

Graphite is an enterprise-ready monitoring tool that runs equally well on cheap hardware or Cloud infrastructure. Teams use Graphite to track the performance of their websites, applications, business services, and networked servers. It marked the start of a new generation of monitoring tools, making it easier than ever to store, retrieve, share, and visualize time-series data.

QuestDB

QuestDB is the world's fastest growing open-source time-series database. It offers massive ingestion throughput, millisecond queries, powerful time-series SQL extensions, and scales well with minimal and maximal hardware. Save costs with better performance and efficiency.

Metrics Frontend Tools

Grafana

Grafana is the open source analytics & monitoring solution for every database.

Perses

An open specification for dashboards. The open dashboard tool for Prometheus and other data sources.

Kibana

Download Kibana or the complete Elastic Stack for free and start visualizing, analyzing, and exploring your data with Elastic in minutes.

OpenSearch Dashboards

OpenSearch Dashboards is the user interface that lets you visualize your OpenSearch data and run and scale your OpenSearch clusters.

Observability Tools

Honeycomb

recommended

Honeycomb is the only observability platform you need. Get all your data in one unified platform with limitless possibilities.

SigNoz

SigNoz is an open-source observability tool powered by OpenTelemetry. Get APM, logs, traces, metrics, exceptions, & alerts in a single tool.

Uptrace

Uptrace is an OpenTelemetry-based observability platform that helps you monitor, understand, and optimize complex distributed systems. Think of DataDog or NewRelic, but at a fraction of the cost and with a fixed budget.

ServiceNow Observability

Break down silos to resolve issues quickly across teams. Integrate observability capabilities into your existing workflows for alerting and incident management.

Chronosphere

Chronosphere is the only cloud native observability platform that helps teams quickly resolve incidents while controlling costs. Learn how.

OpenObserve

Logs, Metrics, Traces and more in one platform. Streamline your operations with worry-free observability and simplify your observability setup in just 2 minutes. 140x lower storage cost than your existing observability tools.

IBM Instana

Harness the power of AI and automation to proactively solve issues across the application stack with IBM Instana Observability.

Distributed Tracing Tools

Zipkin

Zipkin is a distributed tracing system. It helps gather timing data needed to troubleshoot latency problems in service architectures. Features include both the collection and lookup of this data. If you have a trace ID in a log file, you can jump directly to it. Otherwise, you can query based on attributes such as service, operation name, tags and duration. Some interesting data will be summarized for you, such as the percentage of time spent in a service, and whether or not operations failed.

Jaeger

Monitor and troubleshoot workflows in complex distributed systems.

AWS X-Ray

Analyze and debug production and distributed applications.

Google Cloud Trace

A distributed tracing system that collects latency data from your applications and displays it in the Google Cloud console.

On-Call Tools

PagerDuty

Transform critical operations with PagerDuty's AI first Operations Platform. Harness agentic AI and automation to accelerate work and build resilience.

Splunk On-Call

Accelerate incident response with Splunk On-Call: automated scheduling, intelligent routing, and machine learning mean less downtime and more insights.

Opsgenie

Opsgenie is the #1 alerting and incident response tool. Eliminate downtime, enhance team coordination, and improve response times. Get started today!

Blameless

Blameless is an incident management workflow solution that carries teams through a codified playbook from start to finish in one fluid motion.

Squadcast

Squadcast is a full stack Reliability Automation and Incident Response Platform that's designed to help you promote SRE best practices. Try it for free now!

All Quiet

Resolve incidents faster. Real-time alerting. Mobile apps for iOS and Android. On-call scheduling and escalation policies for tech teams.

Rootly

AI-powered on-call and incident response.

Issue Tracker Tools

Jira

Make the impossible, possible in Jira. Plan, track, and release world-class software with the number one project management tool for agile teams.

Linear

recommended

Linear streamlines issues, projects, and roadmaps. Purpose-built for modern product development.

GitHub Issues

recommended

Give your developers flexible features for project management that adapts to any team, project, and workflow—all alongside your code.

Asana

Work anytime, anywhere with Asana. Keep remote and distributed teams, and your entire organization, focused on their goals, projects, and tasks with Asana.

Code Quality Tools

Codacy

Build clean, secure code efficiently and fearlessly with Codacy Platform.

SonarQube

Empower development teams with a code quality & security solution that deeply integrates into your enterprise environment that enables you to deploy Clean Code securely, consistently and reliably.

Code Climate

Code Climate is a Software Engineering Intelligence (SEI) solutions partner to engineering executives at enterprise organizations. We empower leaders to….

Coverity

Coverity Scan is a service by which Black Duck provides the results of analysis on open source coding projects to open source code developers that have registered their products with Coverity Scan.

Snyk

Enable developers to build securely from the start while giving security teams complete visibility and comprehensive controls.

Developer Productivity Tools

Pluralsight

Pluralsight helps organizations, teams, and individuals build better products with online courses and data-driven insights that fuel skill development and improve processes.

Faros AI

Faros AI is a copilot for enterprise technology organizations. We turn engineering productivity metrics into actionable engineering intelligence, helping leaders and teams maximize value and ROI through great developer experiences and outcomes.

LinearB

LinearB is the leading platform for Software Engineering Intelligence, helping engineering leaders improve efficiency and align R&D investments with business goals.

Swarmia

Build better software faster with insights that power your whole engineering organization. Get started with a demo or a free 14-day trial.

Oobeya

Accelerate the software development process with Oobeya, engineering intelligence and DORA Metrics tracking platform, complete operational visibility and optimization.
