Simplifying Monitoring with Golden Signals

5m • devops • tutorial • intermediate

Key Points

The traditional approach to monitoring complex microservice environments forces application owners to chase numerous technology‑specific metrics and call in an expert for each technology, which slows root‑cause identification while end users continue to experience latency.
Site Reliability Engineering (SRE) recommends focusing on only four “golden signals” – latency, errors, traffic, and saturation – rather than tracking every possible metric across heterogeneous services.
By applying the golden signals and leveraging APM tools that surface the immediate downstream dependencies (one hop away), teams can quickly eliminate services that are healthy and narrow the search space.
This streamlined, signal‑driven monitoring dramatically reduces mean time to recovery (MTTR) and helps maintain consistent end‑user performance despite a diverse tech stack.
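As a rough illustration of the four signals named above, the sketch below summarizes one observation window of requests into latency, error rate, traffic, and saturation. The `Request` and `GoldenSignals` types, the `compute_signals` function, and the sample numbers are all hypothetical and do not come from the video; a real APM tool would typically report percentiles rather than a simple mean.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    duration_ms: float  # time taken to service the request
    failed: bool        # True if the request ended in an error


@dataclass
class GoldenSignals:
    latency_ms: float   # mean time to service a request
    error_rate: float   # failed requests divided by total requests
    traffic_rps: float  # requests per second over the window
    saturation: float   # utilization divided by max capacity


def compute_signals(requests: List[Request], window_s: float,
                    in_use: float, capacity: float) -> GoldenSignals:
    """Summarize one observation window into the four golden signals."""
    total = len(requests)
    latency = sum(r.duration_ms for r in requests) / total if total else 0.0
    errors = sum(1 for r in requests if r.failed) / total if total else 0.0
    return GoldenSignals(
        latency_ms=latency,
        error_rate=errors,
        traffic_rps=total / window_s,
        saturation=in_use / capacity,
    )


# Made-up sample: four requests observed over a two-second window,
# with 30 of 40 (hypothetical) worker slots in use.
window = [Request(100.0, False), Request(300.0, True),
          Request(200.0, False), Request(200.0, False)]
signals = compute_signals(window, window_s=2.0, in_use=30.0, capacity=40.0)
print(signals)
```

Because every service can be reduced to the same four numbers, the owner of one service can read another team's dashboard without knowing the underlying technology.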

Full Transcript

**Source:** [https://www.youtube.com/watch?v=rnnhtzIgjvQ](https://www.youtube.com/watch?v=rnnhtzIgjvQ) **Duration:** 00:05:12

0:00 Today I'd like to talk a little about the site reliability engineering, or SRE, discipline and how we can apply it to simplifying monitoring for complex modern applications. This will help us identify root causes more quickly and drastically reduce the mean time to recovery, so that we can maintain the end-user performance that we want for our applications.

0:18 So first, let's take a look at what happens before we've applied these SRE principles to our monitoring. Let's say that I'm the owner of an application, and I've gotten an alert that says that I'm having a latency issue. Now, my application is really critical for this business, and so I need to find the root cause quickly. But because I'm part of this complex microservice topology, it can be really difficult to figure out where exactly the root cause is coming from.

0:53 And to make things more complex, all of my dependencies could be based on different technologies. Let's say one is built on Node.js, one is a Db2 database, another is written in Swift, and so on. Now, all of these have different metrics that are typically monitored, and I may not be an expert in any of these different technologies, so it may be difficult for me personally to go in and figure out what the problem is. So I would have to call in an expert for each of these technologies. Now, as you can imagine, it is time-consuming for everyone to go through their service and figure out if there is a problem or if I need to keep going downstream, and all the while my users are still experiencing this latency issue.

1:41 Now, what if there was a better way? This is what we can learn from the SRE discipline, which tells us that there are really only four key performance indicators that we need to monitor, not all the different metrics for each technology. We call these the golden signals.

2:05 So the golden signals are latency, which is the time it takes to service a request; errors, which is a view of the request error rate; traffic, which is the demand placed on the system; and saturation, which is our utilization versus max capacity.

2:24 Now let's go back to our initial example and see how this would work applying the golden signals. So my service, we'll call it service A: we know we have a latency issue. Now, we know that latency is typically a symptom, and if we examine the service, let's say we're not seeing any of the causes, so we know we have to keep looking downstream. But we don't want to go back to this complicated microservice topology and try and figure it all out.

2:58 So some APM tools can help you out with this by identifying only the services that are one hop away from my service in question. Let's say we have services B, C, and D that are connected to my service A that's having the problem. Now, no matter what technology these services are built on, all we need to do is go look at the golden signals.

3:24 So let's say we look at the golden signals for service B, and everything looks fine, so we know service B is not the problem. And let's say service C, same scenario: we don't see any issues, so we can eliminate that as the problem. Now, service D: let's say that we're seeing an issue with our saturation, which is trending upwards. So right there, after only a few minutes, we've identified that service D is likely our root cause.

3:57 So now, instead of having to pull in the experts for each of these different services, we can go directly to service D and let them know that we've identified that they're likely a cause of this issue that we're having, and they can go about fixing it. And what's even better is, if they're using golden signals to [monitor] their service, it's very likely they've already identified this and are already working on the fix.

4:21 So as you can see, this process drastically improves the time that it takes to go through this complex topology and many different technologies to figure out where your root cause is and identify exactly how to fix it. So when you're identifying an APM tool to use, make sure that it offers the ability to use these golden signals and this one-hop dependency view, so that you can quickly identify the root causes and get your service restored as quickly as possible.

4:57 Thanks for watching this video on simplifying monitoring for modern applications.
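The service A walkthrough from the video can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `ONE_HOP` dependency map stands in for what an APM tool would report, the readings in `SIGNALS` are invented, and the `THRESHOLDS` are arbitrary. The video describes saturation as "trending upwards"; this sketch simplifies that to a fixed limit, and it leaves traffic unchecked because traffic measures demand rather than health.

```python
from typing import Dict, List

# Hypothetical limits; real values depend on each service's objectives.
THRESHOLDS: Dict[str, float] = {
    "latency_ms": 500.0,
    "error_rate": 0.05,
    "saturation": 0.80,
}

# Hypothetical one-hop dependency map, as an APM tool might report it.
ONE_HOP: Dict[str, List[str]] = {"A": ["B", "C", "D"]}

# Invented current readings for each downstream service.
SIGNALS: Dict[str, Dict[str, float]] = {
    "B": {"latency_ms": 120.0, "error_rate": 0.01, "traffic_rps": 50.0, "saturation": 0.40},
    "C": {"latency_ms": 90.0, "error_rate": 0.00, "traffic_rps": 20.0, "saturation": 0.35},
    "D": {"latency_ms": 180.0, "error_rate": 0.02, "traffic_rps": 400.0, "saturation": 0.93},
}


def breaches(signals: Dict[str, float]) -> List[str]:
    """Return the names of the signals that exceed their limit."""
    return [name for name, limit in THRESHOLDS.items() if signals[name] > limit]


def triage(service: str) -> Dict[str, List[str]]:
    """Check only the services one hop away and keep the unhealthy ones."""
    suspects: Dict[str, List[str]] = {}
    for dependency in ONE_HOP.get(service, []):
        exceeded = breaches(SIGNALS[dependency])
        if exceeded:
            suspects[dependency] = exceeded
    return suspects


suspects = triage("A")
print(suspects)  # {'D': ['saturation']}
```

Services B and C drop out because none of their signals exceed a limit, which leaves service D as the likely cause. If D itself looked healthy on everything except latency, the same check could be repeated one hop further downstream from D.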