Simplifying Monitoring with Golden Signals

Key Points

  • The traditional approach to monitoring complex microservice environments forces owners to chase numerous technology‑specific metrics and call in an expert for each technology, which slows root‑cause identification while end users keep experiencing the latency problem.
  • Site Reliability Engineering (SRE) recommends focusing on only four “golden signals” – latency, errors, traffic, and saturation – rather than tracking every possible metric across heterogeneous services.
  • By applying the golden signals and leveraging APM tools that surface the immediate downstream dependencies (one hop away), teams can quickly eliminate services that are healthy and narrow the search space.
  • This streamlined, signal‑driven monitoring dramatically reduces mean time to recovery (MTTR) and helps maintain consistent end‑user performance despite a diverse tech stack.
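
As a rough illustration of the four signals listed above, the sketch below derives them from one observation window of request records. The `Request` record, the `golden_signals` function, the choice of a 95th-percentile latency, and the `in_use`/`capacity` inputs are all assumptions made for this example; the video does not prescribe how each signal is measured.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    """One handled request (hypothetical record shape for this sketch)."""
    duration_ms: float  # time taken to service the request
    ok: bool            # False if the request ended in an error


def golden_signals(requests: Sequence[Request], window_s: float,
                   in_use: float, capacity: float) -> dict:
    """Summarize one observation window as the four golden signals."""
    n = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    # Latency: 95th percentile by the nearest-rank method (an assumed choice).
    rank = (95 * n + 99) // 100  # integer ceil(0.95 * n)
    latency_p95 = durations[rank - 1] if n else 0.0
    # Errors: fraction of requests that failed.
    error_rate = sum(1 for r in requests if not r.ok) / n if n else 0.0
    # Traffic: demand placed on the system, as requests per second.
    traffic_rps = n / window_s
    # Saturation: utilization versus maximum capacity.
    saturation = in_use / capacity
    return {
        "latency_p95_ms": latency_p95,
        "error_rate": error_rate,
        "traffic_rps": traffic_rps,
        "saturation": saturation,
    }


# Example window: 20 requests over 10 seconds, one of which failed.
sample = [Request(duration_ms=10.0 * i, ok=(i != 7)) for i in range(1, 21)]
signals = golden_signals(sample, window_s=10.0, in_use=45.0, capacity=60.0)
print(signals)
```

Because every service reports the same four numbers, they can be compared across services regardless of the technology each one is built on.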

Full Transcript

**Source:** [https://www.youtube.com/watch?v=rnnhtzIgjvQ](https://www.youtube.com/watch?v=rnnhtzIgjvQ)
**Duration:** 00:05:12
Today I'd like to talk a little about the site reliability, or SRE, discipline and how we can apply it to simplifying monitoring for complex modern applications. This will help us identify root causes more quickly and drastically reduce the mean time to recovery, so that we can maintain the end-user performance that we want for our applications.

So first, let's take a look at what happens before we've applied these SRE principles to our monitoring. Let's say that I'm the owner of an application, and I've gotten an alert that says that I'm having a latency issue. Now, my application is really critical for this business, so I need to find the root cause quickly. But because I'm part of this complex microservice topology, it can be really difficult to figure out where exactly the root cause is coming from.

And to make things more complex, all of my dependencies could be based on different technologies. So let's say one is built on Node.js, one is a Db2 database, another is written in Swift, and so on. All of these have different metrics that are typically monitored, and I may not be an expert in any of these technologies, so it may be difficult for me personally to go in and figure out what the problem is. I would have to call in an expert for each of these technologies. As you can imagine, this is time-consuming: everyone has to go through their service and figure out if there is a problem or if I need to keep going downstream. And all the while, my users are still experiencing this latency issue.

Now, what if there was a better way? This is what we can learn from the SRE discipline, which tells us that there are really only four key performance indicators that we need to monitor, not all the different metrics for each technology. We call these the golden signals.

The golden signals are latency, which is the time it takes to service a request; errors, which is a view of the request error rate; traffic, which is the demand placed on the system; and saturation, which is our utilization versus max capacity.

Now let's go back to our initial example and see how this would work applying the golden signals. So my service, we'll call it service A: we know we have a latency issue. We know that latency is typically a symptom, and if we examine the service, let's say we're not seeing any of the causes, so we know we have to keep looking downstream. But we don't want to go back to this complicated microservice topology and try to figure it all out. Some APM tools can help you out with this by identifying only the services that are one hop away from the service in question.

So let's say we have services B, C, and D that are connected to my service A, which is having the problem. No matter what technology these services are built on, all we need to do is go look at the golden signals. Let's say we look at the golden signals for service B and everything looks fine, so we know service B is not the problem. Service C, same scenario: we don't see any issues, so we can eliminate that as the problem. Now service D: let's say that we're seeing an issue with our saturation, which is trending upwards. So right there, after only a few minutes, we've identified that service D is likely our root cause.

Now, instead of having to pull in the experts for each of these different services, we can go directly to service D and let them know that we've identified that they're likely a cause of the issue we're having, and they can go about fixing it. And what's even better is that if they're using golden signals to monitor their service, it's very likely they've already identified this and are already working on the fix.

So as you can see, this process drastically improves the time that it takes to go through this complex topology and many different technologies to figure out where your root cause is and identify exactly how to fix it. When you're choosing an APM tool, make sure that it offers the ability to use these golden signals and this one-hop dependency view, so that you can quickly identify the root causes and get your service restored as quickly as possible.

Thanks for watching this video on simplifying monitoring for modern applications.
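
The one-hop walkthrough in the transcript (check services B, C, and D, eliminate the healthy ones, and notice that D's saturation is trending upwards) can be sketched roughly as follows. Everything here is hypothetical: the thresholds in `LIMITS`, the `is_trending_up` heuristic, the service names, and the sample readings are illustrative assumptions, not values from the video or from any particular APM tool.

```python
from statistics import mean

# Hypothetical limits; real values depend on each service's objectives.
# Traffic is left out because it measures demand and has no fixed "bad" level.
LIMITS = {"latency_ms": 500.0, "error_rate": 0.01, "saturation": 0.80}


def is_trending_up(samples, min_rise=0.10):
    """True if the later half of the samples averages at least
    min_rise above the earlier half (a deliberately simple heuristic)."""
    half = len(samples) // 2
    if half == 0:
        return False
    return mean(samples[half:]) - mean(samples[:half]) >= min_rise


def suspect_signals(signals):
    """Return the golden signals that look unhealthy for one service."""
    flagged = [name for name, limit in LIMITS.items()
               if signals[name][-1] > limit]
    # Rising saturation is suspicious even before it crosses the limit.
    if "saturation" not in flagged and is_trending_up(signals["saturation"]):
        flagged.append("saturation")
    return flagged


def triage_one_hop(dependencies):
    """Keep only the one-hop dependencies that have a flagged signal."""
    result = {}
    for service, signals in dependencies.items():
        flagged = suspect_signals(signals)
        if flagged:
            result[service] = flagged
    return result


# Recent readings (oldest first) for the services one hop from service A.
one_hop = {
    "service-b": {"latency_ms": [120, 118, 125, 121],
                  "error_rate": [0.001, 0.001, 0.002, 0.001],
                  "saturation": [0.40, 0.41, 0.39, 0.40]},
    "service-c": {"latency_ms": [80, 82, 79, 81],
                  "error_rate": [0.000, 0.001, 0.000, 0.001],
                  "saturation": [0.30, 0.31, 0.30, 0.32]},
    "service-d": {"latency_ms": [130, 140, 150, 160],
                  "error_rate": [0.002, 0.002, 0.002, 0.002],
                  "saturation": [0.45, 0.55, 0.68, 0.78]},
}
suspects = triage_one_hop(one_hop)
print(suspects)
```

With these sample readings, services B and C drop out and only service D remains, flagged on saturation, which mirrors the elimination described in the video. If the flagged service shows no cause of its own, the same check can be repeated on its one-hop dependencies.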