# One Off‑by‑One Bug, Weeks Lost

**Source:** [https://www.youtube.com/watch?v=4OGr15ikMrY](https://www.youtube.com/watch?v=4OGr15ikMrY)
**Duration:** 00:04:02

## Summary

- A sporadic inconsistency in a consolidated resource view turned out to be caused by a simple off‑by‑one error (`while (x < n)` should have been `<=`), which took three weeks of ineffective code inspections to uncover.
- Code inspections are useful for an initial glance but quickly lose value when the bug is subtle; thorough boundary‑condition testing is essential to catch such edge‑case mistakes.
- Assuming a defect can be reproduced on demand is unrealistic; reliable logging and production‑grade debugging tools are far more effective than unit‑test debuggers for tracking intermittent issues.
- When debugging asynchronous systems, load testing is critical because varying loads can change observed behavior and mask the underlying problem.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=4OGr15ikMrY&t=0s) **Untitled Section**
- [00:03:07](https://www.youtube.com/watch?v=4OGr15ikMrY&t=187s) **Asynchronous Debugging and Load Testing** - The speaker advises using varied load‑testing scenarios to debug asynchronous processes effectively and recommends tightly integrating, or even rotating, developers and testers to prevent the issues they experienced.

## Full Transcript
Welcome to "Lessons Learned," a series where we share our
biggest mistakes so you don't make the same ones. Today's lessons come from my days as a
developer on operating systems. A test team had reported a problem where a view that showed a
consolidated count of different resources was inconsistent. For example,
the network drive would say there are three items, a file system driver would say there are six, and
a CD would say there's one. But the consolidated view would say there were only eight.
No one could reproduce this on demand; it was semi-random. And so I decided to first try
finding the problem via code inspection. So I looked at the code that was related to this,
looking for logic flaws. I'd submit a fix, but invariably a day or two later it would
be rejected. And this pattern repeated literally for three weeks: apply a fix, get it rejected,
apply a fix, get it rejected. Finally, a tester contacted me and said, "Dan, I've got in the lab right
now a system which is in the state that we've been trying to isolate. You should come take a look."
And so I went there, and I used a kernel-level debugger to look at the state of
the machine, because I found that regular debuggers would actually perturb the result.
And finally, after hours of debugging, I found the problem was as simple as this:
`while (x < n)`.
A statement like this should have been less than or equal (`<=`).
So that one character cost me essentially three weeks of debugging.
So what lessons did I learn from that? The first one I learned is that code inspections
have their limits. It's a really good approach initially, when you're just trying to look at the
initial problem and maybe you're not familiar with the code. But the juice quickly
stops being worth the squeeze. The second is bounds. And this might seem like a "programmer 101"
sort of issue, but you should have better bounds testing, especially around the beginning
and end of a limit. So test right at each boundary and just on either side of it. Of course you need to test elsewhere, but those
boundary conditions are the ones that invariably cause problems. The other lesson I learned is
that expecting to reproduce a problem on demand is just not realistic.
That expectation starts because people want to use the debug tools that are most effective for solving problems.
But in a production environment, the debug tools that you use during your initial unit
testing don't really carry over well. On the other hand, logs, whether you use them during
development or in production, provide key information that can help you solve a problem.
Another thing that I learned is that when you're doing debugging that involves
asynchronous processes, you really need to focus on your load testing. Because the process is asynchronous,
the load is going to affect what you see as a result, which makes it more
likely that you're going to have problems. So you should do load testing not just with people using
the system, but by creating test cases that drive the system at different load levels at different times.
And finally, and probably one of the most important lessons I learned,
is that the test team and the developer team need to work closely. In fact, what I recommend is
that they be on the same team, or at a minimum that you do a tour of duty in each other's role.
That way you can learn from these lessons and not have to suffer like I did.