Learning Library

One Off‑by‑One Bug, Weeks Lost

Key Points

  • A sporadic inconsistency in a consolidated resource view turned out to be caused by a simple off‑by‑one error (`while (x < n)` should have been `<=`), which took three weeks of ineffective code inspections to uncover.
  • Code inspections are useful for an initial glance but quickly lose value when the bug is subtle; thorough boundary‑condition testing is essential to catch such edge‑case mistakes.
  • Assuming a defect can be reproduced on demand is unrealistic—reliable logging and production‑grade debugging tools are far more effective than unit‑test debuggers for tracking intermittent issues.
  • When debugging asynchronous systems, load testing is critical because varying loads can change observed behavior and mask the underlying problem.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=4OGr15ikMrY](https://www.youtube.com/watch?v=4OGr15ikMrY)

**Duration:** 00:04:02

## Sections

- [00:00:00](https://www.youtube.com/watch?v=4OGr15ikMrY&t=0s) **Untitled Section**
- [00:03:07](https://www.youtube.com/watch?v=4OGr15ikMrY&t=187s) **Asynchronous Debugging and Load Testing** - The speaker advises using varied load‑testing scenarios to debug asynchronous processes effectively and recommends tightly integrating, or even rotating, developers and testers to prevent the issues they experienced.

## Full Transcript

[0:00] Welcome to "Lessons Learned," a series where we share our biggest mistakes so you don't make the same ones. Today's lessons come from my days as a developer on operating systems.

[0:08] A test team had reported a problem where one of the views, which showed a consolidated view of different resources, was inconsistent. For example, the network drive would say there are three items, a file system driver would say there are six, and a CD would say there's one. But the consolidated view of those would say there were only eight.

[0:35] But no one could reproduce this on demand; it was semi-random. And so I decided to first try finding the problem via code inspection. So I looked at the code that was related to this, looking for logic flaws. And I'd submit a fix, but invariably a day or two later it would be rejected. And this pattern repeated for literally three weeks: apply a fix, get it rejected, apply a fix, get it rejected.

[1:00] Finally, a tester contacted me and said, "Dan, I've got in the lab right now a system which is in the state that we've been trying to isolate. You should come take a look." And so I went there and used a kernel-level debugger to look at the state of the machine, because I had found that regular debuggers would actually perturb the result. And finally, after hours of debugging, I found the problem was as simple as this:

[1:27] `while (x < n)`

A statement like this should have been less than or equal. So that one character cost me essentially three weeks of debugging.

[1:40] So what lessons did I learn from that? The first one I learned is that code inspections have their limits. It's a really good approach initially, when you're just trying to look at the initial problem and maybe you're not familiar with the code. But very quickly the juice from that is not worth the squeeze.

[2:00] The second is bounds. And this might seem like a "Programmer 101" sort of issue, but you should have better bounds testing, especially around the beginning and end of a limit. So test here, here, here and here. Of course you need to test elsewhere, but those boundary conditions are the ones that invariably cause problems.

[2:24] The other lesson I learned is that the expectation that you can reproduce a problem on demand is simply not very realistic. It starts because people want to use the debug tools that are most effective for solving problems. But in a production environment, the debug tools that you use during your initial unit testing don't really carry over well. On the other hand, logs, whether you use them during development time or production time, provide you key information that can help you solve a problem.

[3:08] Another thing that I learned is that when you're doing debugging that involves asynchronous processes, you really need to focus on your load testing. Because it's asynchronous, the load is going to affect what you're seeing as a result, which makes it more likely that you're going to have problems. So you should do load testing not just by having people use the system, but by creating test cases that drive the system at different load levels at different times.

[3:39] And finally, and probably one of the most important lessons I learned, is that the test and development teams need to work closely. In fact, what I recommend is that they be on the same team, or at a minimum that you do a tour of duty in each other's role. So that way you can learn from one of these lessons and not have to suffer like I did.
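
The bounds-testing lesson can be illustrated with a small sketch. The talk only shows the condition `while (x < n)`; everything else here is assumed for illustration: that `n` is the index of the last entry to include, that the loop sums per-resource counts, and the helper names `total_buggy`, `total_fixed`, `boundary_cases`, and `failing_boundaries`. The item counts 3, 6, and 1 are borrowed from the talk's example, but this sketch does not claim to reproduce the original defect (it yields 9, not the 8 the speaker observed).

```python
def total_buggy(counts, n):
    """Sum counts[0] through counts[n], but with `<` where `<=` was intended."""
    total = 0
    x = 0
    while x < n:  # stops at index n - 1, so counts[n] is never added
        total += counts[x]
        x += 1
    return total


def total_fixed(counts, n):
    """Sum counts[0] through counts[n] inclusive."""
    total = 0
    x = 0
    while x <= n:
        total += counts[x]
        x += 1
    return total


def boundary_cases(size):
    """Indices at, and next to, both ends of the valid range 0..size-1."""
    return sorted({0, 1, size - 2, size - 1} & set(range(size)))


def failing_boundaries(total_fn, counts):
    """Boundary indices where total_fn disagrees with a reference sum."""
    return [n for n in boundary_cases(len(counts))
            if total_fn(counts, n) != sum(counts[: n + 1])]


counts = [3, 6, 1]  # network drive, file system, CD (figures from the talk)
print("buggy fails at:", failing_boundaries(total_buggy, counts))
print("fixed fails at:", failing_boundaries(total_fixed, counts))
```

In this sketch the missing final entry goes unnoticed whenever that entry happens to be zero, for example `total_buggy([3, 6, 0], 2)` matches the correct total. That is one way an off-by-one error can look intermittent, though the talk does not say what made the original bug sporadic.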
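
The point about logs can be sketched in the same spirit. This is a minimal illustration using Python's standard `logging` module, not the tooling the speaker used; the logger name `consolidated_view`, the functions `consolidate` and `check_view`, and the dictionary of per-source counts are all hypothetical. The idea is to record the inputs alongside the result, so that when a mismatch appears in a system nobody can reproduce on demand, the state that produced it is already captured.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("consolidated_view")  # hypothetical logger name


def consolidate(source_counts):
    """Sum per-source item counts and log both the inputs and the result."""
    total = sum(source_counts.values())
    logger.info("consolidated total=%d sources=%s", total, source_counts)
    return total


def check_view(source_counts, displayed_total):
    """Return True if the displayed total matches; log the full state if not."""
    expected = consolidate(source_counts)
    if displayed_total != expected:
        logger.warning(
            "view mismatch: displayed=%d expected=%d sources=%s",
            displayed_total, expected, source_counts,
        )
        return False
    return True


# Figures from the talk: the sources report 3, 6 and 1, but the view shows 8.
check_view({"network": 3, "filesystem": 6, "cd": 1}, 8)
```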
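
The advice to drive an asynchronous system at different load levels can also be sketched. The example below is an assumption-laden toy, not the operating-system code from the talk: `add_items`, `run_at_load`, and `inconsistent_levels` are hypothetical names, and `asyncio.sleep(0)` stands in for real I/O so that the race is deterministic here, whereas in a real system it would depend on timing. Each worker reads a shared total, yields, then writes back a possibly stale value, so the result is correct at a load of one and wrong at higher loads.

```python
import asyncio


async def add_items(view, count):
    """Add `count` to the shared total using an unsafe read-then-write."""
    current = view["total"]           # read the shared state
    await asyncio.sleep(0)            # yield to other tasks, as real I/O would
    view["total"] = current + count   # write back a possibly stale value


async def run_at_load(workers):
    """Run `workers` concurrent updates of 1 item each; return the final total."""
    view = {"total": 0}
    await asyncio.gather(*(add_items(view, 1) for _ in range(workers)))
    return view["total"]


def inconsistent_levels(levels):
    """Load levels at which the final total differs from the expected total."""
    return [w for w in levels if asyncio.run(run_at_load(w)) != w]


print("inconsistent at loads:", inconsistent_levels([1, 2, 10, 100]))
```

A single-user test (load of 1) passes here, which is why sweeping several load levels from a test case, rather than relying on manual use, is what exposes the problem.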