# One Off‑by‑One Bug, Weeks Lost

**Source:** [https://www.youtube.com/watch?v=4OGr15ikMrY](https://www.youtube.com/watch?v=4OGr15ikMrY)
**Duration:** 00:04:02

## Summary

- A sporadic inconsistency in a consolidated resource view turned out to be caused by a simple off‑by‑one error (`while (x < n)` should have been `<=`), which took three weeks of ineffective code inspections to uncover.
- Code inspections are useful for an initial glance but quickly lose value when the bug is subtle; thorough boundary‑condition testing is essential to catch such edge‑case mistakes.
- Assuming a defect can be reproduced on demand is unrealistic; reliable logging and production‑grade debugging tools are far more effective than unit‑test debuggers for tracking intermittent issues.
- When debugging asynchronous systems, load testing is critical because varying loads can change observed behavior and mask the underlying problem.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=4OGr15ikMrY&t=0s) **Untitled Section**
- [00:03:07](https://www.youtube.com/watch?v=4OGr15ikMrY&t=187s) **Asynchronous Debugging and Load Testing** - The speaker advises using varied load‑testing scenarios to debug asynchronous processes effectively and recommends tightly integrating, or even rotating, developers and testers to prevent the issues they experienced.

## Full Transcript
Welcome to "Lessons Learned," a series where we share our
biggest mistakes so you don't make the same ones. Today's lessons come from my days as a
developer on operating systems. A test team had reported a problem where a view that showed a
consolidated count of different resources was inconsistent. For example,
the network drive would say there are three items, a file system driver would say there are six, and
a CD would say there's one. But the consolidated view would say there were only eight.
No one could reproduce this on demand; it was semi-random. And so I decided to first try
finding the problem via code inspection. So I looked at the code that was related to this,
looking for logic flaws. I'd submit a fix, but invariably a day or two later it would
be rejected. And this pattern repeated literally for three weeks: apply a fix, get it rejected,
apply a fix, get it rejected. Finally, a tester contacted me and said, "Dan, I've got in the lab right
now a system which is in the state that we've been trying to isolate. You should come take a look."
And so I went there, and I used a kernel-level debugger to look at the state of
the machine, because I found that regular debuggers would actually perturb the result.
And finally, after hours of debugging, I found the problem was as simple as this:
`while (x < n)`.
A statement like this should have been less than or equal (`<=`).
So that one character cost me essentially three weeks of debugging.
So what lessons did I learn from that? The first one I learned is that code inspections
have their limits. It's a really good approach initially, when you're just trying to look at the
initial problem and maybe you're not familiar with the code. But the juice quickly
stops being worth the squeeze. The second is bounds. And this might seem like a "programmer 101"
sort of issue, but you should have better bounds testing, especially around the beginning
and end of a limit. So test right at each boundary and just on either side of it. Of course you need to test elsewhere, but those
boundary conditions are the ones that invariably cause problems. The other lesson I learned is
that expecting to reproduce a problem on demand is just not realistic.
That expectation starts because people want to use the debug tools that are most effective for solving problems.
But in a production environment, the debug tools that you use during your initial unit
testing don't really carry over well. On the other hand, logs, whether you use them during
development or in production, provide key information that can help you solve a problem.
Another thing that I learned is that when you're doing debugging that involves
asynchronous processes, you really need to focus on your load testing. Because the process is asynchronous,
the load is going to affect what you see as a result, which makes it more
likely that you're going to have problems. So you should do load testing not just with people using
the system, but by creating test cases that drive the system at different load levels at different times.
And finally, and probably one of the most important lessons I learned,
is that the test team and the developer team need to work closely. In fact, what I recommend is
that they be on the same team, or at a minimum that you do a tour of duty in each other's role.
That way you can learn from these lessons and not have to suffer like I did.