In a world of bugs – how to become a successful bug-hunter? – MES002
In this episode we will discuss why bug-hunters are a scarce resource, and how to improve your skills to become a more successful bug-hunter in the embedded world.
I will give you my seven main bullet points to take your bug-hunting skills to a new level.
Why do we need bug-hunters?
Hunting hardware and software bugs is neither a profession nor a formal education. Most curricula either ignore the existence of bugs or simply assume that you and others do not make errors.
The only training you might receive, or are forced to experience, is training on the job. The knowledge about bug-hunting is mainly empirical. That means you need a lot of intent and also a lot of experience to become better. Many of us have gone down this painful and stony road.
This episode is about presenting the experiences I have collected over the years. I will give you some insights to help you become a solid bug-hunter much faster and more reliably. Moreover, following these tracks might even make you think more consciously about preventing bugs altogether.
What’s missing for bug-hunters?
Guidelines and the distribution of knowledge. There is no regular exchange of experiences and failure reports. Most engineers learn only from their own mistakes. However, there are tens of thousands of others who could share their experiences, too.
Even the more experienced engineers would take the chance to improve their own bug-hunting approaches if they could learn from others. But there is no distribution. There is no sharing with a bigger audience.
I was confronted with a lot of bugs and problems during my professional life. Perhaps you’ve experienced the same. Let’s use this episode to share our experiences and learn from each other to become more effective and more efficient engineers.
Let’s take this opportunity and collect some bug-hunting approaches and learn from each other.
How to improve your bug-hunting skills
Use my list of seven bullet points to become a successful bug-hunter. These details are the result of more than two decades of bug-hunting and bug-prevention.
Two different general approaches to look out for bugs
- The destructive approach: instrumenting your code or drilling down into the functions
- The non-destructive approach: comparing or jumping back in the revision history and getting the big picture about the constraints and real requirements of the system
The detective’s mindset – feel the excitement of hunting
These are my six main factors of becoming a real detective:
- Collect facts
- Collect suspects
- Do not trust witnesses
- Create your own track
- If nothing runs, switch to Sherlock Holmes mode
- Play the devil’s advocate or Good Cop – Bad Cop game
Change your mindset and assume the system behaves correctly
- What’s wrong in your thinking if the system is correct?
- What does the code path look like if the observed situation is assumed to be correct?
- Where is the thinking loop which prevents you from achieving the desired outcome?
Use internal resources and do not fight alone.
- Use your internal skill matrix.
- Expand your personal who’s who.
Do not underestimate the time for failure maintenance
Avoid time constraints and customer pressure. Bug-fixing costs time, very often a lot of time, which blocks you from other activities.
Narrate your bugs and findings
Let others participate. Very often people try to hide bugs, especially when they are perceived as embarrassing rookie bugs. But do not be shy. Take the first step and announce your findings officially, keeping the greater success of the whole company in mind. See the Japanese approach of best-practice sharing, the Yokoten-kai (alternatively here).
There is no multitasking in human brains
Very often overlooked, very often thwarted. Recent research has revealed that every context switch costs at least 30 minutes to get back to the original position in your thinking. This is a call to all leading persons: do not burn up your engineers’ brain-power by pushing them from one topic to another. Let them concentrate. On one issue. At a time. For at least 4-8 hours.
Now I’d love to hear from you: what’s your experience in hunting bugs in software and hardware?
- How do you do bug-hunting?
- What are your preferred approaches for bug-hunting?
- What kind of challenges do you observe in being a bug-hunter?
Do you have a habit or technique that I didn’t list here? Or do you want to agree or disagree with me about these approaches? I’d love to hear from you. Please comment on the show-notes at embeddedsuccess.com/episode02 and let me know your experience, your thinking, what you’re using and how you’re using it.
I use the “destructive approach” by instrumenting the code with binary traces. This turns out to be non-destructive.
The developer indicates what needs to be traced inside Eclipse by clicking on the “lines of interest” and selecting which variables must be traced.
The original code stays untouched until it gets compiled; at that point the binary trace macros are inserted. The information about what should be traced is kept in a trace description, which contains all the data needed to make sense of the binary data collected on the target (the binary traces), for all the different source files. The traces are inserted just before compiling, as part of the build process.
For different experiments I can have different trace descriptions.
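As an illustration, one entry of such a trace description could look roughly like this (the format and field names here are purely my own invention for the sake of the example, not the actual tool’s format):

```
# hypothetical trace description entry (invented format)
file:  motor_ctrl.c
line:  128              # a "line of interest" selected in the IDE
vars:  rpm, err_code    # variables to capture at this trace point
```

Keeping several such files around, one per experiment, is what makes switching between different instrumentations cheap.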
Binary tracing is quite non-intrusive (typically 4-7 instructions to trace one location + 5-10 for each variable).
This tracing can be done inside interrupt handlers; no locking is needed on the target, because the host will deal with nesting inside a trace point if the tracing action is interrupted by another one that produces traces.
Depending on the HW and the problem at hand, traces are collected in a circular buffer, either small (~2 KB) or large (several MB).
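To make the target side concrete, here is a minimal sketch of such a trace point writing fixed-size records into a small circular buffer. The record layout, IDs and sizes are my own simplifying assumptions, not the actual tool’s format; the real macros handle variable-length payloads and leave untangling of interrupt-nested records to the host.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a binary trace point. Record format and sizes are
 * illustrative assumptions, not the real tool's layout. */

#define TRACE_BUF_SIZE 2048u           /* the "small" ~2 KB buffer */

typedef struct {
    uint16_t id;        /* identifies the trace point; the host resolves
                           it via the trace description */
    uint16_t len;       /* payload length in bytes */
    uint32_t payload;   /* one traced variable, for simplicity */
} trace_rec_t;

static uint8_t  trace_buf[TRACE_BUF_SIZE];
static uint32_t trace_head;            /* total bytes written so far */

/* Only a handful of instructions per location: build the record,
 * copy it into the ring, advance the write index. Wrap-around may
 * split or overwrite the oldest record; host-side linearization
 * sorts that out later. */
static void trace_emit(uint16_t id, uint32_t value)
{
    trace_rec_t rec = { id, sizeof value, value };
    uint32_t off = trace_head % TRACE_BUF_SIZE;
    uint32_t n   = sizeof rec;

    if (off + n <= TRACE_BUF_SIZE) {
        memcpy(&trace_buf[off], &rec, n);
    } else {                            /* record wraps around the end */
        uint32_t first = TRACE_BUF_SIZE - off;
        memcpy(&trace_buf[off], &rec, first);
        memcpy(&trace_buf[0], (uint8_t *)&rec + first, n - first);
    }
    trace_head += n;
}

#define TRACE_VAR(id, var) trace_emit((id), (uint32_t)(var))
```

After the run, `trace_buf` and `trace_head` are simply dumped to the host, where the trace description gives the raw bytes their meaning.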
At the end of the experiment, traces are extracted and processed on a development PC.
Processing on the host is usually done inside Eclipse, where you can walk forward and backward through your trace, displaying variables that are traced or that can be deduced from the ones that are traced. While walking through your trace you can enable or disable some trace-points.
The stack is also displayed at each trace-point by using information that is traced together with the call-graph that is reconstructed from the sources.
Of course you can also print the traces in a predefined or user defined format.
thanks a lot for your comments.
I have seen and used binary tracing on machine-control systems right in the midst of the IRQ handling (as mentioned by you, too). It does not cause a big impact, as regularly the binary data is not changed or maintained in any way, but simply copied into some DMA/dual-buffer RAM or similar. From there it is copied after runtime, or during idle time, to the host system. Only there is the binary data manipulated or interpreted.
But do I understand correctly that your approach needs a compile/link/build cycle for every change in the debugging configuration? What are your experiences with round-trip cycle times? And do you leave the instrumented code in place for official releases? I mean, even if it is a very small intrusion, it still has some impact on runtime.
I don’t have that much experience with Eclipse – see, I’m a vi fellow 🙂 – but it sounds like you’re using some kind of plugin to manipulate the code/source for this debugging purpose. What is it?
You are correct: when you change something concerning tracing (add/remove a trace-point, add/remove variables to be traced) you have to go through the compile/link/image-creation cycle after the tool has updated the files with the trace modifications. In my experience this takes less than 5 minutes in the case of a Linux kernel rebuild (only modified files are compiled). The time needed to upload it onto the target depends on the interface between the target HW and your build environment.
Concerning the traces inside the final release I have two customers where the trace collection (into a circular buffer) is in the delivered code, because they want to be able to debug future problems that occur in delivered systems.
For most customers the traces are inserted to understand and identify the (root) cause of their problem (crashes, hangs, unexplainable behavior, …). After the problem is understood and solved, the traces are used to prove that the changed code handles the problem correctly (when it occurs). After that, a build without traces is tested extensively and delivered.
Most customers want to keep the trace description file(s) and the collected binary traces, in case that some further issues are detected. They usually place these files under their version control system.
One such trace description file contains all the info about the trace points of all involved source files and can be reused again if needed. Even if the customer modifies the code the tool will be able to (automatically and/or interactively) merge the traces into the modified files.
After the traces are collected on target and transferred to the host, they are processed using their trace description file.
During this processing the tool
– linearizes the binary file:
– the oldest record could have been partially overwritten
– while writing one trace record, an interrupt may have occurred that placed its trace record in the middle of another one
– prints a human-readable trace log into a file
– counts, for each trace-point, how often it occurred (you don’t want to waste time going through a trace just to find out that something didn’t happen)
– reconstructs the call stack for each trace record (combining static code analysis (the call graph) with data from the collected trace)
– extracts, for each trace record, the values of the traced variables and deduces the values of non-traced variables (if possible)
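To make the linearization step concrete, here is a minimal host-side sketch. Fixed-size records and a buffer size that is a multiple of the record size are simplifying assumptions of mine; the real tool additionally has to drop a partially overwritten oldest record and untangle interrupt-nested ones.

```c
#include <stdint.h>
#include <string.h>

/* Host-side sketch: put a wrapped circular trace buffer back into
 * oldest-to-newest order. Record format is assumed fixed-size. */

#define BUF_SIZE 16u                   /* tiny buffer for illustration */
#define REC_SIZE 4u                    /* fixed record size in bytes */

/* 'total' is the number of bytes ever written to the ring.
 * Copies the surviving records into 'out' in chronological order
 * and returns the number of bytes copied. */
static size_t linearize(const uint8_t *buf, size_t total, uint8_t *out)
{
    if (total <= BUF_SIZE) {           /* never wrapped: already linear */
        memcpy(out, buf, total);
        return total;
    }
    /* the next write position is also where the oldest data starts */
    size_t head = total % BUF_SIZE;
    memcpy(out, buf + head, BUF_SIZE - head);
    memcpy(out + (BUF_SIZE - head), buf, head);
    return BUF_SIZE;
}
```

With variable-length records the same idea applies, except that the record straddling the split point is incomplete and must be discarded before decoding.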
After this step you open Eclipse and get the opportunity to load a previously saved collection of Points Of Interest (places you would like to look at first). You can add/remove POIs and start the investigation.
During the investigation you can step forward and backward between previous/next trace records or POIs.
Most people that I have met still use printf-based logging and spend a huge amount of time doing things that could be automated, because they don’t want to invest in automating things: their customer doesn’t want to pay for tool development, and they (and their boss) don’t want to invest unpaid time in tooling.
The tool (plugin for eclipse and external programs) evolved over years to include automation for tasks that are needed when investigating problems. Currently I’m dealing with multicore tracing.
Example: when you see something in a log file (binary or text), you very often want to know how you got there and what the values of some variables are.
Doing that by hand is very tedious and error-prone (especially if interrupts may intervene and …). The algorithm that you execute is relatively simple (interrupted flows are a bit trickier but …), so this is a perfect candidate to be automated.
If you want to get some screenshots you can send me an e-mail (you know my e-mail).
once again thank you for your thorough explanation and concise description.
I can imagine that customers accept the debug-code to stay in place – the benefit they get in case of crashes or problems is worth the price anyway.
I first read about this binary debugging approach in Dr. Dobb’s Journal sometime in the 80’s of the last century. Of course it was not as elaborate and mature at that time as what you’re describing.
Sadly, in the meantime I have been confronted with a lot of weird approaches and attitudes regarding logging and debugging. You have already mentioned the printf() approach, which seemingly cannot be wiped out. Or gigabytes of logfiles containing each and every detail, followed by an architect’s statement that “grep is your friend”. Gosh, I was really bewildered. That’s not only a waste of effort but also a significant decline in the achievable performance.
In my opinion the binary approach is a great way for developers to have their debugging data available on the spot. However, I have seen rather big projects with hundreds of testers all running different test-cases with different test conditions and harnesses. They all needed a different setup of the logging system, as it was used to indicate the system’s correct behavior. This might not be realizable with the binary approach.
Logging concepts are one of my favorite topics. I am rather passionate about them because I was responsible for logging and debugging subsystems for years, and I saw some rather awkward things. I want to provide some details on how to improve logging approaches. The corresponding episode will be ready sometime in late summer (northern hemisphere 🙂 ) or at the beginning of autumn.
The way I implemented binary traces, the trace description is not part of the code but is kept in a separate file. This allows each of the testers to have their own trace description and hence their own generated code.
If you have two tests that you want to run, each requiring different trace-points, they would just get two different trace descriptions and do their runs independently. After the tests have been run, each of them gets his own trace and description. There is no problem scaling the binary approach.
oh, that puzzles me now. A binary-trace implementation that is that versatile and fully configurable at runtime? Wouldn’t that mean you have some general hooks implemented, like the memory-allocation/deallocation checking done by Valgrind? Or is simply every line of code instrumented?
Traces are inserted at compile time. The flexibility/configurability is provided by the fact that you can have different instrumentations for the same source code, and that, for the same trace produced by one run, different people can have different POIs (trace filters). I use that regularly when cooperating with my colleagues:
– A thinks that the problem could best be traced in one way, while B thinks that another set of traces would be better. Each instruments the code in his own way; both versions are compiled and run, producing two different traces (the results of two different runs).
– In the case of one trace produced by one run, A and B could have different theories of what went wrong, or want to investigate the part of the code that they are familiar with. In that case A and B will have two different trace filters. However, when A thinks that he has found something suspicious in the other part, he can provide (parts of) his filter to B, and B can then cooperate in the investigation. Remember that you can “run” a trace from POI to POI, and over all trace records (unfiltered), forward and backward.
I hope that this explains how the flexibility using trace definition and trace filtering allows different people to have a different focus.
Because the trace descriptions are kept outside the source files, different tracings can easily be implemented. By using different trace filters, the analysis of the traces can be differentiated as well.