Engineers’ Talk: Mr. Bug-Hunting Bero Brekalo – MES019
Talking about bug-hunting has become a regular topic in this podcast. However, there are tons of different people out there, each using their own approach to handling bugs. They all have different attitudes, different approaches, and different experiences, and it is worth getting more familiar with them and the details they can share.
In today’s episode we have Bero Brekalo as our interview guest. Bero is one of my very first listeners not directly related to me. We got acquainted after the release of Episode 2 and have had a long email discussion about debugging and our differing and common understanding of it. Bero’s very interesting debugging tools in particular made me curious.
The more we discussed and the more familiar I became with Bero, the more I got the impression that he is Mr. Bug-Hunting in person. Today I have the pleasure of presenting Bero Brekalo to you. We talk about many details of debugging: you get further insights into his approach and his understanding of debugging, and you get acquainted with a very experienced engineer who has weathered a lot of storms and bad weather. So stay with me and enjoy the interview.
Essential Answers Provided In This Episode:
- Why do we need debugging?
- How can binary tracing support you in your search for bugs?
- What are the essential points you need to learn from your logging?
- How does Bero’s fixed-price strategy help in finding the root cause?
- What kind of hopeless problems is Bero confronted with?
- Why is debugging effort regularly underestimated?
- When should you ask for help?
- And much, much more.
I find this interview very interesting, and I can relate to everything Bero said about binary logging and offline interpretation. Furthermore, I would say that even textual log files in the classical form are NOT human readable: tens of thousands of acronym-infested, cryptic messages are anything but. For example, you can stare at the timestamps for hours and not see the wood for the trees, and timestamps are the most valuable resource in a log file.

I found it useful not only to parse the log files with my own tools (scripts), but also to transform them in order to make the problem visible in some other domain. Usually, visualizing them graphically in the time domain helps a lot: jitter, delays, and periodicity instantly show up. As Bero also mentioned, do the same for a healthy log, compare the two graphically, and the differences are revealed.

Sometimes the graph of visualized events (e.g. system-call latencies) in the time domain is irregular and not informative in itself. Then I try signal-processing techniques on it: for example, I run it through a DFT algorithm and analyse it in the frequency domain. Wow, what is that spike at a frequency of 1/(5 minutes)? My increased latencies occur periodically, exactly every 5 minutes. That was certainly not clear from the log file, as 5 minutes could be 10,000 lines apart. Then you ask yourself: what happens every 5 minutes in our system? Ha: there is this FTP transfer process running every 5 minutes that causes the flash-write I/O latency to increase. Bingo! It was easy to locate once we had that crucial 5-minute clue, and that clue was obvious only in the frequency domain. But you get the idea: applying principles and techniques from other fields, such as signal processing, big data, statistical analysis, modelling, indexing, etc., while debugging complex systems helps a lot, IMHO.
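The frequency-domain trick above can be sketched in a few lines of Python with NumPy. Everything here is simulated for illustration: the 5-minute period, the 10-hour window, the jitter, and the 1-sample-per-second binning are all assumptions for the demo, not taken from a real log.

```python
import numpy as np

# Simulated log timestamps (in seconds): an event fires roughly every
# 5 minutes over ~10 hours, with some jitter, as a real latency spike would.
rng = np.random.default_rng(0)
base = np.arange(300.0, 35700.0, 300.0)
timestamps = np.clip(base + rng.normal(0.0, 20.0, base.size), 0, 35999)

# Bin the events into a 1-sample-per-second time series and remove the mean
# (the DC component would otherwise dominate the spectrum).
duration = 36000
series = np.zeros(duration)
np.add.at(series, timestamps.astype(int), 1.0)
series -= series.mean()

# DFT: the dominant spike in the magnitude spectrum reveals the hidden period.
spectrum = np.abs(np.fft.rfft(series))
freqs = np.fft.rfftfreq(duration, d=1.0)        # frequencies in Hz
peak_freq = freqs[np.argmax(spectrum[1:]) + 1]  # skip the zero-frequency bin
print(f"dominant period: {1.0 / peak_freq:.0f} s")  # ~300 s, i.e. 5 minutes
```

In the raw timestamp list the periodicity is buried thousands of lines apart; after the transform it is a single obvious peak at 1/300 Hz.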
Awesome idea, analysing the logs with a Fourier transform! Especially in cases with millions of log lines in a very short time, you regularly cannot see the wood for the trees. Moving to a different point of view is a very worthwhile approach. Thank you for this hint.
I have another combination in mind, based on big data: Logstash, Elasticsearch, and Kibana as a stack to handle large amounts of logging and extract distinct statements. I haven’t had the chance to use it yet, but it has great potential if you’re flooded with logs and want to reach the meta-level of their meaning.
I have never tried the frequency domain, and I have never represented timestamp results graphically, but I can see how it would help to understand what was going on. I sometimes have to detect such periodicities by looking at the timestamp differences between subsequent occurrences of the same trace point.
I process these trace points and their timestamps in different ways to test various hypotheses. Maybe one day I’ll add a period correlator (it would be easy). The hard part is identifying the trace points that could be interesting for such a correlator.
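One possible shape of such a period correlator, sketched in Python. The record format, function name, and regularity threshold below are my own illustrative assumptions, not Bero’s actual tooling: group timestamps per trace point, compute the gaps between subsequent occurrences, and flag trace points whose gaps are nearly constant.

```python
from collections import defaultdict
import statistics

def find_periodic_tracepoints(records, max_rel_spread=0.05):
    """Return {trace_point: mean_period_s} for points with regular gaps.

    records: iterable of (timestamp_seconds, trace_point_id) tuples.
    A trace point counts as periodic when the relative spread of its
    inter-occurrence gaps (stdev / mean) stays below max_rel_spread.
    """
    by_point = defaultdict(list)
    for ts, point in records:
        by_point[point].append(ts)

    periodic = {}
    for point, stamps in by_point.items():
        stamps.sort()
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        if len(gaps) < 3:
            continue  # too few occurrences to judge regularity
        mean = statistics.fmean(gaps)
        spread = statistics.pstdev(gaps)
        if mean > 0 and spread / mean <= max_rel_spread:
            periodic[point] = mean
    return periodic

# Toy trace: "ftp_start" fires every 300 s; "irq" fires irregularly.
trace = [(t, "ftp_start") for t in range(0, 3000, 300)]
trace += [(t, "irq") for t in (3, 50, 61, 200, 777, 1500, 2999)]
result = find_periodic_tracepoints(trace)
print(result)  # only ftp_start is reported, with a 300 s period
```

The relative spread of the gaps is a crude regularity measure; the hard part mentioned above, choosing which trace points are worth feeding into the correlator, remains.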
Thanks for the idea. I can see its potential.