Engineers’ Talk: Mr. Bug-Hunting Bero Brekalo – MES019

Bug-hunting has become a regular topic on this podcast. However, there are many different people out there, each with their own approach to handling bugs. They may have different attitudes, different approaches and different experiences, and it can be worthwhile to get to know them and the details they can share.

In today’s episode, Bero Brekalo is our interview guest. Bero is one of my very first listeners who was not directly connected to me. We got acquainted after the release of Episode 2 and have had a long e-mail discussion about debugging, covering both where our views differ and where they agree. Bero’s very interesting debugging tools in particular made me curious.

The more we discussed and the more familiar I became with Bero, the more I got the impression that he is Mr. Bug-Hunting in person. Today I have the pleasure of presenting Bero Brekalo to you. We talk about many details of debugging, and you will learn more about his approach and his understanding of it. You will also get acquainted with a very experienced engineer who has weathered many storms. Stay with me and enjoy the interview.

Essential Answers Provided In This Episode For:

  • Why do we need debugging?
  • How can binary tracing support you in your search for bugs?
  • What are the essential points you need to know from your logging?
  • How does Bero’s fixed-price strategy help in finding the root cause?
  • What kind of hopeless problems is Bero confronted with?
  • Why is debugging effort regularly underestimated?
  • When should you ask for help?
  • And much, much more.

Selected Links and Resources From This Episode

Thank You For Listening

Out of all the podcasts available on the Internet, you tuned into mine, and I’m grateful for that. If you enjoyed the episode, please share it using the social media buttons at the bottom of this note. I would also be very happy if you took a minute to leave an honest review or rating for the podcast on iTunes or Stitcher. They are extremely helpful for the ranking of the podcast, and I read every single one of them personally! Or, if you prefer more direct contact, don’t hesitate to drop me a note at feedback@embeddedsuccess.com

3 replies
  1. t-o-m-o says:

    I find this interview very interesting, and I can relate to everything Bero said about binary logging and off-line interpretation. Furthermore, I would say that even textual log files in their classical form are NOT human readable. Tens of thousands of acronym-infested cryptic messages are anything but human readable. For example, you can stare at the timestamps for hours and still not see the wood for the trees, even though timestamps are the most valuable resource in the log file. I have found it useful not only to parse the log files with my own tools (scripts), but also to transform them so that the problem becomes visible in some other domain. Visualizing the data graphically in the time domain usually helps a lot: jitter, delays and periodicity show up instantly. As Bero also mentioned, do the same for a healthy log, compare the two graphically and reveal the differences. Sometimes the graph of visualized events (e.g. system call latencies) in the time domain is irregular and not informative by itself. Then I try, for example, signal processing techniques on it: I run it through a DFT algorithm and analyse it in the frequency domain. Wow, what is that spike at a frequency of one per 5 minutes? My increased latencies occur periodically, exactly every 5 minutes. That was certainly not clear from the log file, as 5 minutes could be 10000 lines apart. Then you ask yourself: what happens every 5 minutes in our system? Ha, there is this FTP transfer process running every 5 minutes that causes the flash write I/O latency to increase. Bingo! It was easy to locate once I had that crucial 5-minute clue, and the clue was obvious only in the frequency domain… But you get the idea. Applying principles and techniques from other fields, such as signal processing, big data, statistical analysis, modelling, indexing etc., while debugging complex systems helps a lot IMHO.
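The frequency-domain idea from this comment can be illustrated with a small sketch. It is not the commenter's actual tooling: the function name `dominant_period`, the 10-second sampling interval and the synthetic latency data (a baseline of about 2 ms with a one-minute disturbance every 5 minutes) are all assumptions made for illustration. It uses a naive DFT from the standard library only, which is slow but sufficient for a few hundred samples.

```python
import cmath
import math
import random


def dominant_period(samples, sample_interval_s):
    """Return the period (in seconds) of the strongest non-DC DFT component.

    `samples` must be evenly spaced in time, `sample_interval_s` apart.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(samples) / n
    centered = [x - mean for x in samples]  # remove the DC offset
    best_k, best_mag = 1, -1.0
    # Naive O(n^2) DFT over the positive frequency bins.
    for k in range(1, n // 2 + 1):
        acc = sum(
            centered[t] * cmath.exp(-2j * math.pi * k * t / n)
            for t in range(n)
        )
        if abs(acc) > best_mag:
            best_k, best_mag = k, abs(acc)
    # Bin k corresponds to k cycles over the whole observation window.
    return n * sample_interval_s / best_k


# Synthetic data (an assumption, not a real log): one latency sample every
# 10 s for one hour; every 300 s a disturbance raises latency for 60 s.
random.seed(19)
SAMPLE_INTERVAL_S = 10.0
latencies_ms = []
for i in range(360):
    in_disturbance = (i % 30) < 6
    base = 20.0 if in_disturbance else 2.0
    latencies_ms.append(base + random.uniform(-0.5, 0.5))

period_s = dominant_period(latencies_ms, SAMPLE_INTERVAL_S)
print(f"strongest periodic component: every {period_s:.0f} s")
```

With real logs, the events first have to be resampled onto an even time grid (for example, the maximum latency per 10-second bucket) before a DFT is meaningful.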

    • georg says:

      Hi Tomislav,
      Awesome idea! Treating the logs with a Fourier transform. Especially with millions of log lines in a very short time, you regularly cannot see the wood for the trees. Moving to a different point of view is a very worthwhile approach. Thank you for this hint.
      I have another combination in mind which is based on Big Data: Logstash, Elasticsearch and Kibana, used together to handle large amounts of logging and extract distinct statements. I haven’t had the chance to use it yet, but it looks very promising if you’re flooded with logs and want to get to the meta-level of their meaning.

    • bero says:

      Tomo,
      I have never tried using the frequency domain, and I have never represented time-stamp results graphically. I understand that it could help to understand what was going on. I sometimes have to detect such periodicities by looking at the time-stamp differences between subsequent occurrences of the same trace-point.
      I process these trace-points and their time-stamps in different ways to test various hypotheses. Maybe one day I’ll add a period correlator (it would be easy). The hard part is to identify the trace-points that could be interesting for such a correlator.
      Thanks for the idea. I can see its potential.
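
The time-stamp difference approach mentioned in this reply can be sketched in a few lines. This is only an illustration, not Bero's tool: the record layout `(timestamp_s, trace_point_id)`, the function name `interval_summary` and the trace-point names `FTP_START` and `KEY_PRESS` are hypothetical. For each trace-point it collects the differences between subsequent occurrences and reports the median interval together with the largest deviation from it. A small deviation relative to the median hints at a periodic event.

```python
from collections import defaultdict
from statistics import median


def interval_summary(records):
    """Summarize intervals between subsequent hits of each trace-point.

    `records` is an iterable of (timestamp_s, trace_point_id) tuples,
    sorted by timestamp. Returns {trace_point_id: (median_interval,
    max_deviation_from_median)} for trace-points with >= 3 occurrences.
    """
    last_seen = {}
    deltas = defaultdict(list)
    for timestamp, trace_point in records:
        if trace_point in last_seen:
            deltas[trace_point].append(timestamp - last_seen[trace_point])
        last_seen[trace_point] = timestamp

    summary = {}
    for trace_point, intervals in deltas.items():
        if len(intervals) < 2:
            continue  # too few intervals to say anything about regularity
        mid = median(intervals)
        spread = max(abs(d - mid) for d in intervals)
        summary[trace_point] = (mid, spread)
    return summary


# Hypothetical trace: one periodic and one irregular trace-point.
trace = sorted(
    [(t, "FTP_START") for t in (0, 300, 600, 900)]
    + [(t, "KEY_PRESS") for t in (12, 47, 410, 880)]
)

for name, (mid, spread) in interval_summary(trace).items():
    print(f"{name}: median interval {mid} s, max deviation {spread} s")
```

As the reply points out, the hard part remains choosing which trace-points to feed into such a summary. The sketch simply processes every trace-point it sees.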
