So we have all probably had them. Seemingly random errors, occuring very rarely, hard or next to impossible to reproduce. All the hallmarks of.....dum dum dah.....

The bug from hell.

It started like this. We have an app the interfaces with squid via the redirector hooks. It deals with URLs. Very early on we noticed a single URL that looked like this:

$A¬F$A¬Fww.hotmail.com

We noticed it because it casues an undefined warning as the URL parsing code had not been designed to deal with it. What you could see was that the first 8 bytes of the URL had been replaced by 2 4 byte binary words (the same word as it happened). Now the data centre that hosts the servers had had a power cut and switch to generators on the day in quiestion so we hoped that is was a transient glitch, and optimistically fixed the URL parsing code to deal with corrupt cases gracefully. Then we basically thought no more of it. Actually I had one of those sneaking suspicions that that we had not got to the bottom of the problem and it would be back. I'm sure you know what I mean.....

So time passes, development continues and everything is fine. We don't see any more odd looking URLs, until.....

Suddenly it is back. Not one but hundreds. OK so we have a real problem. Looks kinda C-y with 2 random words but where to start?

Simply hunting down a test case proved to be the first problem. It appeared to be related to a particular website (which had about 300 <img src....> tags in it to load a trillion little images to form the page. But when I went to the site the problem never occurred. When one of my remotely located developers went there it occured not every time but only if he hit refresh several times in close succession. Anyway we finally had a test case or sorts. After looking at browser, etc issues it dawned on us that the only real difference was the connection speed. His was 4 times faster than mine. Combined with the need to use refresh to hit it really hard we knew it was load related. Approacing production stage about the last thing I really wanted to know was that the system developed faults under load! Of course none of the load testing to date had showed it up.....

The search for the casue of this bug went on for 3 days. Everyone downed tools. We looked through all our code, added even more debugging code to squid, recompiled, etc, etc. We could see the error occuring and track its progress but the cause remained obscure. Finally I was looking at our redirector code in a state of increasing despair and I finally noticed this:

sub dummy_ad { ( $file_ext ) = @_; print "$DUMMY_GIF\n" if $file_ext =~ m/gif/i; print "$DUMMY_JPG\n" if $file_ext =~ m/jpg|jpeg/i; print "$DUMMY_PNG\n" if $file_ext =~ m/png/i; print "$DUMMY_HTM\n"; }

Now the way that a redirector hanging off squid works is that it hands the redirector process one line on STDIN and expects one line back on STDOUT. You can send it the same url back (no redirection) or a new one (redirects request to that url). The code in question was being used for ad blocking and had the most trivial of errors in that it would print 2 lines to STDOUT if the file ext was /gif|jpe?g|png/. As squid only expected one line that is all it read leaving unread data hanging on the pipe. Squid actaully has mechanisms to deal with this but they are not 100% reliable and break down under high load.

Knowing what the problem was it was easy enough to fix and trace the flow of events that lead to those 2 dwords in the data. But now to the worst part: I wrote the original buggy code and fixed it about 4 hours later as soon as I noted the undesired effect. That was some 6 months earlier so somewhat lost in the mists of prehistory. However what I did not do was write a test case to prove it was fixed or know that one of the developers had already downloaded a copy of the buggy code (during that 4 hour or so window) which promptly stomped on the bug fix when it got committed.

cheers

tachyon

Replies are listed 'Best First'.
Re: That nagging feeling and the little voice in the back of your head
by dws (Chancellor) on Dec 22, 2003 at 07:38 UTC
    However what I did not do was write a test case to prove it was fixed or know that one of the developers had already downloaded a copy of the buggy code (during that 4 hour or so window) which promptly stomped on the bug fix when it got committed.

    CVS?

    Adding insult to injury, I once fixed a bug and added a unit test, only to have both overwritten.

    But that's a rarity. 99%+ of the time, adding a unit test pays off not only in detected when one particular problem regresses, but also in catching problems as they percolate. The last few times I tracked down problems in a medium-sized J2EE application, I started by running the full set of unit tests. By resolving any failing tests first, even if the tests seemed to have nothing to do with the symptoms we were seeing, problems seemed to voluntary step forward waving surrender flags.

Re: That nagging feeling and the little voice in the back of your head
by liz (Monsignor) on Dec 22, 2003 at 08:55 UTC
    I think there'e a lesson to be learnt here. I always try to program in such a way that CPU usage is at a minimum. I would therefore have programmed the sub in this way from the beginning:
    sub dummy_ad { my $file_ext = lc $_[0]; if ($file_ext =~ m/gif/) { print "$DUMMY_GIF\n"; } elsif ($file_ext =~ m/jpg|jpeg/) { print "$DUMMY_JPG\n"; } elsif ($file_ext =~ m/png/) { print "$DUMMY_PNG\n"; } else { print "$DUMMY_HTM\n"; } }
    I would probably have done some statistical analysis on which extensions were the most common and use that extension in the first check. I'm also not sure whether I would have used a match or not: normally I would use eq in these cases.

    In any case, I think that too much laziness was the real cause of the problem here. Something, I must admit, I also have been guilty of at times and which I try to prevent exactly because of these kinds of problems.

    Glad though that the problem was fixed. There's nothing more troubling than that nagging feeling that your bug fixing was not good enough. And you don't want that in the holiday season!

    Merry XMas!

    Liz

    Update:
    Just realized I would have probably used a hash lookup if I could have used eq to check extensions:

    %DUMMY_AD = ( gif => "$DUMMY_GIF\n", jpg => "$DUMMY_JPG\n", jpeg => "$DUMMY_JPG\n", png => "$DUMMY_PNG\n", ); sub dummy_ad { print $DUMMY_AD{lc $_[0]} || "$DUMMY_HTM\n"; }
Re: That nagging feeling and the little voice in the back of your head
by cLive ;-) (Prior) on Dec 22, 2003 at 15:29 UTC

    ...or know that one of the developers had already downloaded a copy of the buggy code (during that 4 hour or so window) which promptly stomped on the bug fix when it got committed.

    I'm a little confused by this. Why did the other developer download the script from the live site rather than from CVS (I assume you're using CVS - either that or, when committed, the script is now in some insane asylum somewhere :). If they had retrieved it from CVS, they/you would have seen the conflict at commit time.

    What we've found that works is a BSD jail for each developer, and commit direct from the jail or from their machine. That way, we don't touch live code until we have to. Usually, we also need a "hack script" to set up the jail so that it mimics the live environment, but thats often no more than a few soft links to modules etc.

    Our versioning system used to be:

    1. turn around
    2. ask out loud, "Is anyone working on XXX today?"
    3. if no, proceed to do work on test server
    4. if yes, find out what files they were working on and avoid them like the plague

    This actually worked quite well until we went from one room to three rooms :) - but is a method I do NOT endorse anymore - even if you're the only person working on a project.

    Perhaps you can gain brownie points for suggesting amendments to your current practice? Or just give that other developer a slap on the wrist and shout "USE CVS" at them a few times?

    .02

    cLive ;-)

      Our versioning system used to be.....

      We were using the same sort of 'system' at the time. This was only about 2 weeks into the project and pre CVS. We were really just playing with a few ideas at the time.

      cheers

      tachyon