So we have all probably had them. Seemingly random errors, occuring very rarely, hard or next to impossible to reproduce. All the hallmarks of.....dum dum dah.....
It started like this. We have an app the interfaces with squid via the redirector hooks. It deals with URLs. Very early on we noticed a single URL that looked like this:
$A¬F$A¬Fww.hotmail.com
We noticed it because it casues an undefined warning as the URL parsing code had not been designed to deal with it. What you could see was that the first 8 bytes of the URL had been replaced by 2 4 byte binary words (the same word as it happened). Now the data centre that hosts the servers had had a power cut and switch to generators on the day in quiestion so we hoped that is was a transient glitch, and optimistically fixed the URL parsing code to deal with corrupt cases gracefully. Then we basically thought no more of it. Actually I had one of those sneaking suspicions that that we had not got to the bottom of the problem and it would be back. I'm sure you know what I mean.....
So time passes, development continues and everything is fine. We don't see any more odd looking URLs, until.....
Suddenly it is back. Not one but hundreds. OK so we have a real problem. Looks kinda C-y with 2 random words but where to start?
Simply hunting down a test case proved to be the first problem. It appeared to be related to a particular website (which had about 300 <img src....> tags in it to load a trillion little images to form the page. But when I went to the site the problem never occurred. When one of my remotely located developers went there it occured not every time but only if he hit refresh several times in close succession. Anyway we finally had a test case or sorts. After looking at browser, etc issues it dawned on us that the only real difference was the connection speed. His was 4 times faster than mine. Combined with the need to use refresh to hit it really hard we knew it was load related. Approacing production stage about the last thing I really wanted to know was that the system developed faults under load! Of course none of the load testing to date had showed it up.....
The search for the casue of this bug went on for 3 days. Everyone downed tools. We looked through all our code, added even more debugging code to squid, recompiled, etc, etc. We could see the error occuring and track its progress but the cause remained obscure. Finally I was looking at our redirector code in a state of increasing despair and I finally noticed this:
sub dummy_ad { ( $file_ext ) = @_; print "$DUMMY_GIF\n" if $file_ext =~ m/gif/i; print "$DUMMY_JPG\n" if $file_ext =~ m/jpg|jpeg/i; print "$DUMMY_PNG\n" if $file_ext =~ m/png/i; print "$DUMMY_HTM\n"; }
Now the way that a redirector hanging off squid works is that it hands the redirector process one line on STDIN and expects one line back on STDOUT. You can send it the same url back (no redirection) or a new one (redirects request to that url). The code in question was being used for ad blocking and had the most trivial of errors in that it would print 2 lines to STDOUT if the file ext was /gif|jpe?g|png/. As squid only expected one line that is all it read leaving unread data hanging on the pipe. Squid actaully has mechanisms to deal with this but they are not 100% reliable and break down under high load.
Knowing what the problem was it was easy enough to fix and trace the flow of events that lead to those 2 dwords in the data. But now to the worst part: I wrote the original buggy code and fixed it about 4 hours later as soon as I noted the undesired effect. That was some 6 months earlier so somewhat lost in the mists of prehistory. However what I did not do was write a test case to prove it was fixed or know that one of the developers had already downloaded a copy of the buggy code (during that 4 hour or so window) which promptly stomped on the bug fix when it got committed.
cheers
tachyon
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |