This is the true story of a trivial bug I introduced in a Perl program yesterday.

This program parses a 3-megabyte HTML file using the HTML::TreeBuilder module. The program takes less than 30 seconds to run, but that's still a boring wait, and I'd like to be able to see whether it hangs, so I decided to add a progress counter. As I haven't written all of the program yet, most of the run time is currently spent just parsing the HTML file and building a tree representation of it in memory. Thus, I needed a progress counter in the HTML parsing itself (as well as one in the rest of the program).

Before I added the progress counter, all of the HTML parsing happened in a single call to the HTML::TreeBuilder->parse_file method. If I kept that, it would be difficult to add a progress counter to it. Thus, I changed the code to read the HTML file in 64-kilobyte chunks instead, feed each chunk to the parser with the HTML::TreeBuilder->parse method, and print the progress after each chunk according to how much of the file has been read.

I thus wrote this.

    use HTML::TreeBuilder;
    my $filename = ...;
    my $tree = HTML::TreeBuilder->new;
    {
        open my $fileh, "<", $filename or die qq(error opening input html file "$filename": $!);
        binmode $fileh;
        my $filesize = -s $fileh;
        while (read $fileh, my $buf, (1<<16)) {
            $tree->parse($buf);
            printf(STDERR "Parsing html, %2d%%;\r", int(100*tell($fileh)/($filesize+1)));
        }
        $tree->eof;
        print STDERR "Parsing html complete. \n";
    }

This worked fine. I got a comforting progress counter with percentages rolling quickly on the screen.

Later, however, I wanted to work around a bug in the HTML, namely some missing opening tags. The fix can be applied mechanically, because this is a generated HTML file, but it is easier to modify the text of the HTML before parsing it into the tree, because otherwise the tree would have the wrong shape, which would be difficult to fix.

Thus, I chose to do some substitutions on the text of the HTML before parsing it. This was easiest to do by slurping the whole HTML file and running the substitutions on the whole thing. So I changed the code to slurp the file contents and run the substitutions on them, but I still wanted to feed the result to HTML::TreeBuilder in chunks to get a nice progress counter. No big deal, I wrote this.

    use HTML::TreeBuilder;
    my $filename = ...;
    my $tree = HTML::TreeBuilder->new;
    {
        printf STDERR "Reading html file.\n";
        open my $fileh, "<", $filename or die qq(error opening input html file "$filename": $!);
        binmode $fileh;
        local $/;
        my $filec = <$fileh>;
        eof($fileh) or die qq(error reading input html file);
        printf STDERR "Substing html file.\n";
        $filec =~ ...;
        my $filesize = length $filec;
        printf STDERR "Substed html has length %d\n", $filesize;
        my $filetell = 0;
        while (my $buf = substr $filec, 0, (1<<16), "") {
            $filetell += length $filec;
            $tree->parse($buf);
            printf STDERR "Parsing html: %2d%%;\r", int(100*$filetell/($filesize+1));
        }
        $tree->eof;
        print STDERR "Parsing html complete. \n";
    }

This didn't work. The progress counter started showing very high numbers, going up to tens of thousands of percent. I stopped the program because I was worried that it had got into an infinite loop, parsing the same part of the file over and over again, and would build an infinite tree.

After a while, I found the problem. It turns out that the HTML was parsed correctly, only the progress was displayed wrong.

Can you spot the bug? I'll reveal the solution under the fold.
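
For readers who want to check their answer, here is a sketch of what the fix presumably looks like. In the second version, the 4-argument substr has already removed the chunk from $filec by the time the counter is updated, so `length $filec` is the amount still left to parse, not the amount just parsed. Adding `length $buf` instead keeps the counter below 100%. The sub name chunk_progress and its variables are hypothetical, and the sub works on a plain string so it runs without HTML::TreeBuilder. The loop condition is also changed to test the length of the chunk, so that a chunk consisting of the single character "0" does not end the loop early; that is an extra precaution, not part of the bug described above.

```perl
use strict;
use warnings;

# Hypothetical helper: consume $text in chunks of $chunk_size characters
# and return the progress percentage computed after each chunk.
sub chunk_progress {
    my ($text, $chunk_size) = @_;
    my $total = length $text;
    my $done  = 0;
    my @percent;
    # 4-argument substr removes the leading chunk from $text and returns it.
    while (length(my $buf = substr $text, 0, $chunk_size, "")) {
        # Count the chunk just taken, not what is left in $text.
        $done += length $buf;
        # In the real program, $tree->parse($buf) would be called here.
        push @percent, int(100 * $done / ($total + 1));
    }
    return @percent;
}

my @p = chunk_progress("x" x 250, 100);
print "@p\n";    # prints "39 79 99"
```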

Re: How to make a progress counter for parsing HTML with HTML::TreeBuilder
by shmem (Chancellor) on Oct 31, 2014 at 11:42 UTC

    Spotted at a glance.

    And that's a good point for "peer programming" - since we write programs in different ways, we also read code in different ways (TIMTOWTDI applies also for reading), and thus are much better at spotting bugs by others than bugs perpetrated by $self.

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

      I was distracted by the 4 argument substr "bug". I had to update my desk reference which did not mention that option!

      1 Peter 4:10
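
As a small illustration of the 4-argument substr mentioned in the reply above: the fourth argument is a replacement string, so the call modifies the original string in place and returns the part that was replaced. The variable names here are made up for the example.

```perl
use strict;
use warnings;

my $text = "abcdefghij";
# substr EXPR, OFFSET, LENGTH, REPLACEMENT:
# replace the first 4 characters of $text with "" and return them.
my $head = substr $text, 0, 4, "";
print "$head|$text\n";    # prints "abcd|efghij"
```

With an empty replacement, this is a compact way to peel a chunk off the front of a buffer, which is how the loop in the original post uses it.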
Re: How to make a progress counter for parsing HTML with HTML::TreeBuilder
by Anonymous Monk on Nov 06, 2014 at 14:40 UTC

    Sorry for the noob question, but could you please explain how (1<<16) means "read the file in 64 kilobyte chunks"? Thanks!

      1<<16 is 1 shifted left by 16 bits, i.e. 1_0000_0000_0000_0000 binary, i.e. 10000 hex, i.e. 65536 decimal, i.e. 64 kilobytes.
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        Thanks!
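
The conversions in the answer above can be checked with a few lines of Perl. This is only a sketch, and $chunk is a name chosen for the example.

```perl
use strict;
use warnings;

my $chunk = 1 << 16;    # 1 shifted left by 16 bits
printf "binary:  %b\n", $chunk;          # a 1 followed by 16 zeros
printf "hex:     %x\n", $chunk;          # 10000
printf "decimal: %d\n", $chunk;          # 65536
printf "in units of 1024 bytes: %d\n", $chunk / 1024;    # 64
```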