This is the true story of a trivial bug I made in a Perl program yesterday.
This program parses a 3-megabyte HTML file using the HTML::TreeBuilder module. The program takes less than 30 seconds to run, but that's still boring to wait for, and I'd like to see whether it hangs, so I decided to add a progress counter. Now, as I haven't written all of the program yet, most of the time is currently spent just parsing the HTML file and building a tree representation of it in memory. Thus, I needed a progress counter in the HTML parsing itself (as well as one in the rest of the program).
Before I added the progress counter, all of the HTML parsing happened in a single call to the HTML::TreeBuilder->parse_file method. If I kept that, it would be difficult to add a progress counter to it. Thus, I changed the code to instead read the HTML file in 64-kilobyte chunks, feed each of them to the parser with the HTML::TreeBuilder->parse method, and print the progress after each chunk according to how much of the file has been read.
I thus wrote this.
    use HTML::TreeBuilder;
    my $filename = ...;
    my $tree = HTML::TreeBuilder->new;
    {
        open my $fileh, "<", $filename or die qq(error opening input html file "$filename": $!);
        binmode $fileh;
        my $filesize = -s $fileh;
        while (read $fileh, my $buf, (1<<16)) {
            $tree->parse($buf);
            printf(STDERR "Parsing html, %2d%%;\r", int(100*tell($fileh)/($filesize+1)));
        }
        $tree->eof;
        print STDERR "Parsing html complete. \n";
    }
This worked fine. I got a comforting progress counter with percentages rolling quickly on the screen.
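To make the progress arithmetic above concrete, here is a minimal sketch that uses only core Perl. The helper name `progress_percent` and the 200000-byte size are made up for illustration; they are not part of the original program. The `+1` in the denominator is copied from the code above. A plausible reason for it (my assumption, it is not stated in the post) is that it avoids a division by zero on an empty file and keeps the counter at two digits for the `%2d` format.

```perl
use strict;
use warnings;

# Hypothetical helper, not in the original program: percentage of $done
# out of $total, with the same "+1" denominator as the code above.
sub progress_percent {
    my ($done, $total) = @_;
    return int(100 * $done / ($total + 1));
}

# Simulate reading a 200000-byte file in 64-kilobyte chunks.
my $total = 200_000;
my $chunk = 1 << 16;
my $done  = 0;
my @seen;
while ($done < $total) {
    my $left = $total - $done;
    $done += $left < $chunk ? $left : $chunk;   # the last chunk may be short
    push @seen, progress_percent($done, $total);
}
print join(", ", map { "$_%" } @seen), "\n";    # prints: 32%, 65%, 98%, 99%
```

Because of the `+1`, the counter tops out at 99% even once the whole file has been read, so the final "complete" message is what signals the end.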
Later, however, I wanted to work around a bug in the HTML, namely some missing opening tags. This can be done mechanically, because the HTML file is generated, but it is easier to modify the text of the HTML before parsing it into the tree, because otherwise the tree would have a wrong shape that would be difficult to fix.
Thus, I chose to do some substitutions on the text of the HTML before parsing it. This was easiest to do by slurping the whole HTML file and running the substitutions on the whole thing. So I changed the code to slurp the file contents and run the substitutions on them, but I still wanted to feed the result to HTML::TreeBuilder in chunks to get a nice progress counter. No big deal, I wrote this.
    use HTML::TreeBuilder;
    my $filename = ...;
    my $tree = HTML::TreeBuilder->new;
    {
        printf STDERR "Reading html file.\n";
        open my $fileh, "<", $filename or die qq(error opening input html file "$filename": $!);
        binmode $fileh;
        local $/;
        my $filec = <$fileh>;
        eof($fileh) or die qq(error reading input html file);
        printf STDERR "Substing html file.\n";
        $filec =~ ...;
        my $filesize = length $filec;
        printf STDERR "Substed html has length %d\n", $filesize;
        my $filetell = 0;
        while (my$buf = substr $filec, 0, (1<<16), "") {
            $filetell += length $filec;
            $tree->parse($buf);
            printf STDERR "Parsing html: %2d%%;\r", int(100*$filetell/($filesize+1));
        }
        $tree->eof;
        print STDERR "Parsing html complete. \n";
    }
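For readers who haven't met the chunking idiom used in the loop above: four-argument `substr` returns the requested part of the string and replaces that part in place with the fourth argument. The sketch below shows the idiom in isolation, using only core Perl. The ten-character string and the chunk size of 4 are stand-ins I made up for the slurped file and `1<<16`.

```perl
use strict;
use warnings;

my $text  = "abcdefghij";   # stand-in for the slurped file contents
my $chunk = 4;              # stand-in for 1<<16
my @chunks;

# Four-argument substr returns the first $chunk characters and replaces
# them in $text with "", so $text shrinks on every pass.
while (length $text) {
    my $buf = substr $text, 0, $chunk, "";
    push @chunks, $buf;
    printf "got %-6s left in \$text: %d\n", qq("$buf"), length $text;
}
```

This sketch tests `length $text` rather than the truth of the chunk itself. That is a deliberate difference from the loop above: a chunk consisting of exactly the string "0" is false in Perl and would end a truthiness-based loop early. That is unlikely to matter for real HTML, but it costs nothing to avoid.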
This didn't work. The progress counter started showing very high numbers, going up to tens of thousands of percent. I stopped the program because I was worried that it had got into an infinite loop, parsing the same part of the file over and over again, and would build an infinite tree.
After a while, I found the problem. It turns out that the HTML was parsed correctly, only the progress was displayed wrong.
Can you spot the bug? I'll reveal the solution under the fold.