How to make a progress counter for parsing HTML with HTML::TreeBuilder

This is the true story of a trivial bug I made in a Perl program yesterday.

This program parses a 3 megabyte HTML file using the HTML::TreeBuilder module. The program takes less than 30 seconds to run, but that's still boring to wait for and I'd like to see whether it hangs, so I decided to add a progress counter. Now, as I haven't written all of the program yet, much of the time is currently spent just parsing the HTML file and building a tree representation of it in memory. Thus, I needed a progress counter in the HTML parsing itself (as well as one in the rest of the program).

Before I added the progress counter, all of the HTML parsing happened in just one call of the HTML::TreeBuilder->parse_file method. If I kept that, it would be difficult to add a progress counter to it. Thus, I changed the code to instead read the HTML file in 64 kilobyte chunks, feed each of them to the parser with the HTML::TreeBuilder->parse method, and print the progress after each chunk according to how much of the file has been read.

I thus wrote this.

use HTML::TreeBuilder;
my $filename = ...;
my $tree = HTML::TreeBuilder->new;
{
        open my $fileh, "<", $filename or die qq(error opening input html file "$filename": $!);
        binmode $fileh;
        my $filesize = -s $fileh;
        while (read $fileh, my $buf, (1<<16)) {
                $tree->parse($buf);
                printf(STDERR "Parsing html, %2d%%;\r", int(100*tell($fileh)/($filesize+1)));
        }
        $tree->eof;
        print STDERR "Parsing html complete.   \n";
}

This worked fine. I got a comforting progress counter with percentages rolling quickly on the screen.

Later, however, I wanted to work around a bug in the HTML, namely some missing opening tags. This can be done mechanically, because this is a generated HTML file, but it is easier to modify the text of the HTML before parsing it into the tree, because otherwise the tree would have the wrong shape, which would be difficult to fix.

Thus, I chose to do some substitution on the text of the HTML before parsing it. This was easiest to do by slurping the whole HTML file and doing the substitutions on the whole thing. So I changed the code to slurp the file contents and run the substitutions on them, but I still wanted to feed the result to HTML::TreeBuilder in chunks to get a nice progress counter. No big deal, I wrote this.

use HTML::TreeBuilder;
my $filename = ...;
my $tree = HTML::TreeBuilder->new;
{
        printf STDERR "Reading html file.\n";
        open my $fileh, "<", $filename or die qq(error opening input html file "$filename": $!);
        binmode $fileh;
        local $/;
        my $filec = <$fileh>;
        eof($fileh) or die qq(error reading input html file);
        printf STDERR "Substing html file.\n";
        $filec =~ ...;
        my $filesize = length $filec;
        printf STDERR "Substed html has length %d\n", $filesize;
        my $filetell = 0;
        while (my$buf = substr $filec, 0, (1<<16), "") {
                $filetell += length $filec;
                $tree->parse($buf);
                printf STDERR "Parsing html: %2d%%;\r", int(100*$filetell/($filesize+1));
        }
        $tree->eof;
        print STDERR "Parsing html complete.   \n";
}

This didn't work. The progress counter started showing very high numbers, going up to tens of thousands of percent. I stopped the program because I was worried it had gotten into an infinite loop, parsing the same part of the file over and over again and building an infinite tree.

After a while, I found the problem. It turns out that the HTML was parsed correctly, only the progress was displayed wrong.

Can you spot the bug? I'll reveal the solution under the fold.
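
Since the spoilers in the replies are hidden here, the following is only one reading of the bug, not the author's confirmed answer: the loop adds `length $filec` (what is still left after the chunk was removed) to `$filetell`, where it presumably should add `length $buf` (what was just consumed). A minimal sketch of a corrected loop follows. The helper name `feed_in_chunks` and its callback argument are hypothetical and stand in for the `$tree->parse($buf)` call, so that the sketch runs without HTML::TreeBuilder.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original post: feed $text to the $feed
# callback in $chunk-sized pieces and return the percentage after each piece.
sub feed_in_chunks {
    my ($text, $chunk, $feed) = @_;
    my $total = length $text;
    my $done  = 0;
    my @percent;
    # Loop on the remaining length rather than on the chunk's truth value,
    # so that a final chunk consisting of just "0" is not skipped.
    while (length $text) {
        my $buf = substr $text, 0, $chunk, "";  # 4-arg substr cuts the chunk off $text
        $done += length $buf;                   # count what was consumed, not what is left
        $feed->($buf);                          # stands in for $tree->parse($buf)
        push @percent, int(100 * $done / ($total + 1));
    }
    return @percent;
}

my $parsed  = "";
my @percent = feed_in_chunks("x" x 200_000, 1 << 16, sub { $parsed .= $_[0] });
printf STDERR "Parsing html: %2d%%;\n", $_ for @percent;
```

With this accounting the counter can never exceed 100, because `$done` is at most the total length.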


Replies are listed 'Best First'.
Re: How to make a progress counter for parsing HTML with HTML::TreeBuilder by shmem (Chancellor) on Oct 31, 2014 at 11:42 UTC
Spotted at a glance. [spoiler hidden] And that's a good point for "peer programming" - since we write programs in different ways, we also read code in different ways (TIMTOWTDI applies also for reading), and thus are much better at spotting bugs by others than bugs perpetrated by $self.
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re^2: How to make a progress counter for parsing HTML with HTML::TreeBuilder by ambrus (Abbot) on Oct 31, 2014 at 17:55 UTC
But [spoiler hidden]
Re^3: How to make a progress counter for parsing HTML with HTML::TreeBuilder by shmem (Chancellor) on Nov 01, 2014 at 22:23 UTC
Erm... yes. [spoiler hidden]
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re^2: How to make a progress counter for parsing HTML with HTML::TreeBuilder by GotToBTru (Prior) on Nov 03, 2014 at 15:33 UTC
I was distracted by the 4 argument substr "bug". I had to update my desk reference which did not mention that option!
1 Peter 4:10
Re: How to make a progress counter for parsing HTML with HTML::TreeBuilder by Anonymous Monk on Nov 06, 2014 at 14:40 UTC
Sorry for the noob question, but could you please explain how `(1<<16)` means "read the file in 64 kilobyte chunks"? Thanks!
Re^2: How to make a progress counter for parsing HTML with HTML::TreeBuilder by choroba (Cardinal) on Nov 06, 2014 at 15:09 UTC
`1<<16` is 1 shifted to the left by 16 bits, i.e. 1_0000_0000_0000_0000 binary, i.e. 10000 hex, i.e. 65536 decimal, i.e. 64 KiB (65536 / 1024 = 64).
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
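
The conversions above can be checked with a short script. This is only an illustrative sketch; the variable names `$chunk` and `$line` are made up for it.

```perl
use strict;
use warnings;

# 1 shifted left by 16 bits is the chunk size used with read and substr.
my $chunk = 1 << 16;

# Show the same number in decimal, hex and binary, and as a count of KiB.
my $line = sprintf "%d decimal, %x hex, %b binary, %d KiB",
    $chunk, $chunk, $chunk, $chunk / 1024;
print "$line\n";
# prints "65536 decimal, 10000 hex, 10000000000000000 binary, 64 KiB"
```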
Re^3: How to make a progress counter for parsing HTML with HTML::TreeBuilder by Anonymous Monk on Nov 08, 2014 at 14:04 UTC
Thanks!