in reply to how to quickly parse 50000 html documents?

There was a discussion thread about this a couple of weeks ago.

Short answer: use HTML::TreeBuilder.

PS: It won't be that quick. In my experience, HTML::TreeBuilder takes a significant fraction of a second to load an average HTML document, so multiply that by 50_000 files and you are looking at a day or so of processing time. Also, you need to explicitly delete each HTML tree once you are done with it: the data structures HTML::TreeBuilder builds are self-referential, so they are not freed automatically when they go out of scope, and each tree can consume several megabytes of RAM.
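For illustration, a minimal sketch of that delete-per-file pattern (the @html_files list and the extraction step are placeholders, not from the original post):

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    foreach my $file (@html_files) {    # @html_files: your 50_000 paths (assumed)
        my $tree = HTML::TreeBuilder->new_from_file($file);
        # ... pull whatever you need out of $tree here ...
        $tree->delete;    # break the circular refs so the RAM is actually reclaimed
    }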

Re^2: how to quickly parse 50000 html documents?
by brengo (Acolyte) on Nov 25, 2010 at 22:20 UTC

    Thanks for all the links! Yes, I already looked a bit at HTML::TreeBuilder, but since I understand zilch of it so far I wanted to be sure it is the right tool. The fact that the trees look powerful but carry a lot of overhead made me ask whether a regex would be faster. The web pages look consistent, and a regex that returns the number after the second occurrence of "drill width:" should do fine (see the sketch below).

    Using a combination of grep|sed|tr, each looped over every variable and all the files, my crapcode takes about 28 hours right now, so anything that makes it faster is welcome.
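    A minimal sketch of that regex approach in Perl, assuming the pages really do contain the literal label "drill width:" followed by a number (the file handling here is illustrative, not from the thread):

        use strict;
        use warnings;

        my $file = shift or die "usage: $0 file.html\n";
        open my $fh, '<', $file or die "can't open $file: $!";
        my $html = do { local $/; <$fh> };    # slurp the whole file

        # Collect every number that follows "drill width:"; take the second one.
        my @widths = $html =~ /drill width:\s*([\d.]+)/gi;
        my $second = $widths[1];    # undef if the label appears fewer than twice
        print "$second\n" if defined $second;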

      I agree that HTML::TreeBuilder looks daunting, but it is not that hard once you get used to it. Here is a snippet from a script I wrote recently that uses HTML::TreeBuilder to pull some data out of a table. (Feel free to copy it if you like.)

      sub parseResPage {
          my ( $rawHTML ) = @_;

          my $tree = HTML::TreeBuilder->new_from_content( $rawHTML );

          # We want the second table
          my @tables = $tree->look_down('_tag', 'table');

          # First row is headings, then the data
          my @tableRows = $tables[1]->look_down('_tag', 'tr');
          my $headRow   = shift @tableRows;

          my @headings;
          my $res_hash;

          my @cells = $headRow->look_down('_tag', 'td');
          push @headings, $_->as_text() foreach (@cells);

          foreach my $mainRow ( @tableRows ) {
              my @cells = $mainRow->look_down('_tag', 'td');
              my $iface = $cells[0]->as_text();
              for ( my $i = 0; $i < scalar @cells; $i++ ) {
                  $res_hash->{$iface}{ $headings[$i] } = $cells[$i]->as_text();
              }
          }

          # Explicitly free the memory consumed by the tree.
          $tree->delete();

          return $res_hash;
      }
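      For context, a hypothetical call site might look like this (the file name and the 'eth0'/'Status' keys are invented for illustration; the real keys come from your table's headings):

          my $html = do {
              local $/;
              open my $fh, '<', 'results.html' or die $!;    # example file name
              <$fh>;
          };
          my $data = parseResPage($html);
          print $data->{eth0}{Status}, "\n";    # keys depend on the actual table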

      Tip: If you are not already familiar with the Perl command-line debugger, now is the time to learn. When I am working with HTML::TreeBuilder code, my usual approach is to write a script that just loads the tree and sets a breakpoint afterwards, then run $tree->look_down() commands interactively until I find a combination that gives me what I am looking for. I then paste that back into my editor and use it in my script.
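      A minimal sketch of that scaffold, assuming you run it under the debugger with perl -d (the file-name handling is illustrative):

          use strict;
          use warnings;
          use HTML::TreeBuilder;

          my $file = shift or die "usage: perl -d $0 file.html\n";
          my $tree = HTML::TreeBuilder->new_from_file($file);

          $DB::single = 1;    # debugger stops here; experiment at the prompt
          1;

      From the debugger prompt you can then try things like x $tree->look_down('_tag', 'table') until one of them returns what you want.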

      I suspect that if you write a script that uses HTML::TreeBuilder, it will probably end up being slower than your simple grep-based script. HTML::TreeBuilder is well-optimised Perl written by some clever people, but it contains a lot of code to handle malformed HTML and other corner cases, so it will be slower than a simple regular-expression-based script. Why are you so concerned about speed anyway? How much time have you spent writing these scripts already?
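      If you want numbers rather than guesses before committing to one approach, Perl's core Benchmark module can compare the two on a sample page. A minimal sketch (the file name and the regex are placeholders):

          use strict;
          use warnings;
          use Benchmark qw(cmpthese);
          use HTML::TreeBuilder;

          my $html = do {
              local $/;
              open my $fh, '<', 'sample.html' or die $!;    # one representative page
              <$fh>;
          };

          cmpthese( -5, {    # run each variant for at least 5 CPU seconds
              regex => sub { my @w = $html =~ /drill width:\s*([\d.]+)/gi },
              tree  => sub {
                  my $tree = HTML::TreeBuilder->new_from_content($html);
                  $tree->delete;    # include cleanup cost, as you would per file
              },
          } );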