Slow regexp

cosmicperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
I'm no expert at regexps. I've written the following code to get some SEO details from a page of HTML. The only problem is that it isn't fast. In fact, it's plain slow. Can any genius out there give me a hand? The script will potentially be called quite a lot, and I don't want it chewing up all my server's CPU.

while ($html =~ /\b$WHATWANT{'term'}'?s?\b/gis) { $termstotal++; } ## End while

while ($html =~ /<body>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/body>/gis) { $termsbody++; } ## End while

while ($html =~ /<title>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/title>/gis) { $termstitle++; } ## End while

while ($html =~ /<h1.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/h1>/gis) { $termshead1++; } ## End while
while ($html =~ /<h2.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/h2>/gis) { $termshead2++; } ## End while
while ($html =~ /<h3.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/h3>/gis) { $termshead3++; } ## End while
while ($html =~ /<h4.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/h4>/gis) { $termshead4++; } ## End while
while ($html =~ /<h5.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/h5>/gis) { $termshead5++; } ## End while
while ($html =~ /<h6.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/h6>/gis) { $termshead6++; } ## End while

while ($html =~ /alt=\".*?\b$WHATWANT{'term'}'?s?\b.*?\"/gis) { $termsalt++; } ## End while
while ($html =~ /alt=\'.*?\b$WHATWANT{'term'}'?s?\b.*?\'/gis) { $termsalt++; } ## End while

while ($html =~ /<a .*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/a>/gis) { $termsa++; } ## End while

while ($html =~ /<\!--.*?\b$WHATWANT{'term'}'?s?\b.*?-->/gis) { $termscomment++; } ## End while

while ($html =~ /<li.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/li>/gis) { $termsbullet++; } ## End while

while ($html =~ /href=\".*?$WHATWANT{'term'}'?s?.*?\"/gis) { $termshref++; } ## End while
while ($html =~ /href=\'.*?$WHATWANT{'term'}'?s?.*?\'/gis) { $termshref++; } ## End while

while ($html =~ /<p.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/p>/gis) { $termsp++; } ## End while

Thanks


Re: Slow regexp by blokhead (Monsignor) on Feb 02, 2005 at 03:09 UTC
This is pretty bad. You run a regex over the entire file SEVENTEEN times! On top of that, the regexes are pretty basic and won't work on nontrivial HTML. And all 17 lines are pretty much the same, so try to factor out the common parts.

Instead of thinking about the problem in terms of regexes, a better way might be to parse the file. You might say "But blokhead, doesn't parsing take longer than just running a regex? And isn't parsing hard?" Well, parsing will be more robust and more extensible, and you will take only one pass through the file, not as many passes as there are tags you care about. You will have to run some benchmarks to be sure. Anyway, sometimes speed should lose to extensibility. And hard? Not really, with HTML::Parser.

Here's code that parses the file, searching for the keyword. It keeps track of the last tag it has seen, and when it finds the keyword, it adds to the appropriate slot of %seen. It searches both in text and inside the tag attributes (alt, href) that you specify.

    use strict;
    use warnings;
    use HTML::Parser;
    use Data::Dumper;

    my $search_term          = qr/\b something here \b/ix;
    my @tags_to_search       = qw[ title h1 h2 h3 h4 h5 h6 a li p pre img ];
    my @attributes_to_search = qw[ alt href ];

    my %seen;
    my $last_seen_tag;

    # Start-tag handler: remember the tag, then count matches in its attributes.
    sub start {
        my ($tagname, $attr) = @_;
        $last_seen_tag = $tagname;
        for (@attributes_to_search) {
            # "() =" forces list context so every match is counted, not just the first
            $seen{$_} += () = $attr->{$_} =~ m/$search_term/g if $attr->{$_};
        }
    }

    # Text handler: credit matches to the most recently seen tag.
    sub text {
        my $text = shift;
        return unless defined $last_seen_tag;   # text before any reported tag
        $seen{$last_seen_tag} += () = $text =~ m/$search_term/g;
    }

    my $p = HTML::Parser->new(
        api_version   => 3,
        start_h       => [\&start, "tagname, attr"],
        text_h        => [\&text,  "text"],
        unbroken_text => 1,
        report_tags   => \@tags_to_search,
    );
    $p->parse_file("foo.html");

    print Dumper \%seen;

Update: this script would report finding something "within" an img tag if an img was the last tag it saw when the regex matched. I only have the parser report on img tags so you can peek at their alt attributes. I leave it as an exercise to the reader not to put such non-enclosing tags (like img, br) into $last_seen_tag.

blokhead
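One possible take on the exercise above, as a rough sketch rather than blokhead's own solution: keep a stack of open tags instead of a single $last_seen_tag, and never push tags that cannot enclose text. The handler names (on_start, on_end, on_text), the %void list, and the search term 'widget' are all made up for illustration. The handlers are driven by hand here so the sketch runs on core Perl alone; with HTML::Parser they would presumably be registered through start_h, end_h and text_h in the same way as the code above.

```perl
use strict;
use warnings;

# Assumed example term; the '?s? suffix mirrors the original question.
my $search_term = qr/\bwidget'?s?\b/i;

# Hypothetical list of tags that never enclose text.
my %void = map { $_ => 1 } qw(img br hr input meta link);

my (@open_tags, %seen);

# Start-tag handler: remember the tag unless it is a non-enclosing one.
sub on_start {
    my ($tagname) = @_;
    push @open_tags, $tagname unless $void{$tagname};
}

# End-tag handler: pop back to the nearest matching open tag, if any.
sub on_end {
    my ($tagname) = @_;
    for my $i (reverse 0 .. $#open_tags) {
        next unless $open_tags[$i] eq $tagname;
        splice @open_tags, $i;      # drop it and anything left open inside it
        last;
    }
}

# Text handler: credit every match to the innermost open tag.
sub on_text {
    my ($text) = @_;
    return unless @open_tags;
    $seen{ $open_tags[-1] } += () = $text =~ /$search_term/g;
}

# Simulates: <p>widget <img alt="x"> widgets</p> widget
on_start('p');   on_text('widget ');
on_start('img'); on_text(' widgets');
on_end('p');     on_text(' widget');

print "$_: $seen{$_}\n" for sort keys %seen;
```

Both matches inside the paragraph are credited to p, the img tag never becomes the "current" tag, and the trailing text after the closing tag is not counted at all.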
Re: Slow regexp by Roy Johnson (Monsignor) on Feb 02, 2005 at 03:21 UTC
You might look at HTML::Parser instead of rolling your own. Failing that, you might try combining many of these into one search, so you're not looping through the data so much. Maybe something like:

    while ($html =~ /<(\w+).*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/\1>/gis) {
        my $tag = $1;
        $tag =~ tr/A-Z/a-z/;
        if ($tag eq 'title') {
        }
        elsif ($tag eq 'h1') {
            #....
        }
    }

Your two patterns for single quote vs double quote could be reduced to one, like

    while ($html =~ /href=(['"]).*?$WHATWANT{'term'}'?s?.*?\1/gis) { $termshref++; } ## End while

Caution: Contents may have been coded under pressure.
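The "one search" idea can be pushed a little further by keeping the counters in a hash keyed by tag name, so no if/elsif chain is needed. This is only a sketch under assumptions: the sample $html, the term 'widget', and the tag list are invented, and it counts every occurrence inside each element rather than one hit per element as the original loops do. Like any regex approach, it will misbehave on nested elements of the same name.

```perl
use strict;
use warnings;

my $term = quotemeta 'widget';          # assumed search term, escaped once
my $word = qr/\b$term'?s?\b/i;          # compiled once, reused for every element

# Made-up sample input.
my $html = '<title>Widgets</title><h1 class="x">A widget</h1>'
         . '<p>No match here</p><p>Two widgets, one widget</p>';

my %count;

# Single scan: capture the tag name and its content, then count inside the content.
while ($html =~ m{<(title|h[1-6]|li|p|a)\b[^>]*>(.*?)</\1>}gis) {
    my ($tag, $content) = (lc $1, $2);
    $count{$tag} += () = $content =~ /$word/g;   # "() =" counts all matches
}

print "$_ => $count{$_}\n" for sort keys %count;
```

For the sample input this prints `h1 => 1`, `p => 2` and `title => 1`. Using `m{...}` as the delimiter avoids having to escape the slash in the closing tag.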
Re: Slow regexp by holli (Abbot) on Feb 02, 2005 at 03:32 UTC
As the others already pointed out, HTML::Parser or one of its subclasses is the way to go. Here I use HTML::TokeParser, which saves you some work. The code below is not equivalent to yours, but it is something to get you started.

    use strict;
    use HTML::TokeParser;
    use Data::Dumper;

    my %tokens;       # how many times each tag is currently open
    my %tokencount;   # words seen while each tag was open

    my $p = HTML::TokeParser->new("test.html") or die "Can't open: $!";

    while (my $token = $p->get_token) {
        if ( $token->[0] eq "S" ) {          # start tag
            $tokens{$token->[1]}++ unless $token->[1] =~ /meta/i;
        }
        elsif ( $token->[0] eq "E" ) {       # end tag
            $tokens{$token->[1]}--;
        }
        elsif ( $token->[0] eq "T" ) {       # text: credit its words to every open tag
            my @words = ( $token->[1] =~ /\b(\w+)/g );
            for ( keys %tokens ) {
                $tokencount{$_} += @words if $tokens{$_} > 0;
            }
        }
    }

    print Dumper (\%tokencount);

When test.html looks like

    <html lang='en-US'>
    <head>
    <title>Stuff</title>
    <meta name='author' content='Jojo' />
    </head>
    <body>
    <h2>I like potatoes!</h2>
    <h1>Me not!</h1>
    </body>
    </html>

it will print

    $VAR1 = {
        'h1' => 2,
        'body' => 5,
        'head' => 1,
        'html' => 6,
        'title' => 1,
        'h2' => 3
    };

holli, regexed monk
Re: Slow regexp by manigandans (Initiate) on Feb 02, 2005 at 14:36 UTC
Using symbolic references, the code you've written can be reduced as follows, and this will also work faster.

    @tags = qw[body title h1 h2 h3 h4 h5 h6 li p a];
    @attributes = qw[alt href];

    # Each count lands in a package variable named after the tag ($h1, $p, ...),
    # so this must run without "use strict 'refs'".
    foreach $tag (@tags) {
        $$tag++ while ($html =~ /<$tag.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/$tag>/gis);
    }
    foreach $attribute (@attributes) {
        $$attribute++ while ($html =~ /$attribute=["'].*?\b$WHATWANT{'term'}'?s?\b.*?["']/gis);
    }

Mani.
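If symbolic references are not an option (they are disallowed under `use strict`), the same loop can keep its counters in a hash instead. The following is a sketch of that swapped-in technique, not Mani's code: %count, the sample $html, and the term 'widget' are assumptions for illustration, and the regexes keep the weaknesses on real-world HTML noted elsewhere in this thread.

```perl
use strict;
use warnings;

my %WHATWANT = (term => 'widget');             # stand-in for the original hash
my $term = qr/\b\Q$WHATWANT{term}\E'?s?\b/i;   # compiled once, metacharacters escaped

# Made-up sample input.
my $html = '<h1>Widgets</h1><p>A widget</p><p>Nothing</p><img alt="widget">';

my %count;

# One counter per tag name, stored as a hash key rather than a variable name.
for my $tag (qw(body title h1 h2 h3 h4 h5 h6 li p a)) {
    $count{$tag}++ while $html =~ /<$tag\b[^>]*>.*?$term.*?<\/$tag>/gis;
}

# Attributes: the backreference \1 makes the closing quote match the opening one.
for my $attribute (qw(alt href)) {
    $count{$attribute}++ while $html =~ /\b$attribute=(["']).*?$term.*?\1/gis;
}

print "$_ => $count{$_}\n" for sort keys %count;
```

This still scans the document once per tag and attribute, so it tidies the code more than it speeds it up; a benchmark would be needed to say anything about the speed.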