cosmicperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
   I'm no expert at regexps. I've written the following code to get some SEO details from a page of HTML. The only problem is that it isn't fast. In fact, it's plain slow. Can any genius out there give me a hand? The script will potentially be called quite a lot, and I don't want it chewing up all my server's CPU.
while ($html =~ /\b$WHATWANT{'term'}'?s?\b/gis) { $termstotal++; } ## End while
while ($html =~ /<body>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/body>/gis) { $termsbody++; } ## End while
while ($html =~ /<title>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/title>/gis) { $termstitle++; } ## End while
while ($html =~ /<h1.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/h1>/gis) { $termshead1++; } ## End while
while ($html =~ /<h2.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/h2>/gis) { $termshead2++; } ## End while
while ($html =~ /<h3.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/h3>/gis) { $termshead3++; } ## End while
while ($html =~ /<h4.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/h4>/gis) { $termshead4++; } ## End while
while ($html =~ /<h5.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/h5>/gis) { $termshead5++; } ## End while
while ($html =~ /<h6.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/h6>/gis) { $termshead6++; } ## End while
while ($html =~ /alt=\".*?\b$WHATWANT{'term'}'?s?\b.*?\"/gis) { $termsalt++; } ## End while
while ($html =~ /alt=\'.*?\b$WHATWANT{'term'}'?s?\b.*?\'/gis) { $termsalt++; } ## End while
while ($html =~ /<a .*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/a>/gis) { $termsa++; } ## End while
while ($html =~ /<\!--.*?\b$WHATWANT{'term'}'?s?\b.*?-->/gis) { $termscomment++; } ## End while
while ($html =~ /<li.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/li>/gis) { $termsbullet++; } ## End while
while ($html =~ /href=\".*?$WHATWANT{'term'}'?s?.*?\"/gis) { $termshref++; } ## End while
while ($html =~ /href=\'.*?$WHATWANT{'term'}'?s?.*?\'/gis) { $termshref++; } ## End while
while ($html =~ /<p.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/p>/gis) { $termsp++; } ## End while


Thanks

Replies are listed 'Best First'.
Re: Slow regexp
by blokhead (Monsignor) on Feb 02, 2005 at 03:09 UTC
    This is pretty bad. You run a regex on the entire file SEVENTEEN times! Not to mention that the regexes are pretty basic and won't work on nontrivial HTML. And all 17 lines you have are pretty much the same. Try to factor out the common parts.

    Instead of thinking about the problem in terms of regexes, a better way might be to parse the file. You might say "But blokhead, doesn't parsing take longer than just running a regex? And isn't parsing hard?" Well, parsing will be more robust, more extensible, and you will only take one pass through the file, not as many passes as there are tags you care about. I guess you will have to run some benchmarks to be sure. Anyway, sometimes speed should lose to extensibility. And hard? Not really with HTML::Parser.

    Here's code that parses the file, searching for the keyword. It keeps track of the last tag it has seen, and when it finds the keyword, it adds to the appropriate slot of the %seen. It searches both in text and inside the tag attributes (alt, href) that you specify.

    use HTML::Parser;
    use Data::Dumper;

    my $search_term = qr/\b something here \b/ix;
    my @tags_to_search = qw[ title h1 h2 h3 h4 h5 h6 a li p pre img ];
    my @attributes_to_search = qw[ alt href ];

    my %seen;
    my $last_seen_tag;

    sub start {
        my ($tagname, $attr, $text) = @_;
        $last_seen_tag = $tagname;
        for (@attributes_to_search) {
            $seen{$_} += $attr->{$_} =~ m/$search_term/g if $attr->{$_};
        }
    }

    sub text {
        my $text = shift;
        $seen{$last_seen_tag} += $text =~ m/$search_term/g;
    }

    my $p = HTML::Parser->new(
        api_version   => 3,
        start_h       => [\&start, "tagname, attr"],
        text_h        => [\&text, "text"],
        unbroken_text => 1,
        report_tags   => \@tags_to_search
    );
    $p->parse_file("foo.html");

    print Dumper \%seen;
    Update: this script would report finding something "within" an img tag if an img was the last tag it saw when the regex matched. I only have the parser report on img tags so you can peek at their alt attributes. I leave it as an exercise to the reader not to put such non-enclosing tags (like img, br) into $last_seen_tag.

    blokhead

Re: Slow regexp
by Roy Johnson (Monsignor) on Feb 02, 2005 at 03:21 UTC
    You might look at HTML::Parser instead of rolling your own.

    Failing that, you might try combining many of these into one search, so you're not looping through the data so much. Maybe something like:

    while ($html =~ /<(\w+).*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/\1>/gis) {
        my $tag = $1;
        $tag =~ tr/A-Z/a-z/;
        if ($tag eq 'title') {
        }
        elsif ($tag eq 'h1') {
            #....
        }
    }
    Your two patterns for single quote vs double quote could be reduced to one, like
    while ($html =~ /href=(['"]).*?$WHATWANT{'term'}'?s?.*?\1/gis) { $termshref++; } ## End while
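
    The backreference idea can be tried on its own. The following is only a sketch under stated assumptions: count_attr, $term and the sample HTML are made-up names for illustration, not part of the original script. It also departs from the pattern above in two ways: the term is passed through quotemeta, and (?:(?!\1).)*? replaces .*? so that a match cannot run past the closing quote of the attribute it started in.

```perl
use strict;
use warnings;

# Hypothetical helper: count quoted attribute values that contain the term.
# $attr and $term are illustrative parameters, not names from the thread.
sub count_attr {
    my ($html, $attr, $term) = @_;
    my $t = quotemeta $term;    # treat the term as literal text

    # (['"]) captures the opening quote; \1 demands the same closing quote.
    # (?:(?!\1).)*? consumes characters lazily but never the closing quote.
    # (?:'?s)? allows an optional plural or possessive ending.
    my $re = qr/\b$attr=(['"])(?:(?!\1).)*?$t(?:'?s)?(?:(?!\1).)*?\1/is;

    my $count = 0;
    $count++ while $html =~ /$re/g;
    return $count;
}

my $html = q{<a href="/widgets.html">x</a> }
         . q{<a href='/other.html'>y</a> }
         . q{<a href='/Widget/'>z</a>};

print count_attr($html, 'href', 'widget'), "\n";
```

    One pattern now covers both quoting styles, so the two href loops (and the two alt loops) each collapse into a single call.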

    Caution: Contents may have been coded under pressure.
Re: Slow regexp
by holli (Abbot) on Feb 02, 2005 at 03:32 UTC
    As the others already pointed out, HTML::Parser or one of its subclasses is the way to go. Here I use HTML::TokeParser, which saves you some work.
    The code below is not equivalent to yours, but it is something to get you started.
    use strict;
    use HTML::TokeParser;
    use Data::Dumper;

    my %tokens;
    my %tokencount;

    my $p = HTML::TokeParser->new("test.html") or die "Can't open: $!";

    while (my $token = $p->get_token) {
        if ( $token->[0] eq "S" ) {
            $tokens{$token->[1]}++ unless $token->[1] =~ /meta/i;
        }
        elsif ( $token->[0] eq "E" ) {
            $tokens{$token->[1]}--;
        }
        elsif ( $token->[0] eq "T" ) {
            my @words = ( $token->[1] =~ /\b(\w+)/g );
            for ( keys %tokens ) {
                $tokencount{$_} += @words if $tokens{$_} > 0;
            }
        }
    }

    print Dumper (\%tokencount);
    When test.html looks like
    <html lang='en-US'>
    <head>
    <title>Stuff</title>
    <meta name='author' content='Jojo' />
    </head>
    <body>
    <h2>I like potatoes!</h2>
    <h1>Me not!</h1>
    </body>
    </html>
    it will print
    $VAR1 = {
              'h1' => 2,
              'body' => 5,
              'head' => 1,
              'html' => 6,
              'title' => 1,
              'h2' => 3
            };

    holli, regexed monk
Re: Slow regexp
by manigandans (Initiate) on Feb 02, 2005 at 14:36 UTC
    Using symbolic references, the code you've written can be reduced as follows and this will also work faster.

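    @tags = qw[body title h1 h2 h3 h4 h5 h6 li p a];
    @attributes = qw[alt href];

    # Symbolic references ($$tag, $$attribute) only work on package variables
    # and only when "use strict 'refs'" is not in effect.
    foreach $tag (@tags) {
        $$tag++ while ($html =~ /<$tag.*?>.*?\b$WHATWANT{'term'}'?s?\b.*?<\/$tag>/gis);
    }
    foreach $attribute (@attributes) {
        # Match only the current attribute so alt and href are counted separately.
        $$attribute++ while ($html =~ /$attribute=["'].*?\b$WHATWANT{'term'}'?s?\b.*?["']/gis);
    }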
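
    The same loop can be written with a hash of counters instead of symbolic references, which keeps it working under "use strict". This is a minimal sketch, not a drop-in replacement: $term, %count and the sample HTML are assumptions made up for the example, and the pattern is tightened slightly (\b after the tag name so that "p" does not also match <pre>, and a tempered dot so that a match stays inside one element). It still makes one pass over the document per tag, so it tidies the code rather than addressing the single-pass concern raised earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical inputs, for illustration only.
my $term = 'widget';
my $html = <<'HTML';
<title>Widget shop</title>
<h1>Widgets</h1>
<p>We sell a widget.</p>
<p>Nothing here.</p>
HTML

my $t = quotemeta $term;    # treat the term as literal text
my %count;                  # one counter per tag, instead of $body, $title, ...

for my $tag (qw(body title h1 h2 h3 h4 h5 h6 li p a)) {
    # (?:(?!<\/$tag>).)*? cannot cross the closing tag, so each match
    # is confined to a single <tag>...</tag> element.
    my $re = qr/<$tag\b[^>]*>(?:(?!<\/$tag>).)*?\b$t(?:'?s)?\b.*?<\/$tag>/is;

    # The pattern has no capture groups, so a list-context //g match returns
    # one item per match; "() =" turns that list into a count.
    $count{$tag} = () = $html =~ /$re/g;
}

print "$_: $count{$_}\n" for grep { $count{$_} } sort keys %count;
```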
    Mani.