This is pretty bad. You run a regex on the entire file SEVENTEEN times! Not to mention that the regexes are pretty basic and won't work on nontrivial HTML. And all 17 lines you have are pretty much the same. Try to factor out the common parts.
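As a rough idea of what factoring out the common parts could look like, here is a minimal sketch. The original script is not reproduced here, so the helper name count_in_tags, the tag list, and the sample HTML are all made up for illustration. It is still regex-based, so it still makes one pass over the input per tag and still trips over nested or malformed markup.

```perl
use strict;
use warnings;

# Hypothetical helper: one loop over the tag names instead of one
# hand-written regex line per tag.
sub count_in_tags {
    my ($html, $keyword, @tags) = @_;
    my %count;
    for my $tag (@tags) {
        # Naive pattern: assumes well-formed, non-nested <tag>...</tag> pairs.
        while ($html =~ m{<$tag\b[^>]*>(.*?)</$tag>}gis) {
            my $content = $1;
            # "() =" forces list context, so every occurrence is counted.
            $count{$tag} += () = $content =~ /\b\Q$keyword\E\b/gi;
        }
    }
    return \%count;
}

# Made-up sample input.
my $html  = '<h1>Perl tips</h1><p>Learn perl. Perl is fun.</p>';
my $count = count_in_tags($html, 'perl', qw(title h1 p));
printf "%s: %d\n", $_, $count->{$_} for sort keys %$count;
```

Adding another tag is then a matter of extending the list passed in, not copying another line.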

Instead of thinking about the problem in terms of regexes, a better way might be to parse the file. You might say "But blokhead, doesn't parsing take longer than just running a regex? And isn't parsing hard?" Well, parsing will be more robust, more extensible, and you will only take one pass through the file, not as many passes as there are tags you care about. I guess you will have to run some benchmarks to be sure. Anyway, sometimes speed should lose to extensibility. And hard? Not really with HTML::Parser.

Here's code that parses the file, searching for the keyword. It keeps track of the last tag it has seen, and when it finds the keyword, it adds to the appropriate slot of %seen. It searches both in text and inside the tag attributes (alt, href) that you specify.

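use strict;
use warnings;
use HTML::Parser;
use Data::Dumper;

# Placeholder pattern. Under /x the literal spaces are ignored,
# so use \s+ wherever whitespace has to match.
my $search_term          = qr/\b something here \b/ix;
my @tags_to_search       = qw[ title h1 h2 h3 h4 h5 h6 a li p pre img ];
my @attributes_to_search = qw[ alt href ];

my %seen;            # tag or attribute name => number of matches
my $last_seen_tag;   # most recent start tag reported by the parser

# Called for each start tag listed in report_tags.
sub start {
    my ($tagname, $attr) = @_;
    $last_seen_tag = $tagname;

    for (@attributes_to_search) {
        next unless defined $attr->{$_};
        # "() =" forces list context so every match is counted,
        # not just whether there was one.
        $seen{$_} += () = $attr->{$_} =~ m/$search_term/g;
    }
}

# Called for each run of text; credits matches to the last start tag seen.
sub text {
    my $text = shift;
    return unless defined $last_seen_tag;   # text before the first reported tag
    $seen{$last_seen_tag} += () = $text =~ m/$search_term/g;
}

my $p = HTML::Parser->new( api_version => 3,
                           start_h => [\&start, "tagname, attr"],
                           text_h  => [\&text,  "text"],
                           unbroken_text => 1,
                           report_tags => \@tags_to_search
                         );

# parse_file returns undef if the file can't be opened.
$p->parse_file("foo.html") or die "Can't parse foo.html: $!";
print Dumper \%seen;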

Update: this script would report finding something "within" an img tag if an img was the last tag it saw when the regex matched. I only have the parser report on img tags so you can peek at their alt attributes. I leave it as an exercise to the reader not to put such non-enclosing tags (like img, br) into $last_seen_tag.
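One possible way to do that exercise, as a sketch only: keep a lookup of non-enclosing tags and skip them when updating $last_seen_tag. The %is_void list and the example pattern are assumptions, not part of the script above. To keep the sketch free of non-core modules, it feeds the two handlers a hand-written event list where HTML::Parser would normally call them.

```perl
use strict;
use warnings;

my $search_term = qr/\bkeyword\b/i;   # assumed example pattern

# Tags that never enclose text (assumed list; extend as needed).
my %is_void = map { $_ => 1 } qw( img br hr input meta link );

my %seen;
my $last_seen_tag;

sub start {
    my ($tagname, $attr) = @_;
    # Only enclosing tags become the "current" tag.
    $last_seen_tag = $tagname unless $is_void{$tagname};

    for my $name (qw( alt href )) {
        next unless defined $attr->{$name};
        # "() =" forces list context so every match is counted.
        $seen{$name} += () = $attr->{$name} =~ /$search_term/g;
    }
}

sub text {
    my ($text) = @_;
    return unless defined $last_seen_tag;
    $seen{$last_seen_tag} += () = $text =~ /$search_term/g;
}

# Hand-written events standing in for what the parser would deliver for:
#   <p><img alt="keyword logo">the keyword</p>
start('p',   {});
start('img', { alt => 'keyword logo' });
text('the keyword');

print "$_: $seen{$_}\n" for sort keys %seen;
```

The text after the img is credited to p, while the alt attribute is still searched. Text that follows a closing tag is still credited to whichever tag was opened last, so a tag stack driven by an end handler would be the next refinement.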

blokhead

In reply to Re: Slow regexp by blokhead
in thread Slow regexp by cosmicperl
