Instead of thinking about the problem in terms of regexes, a better way might be to parse the file. You might say "But blokhead, doesn't parsing take longer than just running a regex? And isn't parsing hard?" Well, parsing will be more robust, more extensible, and you will only take one pass through the file, not as many passes as there are tags you care about. I guess you will have to run some benchmarks to be sure. Anyway, sometimes speed should lose to extensibility. And hard? Not really with HTML::Parser.
Here's code that parses the file, searching for the keyword. It keeps track of the last tag it has seen, and when it finds the keyword, it adds to the appropriate slot of %seen. It searches both the text and the tag attributes (alt, href) that you specify.
Update: this script would report finding something "within" an img tag if an img was the last tag it saw when the regex matched. I only have the parser report on img tags so you can peek at their alt attributes. I leave it as an exercise to the reader not to put such non-enclosing tags (like img, br) into $last_seen_tag.

    use strict;
    use warnings;
    use HTML::Parser;
    use Data::Dumper;

    # Placeholder pattern; with /x the literal spaces in it are ignored.
    my $search_term = qr/\b something here \b/ix;
    my @tags_to_search = qw[ title h1 h2 h3 h4 h5 h6 a li p pre img ];
    my @attributes_to_search = qw[ alt href ];

    my %seen;
    my $last_seen_tag;

    # Start tag: remember it, then check the chosen attributes.
    sub start {
        my ($tagname, $attr) = @_;
        $last_seen_tag = $tagname;
        for (@attributes_to_search) {
            $seen{$_} += $attr->{$_} =~ m/$search_term/g if $attr->{$_};
        }
    }

    # Text: credit a match to the most recently seen tag.
    sub text {
        my $text = shift;
        return unless defined $last_seen_tag;   # text before any reported tag
        $seen{$last_seen_tag} += $text =~ m/$search_term/g;
    }

    my $p = HTML::Parser->new(
        api_version   => 3,
        start_h       => [\&start, "tagname, attr"],
        text_h        => [\&text, "text"],
        unbroken_text => 1,
        report_tags   => \@tags_to_search,
    );

    $p->parse_file("foo.html");
    print Dumper \%seen;
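For the exercise mentioned in the update, here is one possible sketch, not the only way to do it. It assumes a hand-picked %is_void set of non-enclosing tags (my own list, adjust to taste), a made-up search term, and an inline HTML string in place of foo.html so it can be run as-is.

```perl
use strict;
use warnings;
use HTML::Parser;

# Assumption: tags that never enclose text. This list is illustrative only.
my %is_void = map { $_ => 1 } qw[ img br hr ];

my $search_term          = qr/\bperl\b/i;   # hypothetical search term
my @tags_to_search       = qw[ title h1 h2 h3 h4 h5 h6 a li p pre img ];
my @attributes_to_search = qw[ alt href ];

my %seen;
my $last_seen_tag;

sub start {
    my ($tagname, $attr) = @_;
    # Only enclosing tags may become the "current" tag for later text.
    $last_seen_tag = $tagname unless $is_void{$tagname};
    # Attributes are still searched for every reported tag, img included.
    for (@attributes_to_search) {
        $seen{$_} += $attr->{$_} =~ m/$search_term/g if $attr->{$_};
    }
}

sub text {
    my $text = shift;
    return unless defined $last_seen_tag;
    $seen{$last_seen_tag} += $text =~ m/$search_term/g;
}

my $p = HTML::Parser->new(
    api_version   => 3,
    start_h       => [\&start, "tagname, attr"],
    text_h        => [\&text, "text"],
    unbroken_text => 1,
    report_tags   => \@tags_to_search,
);

# Sample input standing in for foo.html.
$p->parse('<p>Perl is fun <img src="x.png" alt="perl logo"> and perl is old</p>');
$p->eof;
```

With this input, the text after the img should be credited to p rather than img, while the alt attribute is still searched. Note that this does not track closing tags, so text following a closed element is still credited to it; a stack of open tags would be the next step if that matters.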
blokhead
In reply to Re: Slow regexp by blokhead, in thread Slow regexp by cosmicperl