Re: Slow regexp

As the others already pointed out, HTML::Parser or one of its subclasses is the way to go. Here I use HTML::TokeParser, which saves you some work.
The code below is not equivalent to yours, but it should get you started. For each tag it counts the words in the text nested inside that tag.

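use strict;
use warnings;
use HTML::TokeParser;
use Data::Dumper;

my %tokens;        # tag name => number of currently open elements
my %tokencount;    # tag name => words seen inside that element

my $p = HTML::TokeParser->new("test.html") or die "Can't open: $!";

while (my $token = $p->get_token)
{
    if ( $token->[0] eq "S" )        # start tag: [ "S", $tag, ... ]
    {
        # meta has no end tag, so it would stay "open" forever
        $tokens{$token->[1]}++
            unless $token->[1] eq "meta";
    }
    elsif ( $token->[0] eq "E" )     # end tag: [ "E", $tag, ... ]
    {
        # ignore stray end tags that were never opened
        $tokens{$token->[1]}--
            if $tokens{$token->[1]};
    }
    elsif ( $token->[0] eq "T" )     # text: [ "T", $text, ... ]
    {
        my @words = ( $token->[1] =~ /\b(\w+)/g );

        # credit the words to every element that is currently open
        for ( keys %tokens )
        {
            $tokencount{$_} += @words
                if $tokens{$_} > 0;
        }
    }
}

print Dumper(\%tokencount);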

When test.html looks like

<html lang='en-US'>
<head>
<title>Stuff</title>
<meta name='author' content='Jojo' />
</head>
<body>
<h2>I like potatoes!</h2>
<h1>Me not!</h1>
</body>
</html>

it will print

$VAR1 = {
          'h1' => 2,
          'body' => 5,
          'head' => 1,
          'html' => 6,
          'title' => 1,
          'h2' => 3
        };
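
To see where those numbers come from: each text chunk is split with /\b(\w+)/g, and the number of matches is added to every tag that is open at that point. Here is a minimal sketch of just that counting step, using only core Perl. The name count_words is made up for illustration; it is not part of HTML::TokeParser.

```perl
use strict;
use warnings;

# Hypothetical helper: count the \w+ runs in a piece of text,
# the same way the loop above does for each "T" token.
sub count_words {
    my ($text) = @_;
    my @words = ( $text =~ /\b(\w+)/g );
    return scalar @words;    # array in scalar context = number of matches
}

print count_words("I like potatoes!"), "\n";    # 3
print count_words("Me not!"), "\n";             # 2
print count_words("\n"), "\n";                  # 0
```

So h2 gets 3 and h1 gets 2, body gets both (5), and html also gets the 1 from the title (6). Whitespace-only text between tags contributes 0. Note that \w does not match an apostrophe, so something like "don't" would count as two words.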

holli, regexed monk
