Using a regex to parse HTML is typically a bad idea. It's very, very easy to get wrong. The more HTML you have, the more likely the regex will have problems. Using a proper HTML parser will avoid these issues.
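To make the risk concrete, here is a small sketch of how a plausible-looking regex goes wrong. The `naive_strip` helper and the sample strings are made up for illustration; they only assume the goal is to strip an `AA.` prefix inside meta tags.

```perl
use strict;
use warnings;

# Hypothetical naive approach: grab anything that looks like a meta tag
# and strip "AA." inside the matched text.
sub naive_strip {
    my ($html) = @_;
    $html =~ s{(<meta\b[^>]*>)}{ my $tag = $1; $tag =~ s/AA\.//g; $tag }gie;
    return $html;
}

# The easy case: nothing unusual inside the tag.
my $simple = '<meta name="keywords" content="AA.one, AA.two">';

# A '>' inside a quoted attribute value ends the [^>]* match early,
# so the "AA." that follows it is never seen.
my $tricky = '<meta name="description" content="a > b, AA.three">';

# The regex has no idea this tag is inside a comment, so it edits it anyway.
my $comment = '<!-- <meta content="AA.four"> -->';

print naive_strip($_), "\n" for $simple, $tricky, $comment;
```

The first string comes out cleaned, the second is left unchanged, and the commented-out tag is modified even though it is not live markup. Each case can be patched with a more elaborate pattern, but the list of special cases keeps growing.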
What you're looking for is fairly straightforward. Here's a basic shell of what you need (actually, it might be all you need). It uses File::Find to get the docs and HTML::TokeParser to find the meta tags. Original docs will be backed up with a .bak extension.
I will confess that I haven't done a huge amount of testing of this solution, so be careful!
    use strict;
    use warnings;
    use File::Find;
    use HTML::TokeParser;

    my $bak_ext  = '.bak';
    my $root_dir = '/temp';

    find( \&wanted, $root_dir );

    sub wanted {
        # Only touch plain files ending in .htm or .html; the anchored
        # pattern keeps earlier .html.bak backups from being reprocessed.
        if ( -f $_ and /\.html?$/i ) {
            print "Processing $_\n";
            my $new = $_;
            my $bak = $_ . $bak_ext;
            rename $_, $bak or die "Cannot rename $_ to $bak: $!";
            open my $out, '>', $new or die "Cannot open $new for writing: $!";
            my $p = HTML::TokeParser->new($bak)
                or die "Cannot open $bak for parsing: $!";
            while ( my $token = $p->get_token ) {
                # this index is the 'raw text' of the token
                my $text_index = $token->[0] eq 'T' ? 1 : -1;
                # it's both a start tag and a meta tag
                if ( $token->[0] eq 'S' and $token->[1] eq 'meta' ) {
                    $token->[$text_index] =~ s/AA\.//g;
                }
                print {$out} $token->[$text_index];
            }
            close $out or die "Cannot close $new: $!";
        }
        else {
            print "Skipping $_\n";
        }
    }
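Since the script rewrites files in place, it may be worth trying the token loop on a string first. The sketch below runs the same loop against an in-memory document; the `strip_meta_prefix` name and the sample HTML are invented for this example, and it relies on HTML::TokeParser accepting a reference to a scalar as its source.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: same token loop as the script, but it reads from a
# string and returns the result instead of writing to a file.
sub strip_meta_prefix {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html )
        or die "Cannot parse HTML string";
    my $out = '';
    while ( my $token = $p->get_token ) {
        # Raw text is element 1 for text tokens, the last element otherwise.
        my $text_index = $token->[0] eq 'T' ? 1 : -1;
        if ( $token->[0] eq 'S' and $token->[1] eq 'meta' ) {
            $token->[$text_index] =~ s/AA\.//g;
        }
        $out .= $token->[$text_index];
    }
    return $out;
}

# Made-up sample: "AA." appears in a meta tag and in body text.
my $sample = '<html><head><meta name="keywords" content="AA.one, AA.two"></head>'
           . '<body><p>AA.body text stays</p></body></html>';

print strip_meta_prefix($sample), "\n";
```

Only the meta tag should lose its `AA.` prefixes; the body text is a text token, not a meta start tag, so it passes through untouched.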
Cheers,
Ovid
Join the Perlmonks Setiathome Group or just click on the link and check out our stats.