changing data

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

•Re: changing data by merlyn (Sage) on Feb 25, 2002 at 17:41 UTC
HTML::HeadParser picks up meta data extremely efficiently, stopping the parse when the head is complete. I also have a column on walking a web tree automatically picking out the meta keywords for an index. -- Randal L. Schwartz, Perl hacker
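For readers who have not used the module, here is a minimal sketch of reading a meta tag with HTML::HeadParser. The sample document and its `keywords` tag are made up for illustration and are not from the original question; the only assumption about the module is its documented behaviour of exposing `<meta name="X" content="...">` as the header `X-Meta-X`.

```perl
use strict;
use warnings;
use HTML::HeadParser;

# Hypothetical sample document, invented for this sketch.
my $html = <<'HTML';
<html><head>
<title>Sample page</title>
<meta name="keywords" content="perl, html, parsing">
</head><body><p>Body text.</p></body></html>
HTML

my $p = HTML::HeadParser->new;
$p->parse($html);

# <title> is exposed as the "Title" header,
# <meta name="keywords" ...> as "X-Meta-Keywords".
my $title    = $p->header('Title');
my $keywords = $p->header('X-Meta-Keywords');

print "Title:    $title\n";
print "Keywords: $keywords\n";
```

This only reads the meta data; rewriting the tags in place still needs something like the token-based approach discussed in the replies.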
(Ovid -- don't use a regex) Re: changing data by Ovid (Cardinal) on Feb 25, 2002 at 18:43 UTC
Using a regex to parse HTML is typically a bad idea. It's very, very easy to get wrong. The more HTML you have, the more likely the regex will have problems. Using a proper HTML parser will avoid these issues.

What you're looking for is fairly straightforward. Here's a basic shell of what you need (actually, it might be all you need). It uses File::Find to get the docs and HTML::TokeParser to find the meta tags. Original docs will be backed up with a .bak extension. I will confess that I haven't done a huge amount of testing of this solution, so be careful!

```perl
use strict;
use File::Find;
use HTML::TokeParser;

my $bak_ext  = '.bak';
my $root_dir = '/temp';

find(\&wanted, $root_dir);

sub wanted {
    # if the extension fits...
    if ( /\.html?/i ) {
        print "Processing $_\n";
        my $new = $_;
        my $bak = $_ . $bak_ext;
        rename $_, $bak or die "Cannot rename $_ to $bak: $!";
        open NEW, "> $new" or die "Cannot open $new for writing: $!";
        my $p = HTML::TokeParser->new( $bak );
        while ( my $token = $p->get_token ) {
            # this index is the 'raw text' of the token
            my $text_index = $token->[0] eq 'T' ? 1 : -1;

            # it's both a start tag and a meta tag
            if ( $token->[0] eq 'S' and $token->[1] eq 'meta' ) {
                $token->[ $text_index ] =~ s/AA\.//g;
            }
            print NEW $token->[ $text_index ];
        }
        close NEW;
    }
    else {
        print "Skipping $_\n";
    }
}
```

Cheers,
Ovid
Re: (Ovid -- don't use a regex) Re: changing data by Anonymous Monk on Feb 25, 2002 at 19:49 UTC
Thanks Ovid, I tried your script and it still didn't change any of my data as needed. Is there something I am doing wrong?
(Ovid) Re(3): changing data by Ovid (Cardinal) on Feb 25, 2002 at 20:18 UTC
My understanding was that you needed to recursively search through all HTML documents and munge the meta tags. That was a guess because I didn't really know if you wanted HTML docs or not. Here's a list of things to consider:

* Are there any error messages generated?
* Did you change the `$root_dir` variable to point to the root directory of the documents that you wanted to change?
* To determine if we have a correct document type, I use the following regex to check the extension: `/\.html?/i`. Is that correct? If not, update the regex. Also, that regex has a bug. It should be `/\.html?$/i`. Sorry 'bout that. (This bug merely creates extra `.bak` files. It's recoverable.)
* This program lists the files that it is processing and the files that it is skipping. Does that list match your expectations?

Regarding the last bullet point: in the `&wanted` subroutine, `$_` is the current filename you are processing and that is what is getting printed. If you need to tweak which files get processed, this is the variable to take a look at. Read the docs for the File::Find module for more information.

Check those things and you'll have a good idea of how to proceed. I just tossed out the shell of what you were looking for. You'll need to adjust it to suit your personal needs.

Cheers,
Ovid
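To see why the unanchored extension check produces extra `.bak` files, here is a small sketch comparing the two regexes from this reply. The filenames are hypothetical and chosen to include a backup left behind by an earlier run.

```perl
use strict;
use warnings;

# Hypothetical filenames, including a backup from a previous run.
my @names = ('index.html', 'page.htm', 'index.html.bak', 'notes.txt');

# Without the anchor, ".html" may appear anywhere in the name.
my @unanchored = grep { /\.html?/i } @names;

# With "$", the name has to end in .htm or .html.
my @anchored = grep { /\.html?$/i } @names;

print "unanchored: @unanchored\n";
print "anchored:   @anchored\n";
```

The unanchored version also picks up `index.html.bak`, so a second run would back up the backup; the anchored version skips it.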
Re: (Ovid) Re(3): changing data by Anonymous Monk on Feb 26, 2002 at 14:13 UTC
Re: changing data by AidanLee (Chaplain) on Feb 25, 2002 at 16:23 UTC
Well, I can lend a hand on your regex:

```perl
foreach my $line ( grep { $_ =~ /<META NAME="AA\./ } <FILEHANDLE> ) {
    $line =~ s/"AA\.(.+?)"/"$1"/;
}
```

The grep filters the input from the file so you only have to deal with the lines that have the meta tags in question. Then you don't need to escape quotes and equals signs in the regex, rendering your regex much more legible. Reading in directly from the filehandle will also use up less memory at any given point since you don't have to read in the whole file.

Note that this general technique of going line by line will only work for you if the meta tag isn't split up across multiple lines.
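One caveat on the memory point: `grep` evaluates `<FILEHANDLE>` in list context, so every line is read before any filtering happens. A `while` loop reads one line at a time. The sketch below applies the same substitution that way; the sample input and the in-memory filehandle are stand-ins for a real file and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical input standing in for one HTML file.
my $input = <<'HTML';
<html><head>
<META NAME="AA.keywords" CONTENT="perl, html">
<title>Sample</title>
</head></html>
HTML

# An in-memory filehandle keeps the sketch self-contained.
open my $in, '<', \$input or die "Cannot open in-memory input: $!";

my $output = '';
while ( my $line = <$in> ) {
    # Only touch lines that carry the meta tag in question.
    $line =~ s/"AA\.(.+?)"/"$1"/ if $line =~ /<META NAME="AA\./;
    $output .= $line;    # a real script would print to the new file here
}
close $in;

print $output;
```

As with the `grep` version, the match is case-sensitive and assumes the tag sits on a single line.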
Re: Re: changing data by Anonymous Monk on Feb 25, 2002 at 18:38 UTC
I tried it with this but the regular expression part isn't working and it seems to only cover one directory.

```perl
@files = glob('c:\perl\bin\newer\*');
foreach $db (@files) {
    open(DATA, "$db") or die "File does not open: $!";
    @data = (<DATA>);
    close (DATA);
    open(DATA, ">$db") or die "File not open: $!";
    foreach $line (@data) {
        $line =~ s/<meta name\=\AA\.\"/<meta name\=\\./gi;
        print DATA $line;
    }
    close(DATA);
}
```
Re: Re: Re: changing data by AidanLee (Chaplain) on Feb 25, 2002 at 21:28 UTC
Well, my post did not address traversing directories. I'd encourage you to try substituting everything but the first line of your outer foreach loop with what I suggested. If you would like further explanation of what I've done, I'll gladly provide it.
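On the directory question: a single `glob` pattern lists one directory only, while File::Find (used in Ovid's script) walks everything below a starting point. The following is a rough sketch of the difference, built on a throwaway directory tree with made-up file names so it can run anywhere.

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;
use File::Temp qw(tempdir);

# Build a throwaway tree (hypothetical names) so the sketch is self-contained.
my $root = tempdir( CLEANUP => 1 );
my $sub  = File::Spec->catdir( $root, 'nested' );
mkdir $sub or die "Cannot create $sub: $!";

for my $path ( File::Spec->catfile( $root, 'top.html' ),
               File::Spec->catfile( $sub,  'deep.html' ),
               File::Spec->catfile( $sub,  'notes.txt' ) ) {
    open my $fh, '>', $path or die "Cannot create $path: $!";
    close $fh;
}

# glob only sees the one directory it is given...
my @flat = grep { -f } glob( File::Spec->catfile( $root, '*' ) );

# ...while find() calls the callback for every entry below $root.
my @found;
find( sub { push @found, $File::Find::name if -f && /\.html?$/i }, $root );

print "glob found ", scalar(@flat),  " file(s)\n";
print "find found ", scalar(@found), " HTML file(s)\n";
```

Inside the callback, `$_` is the bare file name and `$File::Find::name` is the full path, which is why the extension test works on `$_` directly.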