in reply to Re: using xml and perl to perform a search and replace on html files
in thread using xml and perl to perform a search and replace on html files

I should have locked it up.

Outputting the results to STDOUT was a breeze compared with making the changes to the text files.

For some reason the first url replacement listed in the xml file 'xmlfile/urlchange.xml' below is applied to all html files; all other url replacements are ignored.

It's very frustrating because the output to STDOUT looks OK. After the changes are made, why don't they get saved to the files? I can't even begin to debug this thing.

If you have the time to take a look:

The xml file is placed at 'xmlfile/urlchange.xml' relative to the script.

The values of the <OriginPage> elements refer to html files, and the paths are relative to the Perl script once the value of $dir is prefixed onto the string:

my $dir + <OriginPage>/meeting/index.asp</OriginPage>
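
To make that concrete, here is a minimal sketch of how that prefixing might look in Perl (this is not part of the script below, and the names apart from $dir are made up for illustration):

# hypothetical sketch: turn an <OriginPage> value into an on-disk path
my $dir        = "htmlsource/";            # same value as in the script below
my $originpage = "/meeting/index.asp";     # value taken from <OriginPage>

(my $fullpath = $dir . $originpage) =~ s{//}{/};   # collapse the doubled slash
print "$fullpath\n";                               # htmlsource/meeting/index.asp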

<?xml version='1.0'?>
<urls>
  <Sheet1>
    <OriginPage>/meeting/index.asp</OriginPage>
    <LinkToPage>http://quicklinkurl1/index.asp</LinkToPage>
    <LinkToPageStatus>na</LinkToPageStatus>
    <LinkToPageTitle>na</LinkToPageTitle>
    <New_location>xxxxxxx</New_location>
  </Sheet1>
  <Sheet1>
    <OriginPage>/meeting/series/index.asp</OriginPage>
    <LinkToPage>http://quicklinkurl1/index.asp</LinkToPage>
    <LinkToPageStatus>na</LinkToPageStatus>
    <LinkToPageTitle>na</LinkToPageTitle>
    <New_location>new url</New_location>
  </Sheet1>
  <Sheet1>
    <OriginPage>/meeting/lunchtime-meeting/index.asp</OriginPage>
    <LinkToPage>http://quicklinkurl2/index.asp</LinkToPage>
    <LinkToPageStatus>na</LinkToPageStatus>
    <LinkToPageTitle>na</LinkToPageTitle>
    <New_location>changed url</New_location>
  </Sheet1>
  <Sheet1>
    <OriginPage>/meeting/lunchtime-meeting/index.asp</OriginPage>
    <LinkToPage>http://quicklinkurl2/index.asp</LinkToPage>
    <LinkToPageStatus>na</LinkToPageStatus>
    <LinkToPageTitle>na</LinkToPageTitle>
    <New_location>another changed url</New_location>
  </Sheet1>
</urls>

My Perl code is here:

use Cwd;
use XML::Simple;
use Data::Dumper;
use File::Find;
use File::NCopy;

my $dir     = "htmlsource/";
my $newdir  = "xx";
my $xmlfile = "xmlfile/urlchange.xml";
my $file;
my $t = 0;
my $xmldata;
my $fileout;

#### a program to search all files in the directory and perform a search and replace
#### using an xml file as the source of the matched string and the replacement string.

$| = 1;    # force auto flush of output buffer

&doCopy( $dir, $newdir );
&createXMLlook();
&find( { wanted => \&wanted, no_chdir => 1 }, $newdir );

################### functions #######################

#### wanted uses a filehandle to open up all the files and search them line by line
sub wanted {
    #print "$File::Find::name \n";
    return unless -f $_;
    open( FILE, $_ ) || die "Cannot open $_ \n";
    my @data = <FILE>;
    foreach my $line (@data) {
        ## pass the current line of text to the next function ##
        xmlfeed($line);
    }
    close(FILE);
}

### create lookup table containing XML values
sub createXMLlook {
    # create object
    $xml = new XML::Simple();

    # read XML file
    $xmldata = $xml->XMLin($xmlfile);
    #print Dumper($xmldata);
}

#### xmlfeed uses the xml file to get a list of the strings to replace
sub xmlfeed {
    my $linedata = shift @_;
    #print $linedata;

    # processes the xml file one sheet at a time
    # <Sheet1>
    #   <OriginPage></OriginPage>
    #   <LinkToPage></LinkToPage>
    #   <LinkToPageStatus></LinkToPageStatus>
    #   <LinkToPageTitle></LinkToPageTitle>
    #   <New_location></New_location>
    # </Sheet1>
    foreach $e ( @{ $xmldata->{Sheet1} } ) {
        &checkpage( $e, $linedata );
    }
}

### checks that the current page should have the regex applied
sub checkpage {
    my $value;
    my $e           = shift @_;
    my $linedata    = shift @_;
    my $originpage  = $e->{OriginPage};
    my $currentfile = "$File::Find::name";
    $currentfile =~ /(\/)([.a-zA-Z0-9_\-]+)(\/)([.a-zA-Z0-9_\/\-]+)/;
    $currentfile = $1 . $2 . $3 . $4;
    if ( $currentfile eq $originpage ) {
        #print "$originpage\n$currentfile \n\n";
        # perform replacement
        &replace( $linedata, $e->{LinkToPage}, $e->{New_location} );
    }
    else {
        return;
    }
}

#### performs a regex substitution and prints the result to STDOUT
sub replace {
    my $in          = shift @_;
    my $originalURL = shift @_;
    my $newURL      = shift @_;
    my $fileout     = "$File::Find::name";
    #print "FILEOUT $fileout\n ";
    if ( $in =~ m/$originalURL/ ) {
        $t++;
        print "Matched x $t\nEdited file $File::Find::name\n"
            . "String in $in Original value $originalURL\nNew value $newURL\n";
        $in =~ m/(.*)($originalURL)(.*)/;
        #print "$1 \n $2 \n $3\n\n";
        #print "$1 \n $newURL\n $3\n";
        $in =~ s/(.*)($originalURL)(.*)/$1$newURL$3/;
        print "Edited line: $in\n\n";
    }
    $editedcontent .= $in;
    chmod( 0777, "$fileout" ) || print $!;
    open FILEOUT, ">$fileout" or die "can't open $fileout for writing: $!";
    print FILEOUT $editedcontent;
}

#### recursively copies existing folder into new location
sub doCopy {
    my $path    = shift @_;
    my $newpath = shift @_;
    mkdir($newpath) or die "Could not mkdir $newpath: $!";
    my $cp = File::NCopy->new( recursive => 1 );
    $cp->copy( "$path/*", $newpath )
        or die "Could not perform rcopy of $path to $newpath: $!";
}

Re^3: using xml and perl to perform a search and replace on html files
by wfsp (Abbot) on Mar 11, 2007 at 09:52 UTC
    Hi richill!

    This is what I had in mind. Parse the xml file once, building a hash keyed on file name with sub-keys giving the before and after links (see the Dump of the hash below).

    The File::Find callback sub parses each html file that 'exists' in the lookup and changes the link as appropriate.

    The html is written to STDOUT. The file still needs to be written to its final destination.

    Hope that helps and good luck :-)

    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use XML::Simple;
    use File::Find;
    use HTML::TokeParser::Simple;
    use Data::Dumper;

    my $dir     = "C:/perm/monks/files/html";
    my $xmlfile = "C:/perm/monks/xmlfile/urlchange.xml";

    my %lookup = get_lookup($xmlfile);

    find(\&wanted, $dir);

    sub wanted {
        my $file = $_;
        return unless -f $file;

        my $rel = $File::Find::name;
        $rel =~ s/$dir//;
        return unless exists $lookup{$rel};

        my $p = HTML::TokeParser::Simple->new($file)
            or die "couldn't parse $file";

        my $new_html;
        while (my $t = $p->get_token){
            if ($t->is_start_tag('a')){
                my $href = $t->get_attr('href');
                if ($lookup{$rel}{from} eq $href){
                    $t->set_attr('href', $lookup{$rel}{to});
                }
            }
            $new_html .= $t->as_is;
        }

        if (1){
            print "$File::Find::name\n";
            print "$new_html\n";
            print '-' x 20, "\n";
        }

        # todo
        # write new_html
    }

    sub get_lookup{
        my ($xmlfile) = @_;
        my $xml  = new XML::Simple();
        my $data = $xml->XMLin($xmlfile);

        my %lookup;
        for my $sheet (keys %{$data}){
            my @records = @{$data->{$sheet}};
            for my $record (@records){
                $lookup{$record->{OriginPage}} = {
                    from => $record->{LinkToPage},
                    to   => $record->{New_location},
                };
            }
        }
        return %lookup;
    }
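    The '# todo / write new_html' step is left open above. A minimal sketch of one way it could be done (overwriting the source file in place is my assumption here; writing to a separate output tree may be safer):

    # hypothetical helper: write the modified HTML back out
    sub write_html {
        my ($path, $new_html) = @_;
        open my $out, '>', $path
            or die "can't open $path for writing: $!";
        print {$out} $new_html;
        close $out or die "can't close $path: $!";
    }

    # inside wanted(), after the while loop:
    # write_html($File::Find::name, $new_html);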
    Dump of %lookup:
    Sheet1
    $VAR1 = {
      '/meeting/series/index.asp' => {
        'to'   => 'new url',
        'from' => 'http://quicklinkurl1/index.asp'
      },
      '/meeting/lunchtime-meeting/index.asp' => {
        'to'   => 'another changed url',
        'from' => 'http://quicklinkurl2/index.asp'
      },
      '/meeting/index.asp' => {
        'to'   => 'xxxxxxx',
        'from' => 'http://quicklinkurl1/index.asp'
      }
    };
    I created some dummy html files (using file names and links from the xml file) e.g.:
    <html>
    <head>
    <title>index.asp</title>
    </head>
    <body>
    <a href="http://quicklinkurl1/index.asp">link</a>
    </body>
    </html>
    output showing that the links have been changed:
    C:/perm/monks/files/html/meeting/index.asp
    <html>
    <head>
    <title>index.asp</title>
    </head>
    <body>
    <a href="xxxxxxx">link</a>
    </body>
    </html>
    --------------------
    C:/perm/monks/files/html/meeting/lunchtime-meeting/index.asp
    <html>
    <head>
    <title>index.asp</title>
    </head>
    <body>
    <a href="another changed url">link</a>
    </body>
    </html>
    --------------------
    C:/perm/monks/files/html/meeting/series/index.asp
    <html>
    <head>
    <title>index.asp</title>
    </head>
    <body>
    <a href="new url">link</a>
    </body>
    </html>
    --------------------
    update:
    fixed some badly formed html in the dummy files.
    update 2:
    Note that the last two records in your xml refer to the same page and link but with a different 'new location'. The third record was therefore ignored.
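
    If a page ever needs more than one link changed, one option (not what the code above does) would be to store a list of from/to pairs per OriginPage instead of a single pair, roughly:

    # hypothetical variation of get_lookup(): several from/to pairs per page
    push @{ $lookup{ $record->{OriginPage} } },
        {
            from => $record->{LinkToPage},
            to   => $record->{New_location},
        };

    # and in wanted(), check each pair against the current href:
    # for my $pair ( @{ $lookup{$rel} } ) {
    #     $t->set_attr( 'href', $pair->{to} ) if $pair->{from} eq $href;
    # }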

      Thanks very much WFSP. It looks more elegant than my approach, and it's easier to read. Are there other benefits? At the time of writing I believed that:

      TokeParser is more useful for parsing html because it's more robust than a simple regex. However, there is a trade-off in speed.
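
      A rough sketch of the difference I had in mind (the $old/$new names are just placeholders, not from either script): the regex has to anticipate the exact way the attribute is written, while the token parser normalises quoting, spacing and attribute order, at the cost of parsing every tag.

      # naive regex: tied to one quoting style and layout
      $line =~ s/href="\Q$old\E"/href="$new"/g;   # misses href='...' or href = "..."

      # HTML::TokeParser::Simple: the tag is parsed first
      if ( $t->is_start_tag('a') ) {
          my $href = $t->get_attr('href');
          $t->set_attr( 'href', $new ) if defined $href and $href eq $old;
      }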

      I was not sure if %lookup would be needed, because XML::Simple sticks everything in an anonymous hash anyway. I can use that as the lookup.
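
      A rough sketch of what I mean, reusing $xmldata and $rel from the two scripts above: using the XML::Simple structure directly means scanning the Sheet1 array for every file, whereas %lookup is a single hash access.

      # hypothetical: use the XML::Simple output directly as the lookup
      my ($match) = grep { $_->{OriginPage} eq $rel } @{ $xmldata->{Sheet1} };
      if ($match) {
          # $match->{LinkToPage} is the old href, $match->{New_location} the new one
      }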

      But your approach was different and worked, so what was I wrong about?

      Thanks again for your help.