richill has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I've got a query quite similar to "HTTP::Request pipe through regex". I'm trying to search through a lot of HTML files and perform a search and replace. The strings to be replaced and their replacements are contained in an XML file.

I've managed to get this working, but that's no guarantee it's any good. Could you please have a look at it and tell me if there's a better way of doing this?

There are some improvements I'd like to make soon.

The data originally comes from an Excel spreadsheet, so converting it to XML using Perl would be nice.

I'm outputting to STDOUT at the moment, but I'll add another filehandle later.

For now I'm just interested in coding in a better style. I'm worried that creating a new XML::Simple object for every line is a bit wasteful, and that I can end up editing non-HTML files. Is there anything else? Thanks

use Cwd;
use XML::Simple;
use Data::Dumper;

my $dir = "files/html";
my $xmlfile = "";    # need a full filepath because a relative filepath is
                     # affected by File::Find changing the current directory
my $file;
my $t = 0;

#### search all files in the directory and perform a search and replace
#### using an xml file as the source of the matched string and the
#### replacement string.
use File::Find;
find(\&searchlines, $dir);

################### functions #######################

#### searchlines uses a filehandle to open up all the files and search
#### them line by line
sub searchlines {
    #print "$File::Find::name \n";
    return unless -f $_;
    open(FILE, $_) || die "Cannot open $_ \n";
    my @data = <FILE>;
    foreach my $line (@data) {
        ## pass the current line of text to the next function ##
        xmlfeed($line);
    }
    close(FILE);
}

#### xmlfeed uses the xml file to get a list of the strings to replace
sub xmlfeed {
    my $linedata = shift @_;
    #print $linedata;

    # create object
    $xml = new XML::Simple();

    # read XML file
    $data = $xml->XMLin($xmlfile);

    # processes the xml file one sheet at a time
    # <Sheet1>
    #   <OriginPage></OriginPage>
    #   <LinkToPage></LinkToPage>
    #   <LinkToPageStatus></LinkToPageStatus>
    #   <LinkToPageTitle></LinkToPageTitle>
    #   <New_location></New_location>
    # </Sheet1>
    foreach $e (@{$data->{Sheet1}}) {
        #print $e->{Sheet1}, "\n";
        # print "LinkToPage: ", $e->{LinkToPage}, " \nNew page", $e->{New_location}, "\n";
        &match($linedata, $e->{LinkToPage}, $e->{New_location})
        #print "\n";
    }
}

#### matches the current line against the string in <LinkToPage></LinkToPage>
# performs a substitution on the stout.
sub match {
    my $in = shift @_;
    my $originalURL = shift @_;
    my $newURL = shift @_;
    #$in =~ s/$originalURL/$newURL/;
    #$in =~ s/$originalURL/$newURL/
    #print "$originalURL, $newURL\n" ;
    if ($in =~ m/$originalURL/) {
        $t++;
        print "Matched x " . "$t \n";
        print "$File::Find::name \n";
    }
}

Replies are listed 'Best First'.
Re: using xml and perl to perform a search and replace on html files
by wfsp (Abbot) on Mar 10, 2007 at 10:38 UTC
    I've managed to get this working...
    That, richill, I'd consider a result! Make a backup and lock it in the safe. :-)
    I'm worried that creating a new XML::Simple()xml object for every line is a bit wasteful...
    Perhaps consider parsing the XML once and storing the data in a hash?
    $xml_hash{$LinkToPage} = $New_location;
    Then look at every HTML file checking if any links are in your lookup table and make the change if necessary.
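
To make the "parse once" idea concrete, here is a minimal sketch. It assumes the parsed data has the shape the original script already relies on (an array of records under `Sheet1`); the link values are placeholders, and `new_location_for` is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;

# Assumed shape: what XMLin() handed back in the original script, i.e. an
# array of records under the 'Sheet1' key. The values are placeholders.
my $data = {
    Sheet1 => [
        { LinkToPage => 'link1.html', New_location => 'linka.html' },
        { LinkToPage => 'link2.html', New_location => 'linkb.html' },
    ],
};

# Build the lookup table once, before any HTML file is opened.
my %xml_hash;
for my $record (@{ $data->{Sheet1} }) {
    $xml_hash{ $record->{LinkToPage} } = $record->{New_location};
}

# Hypothetical helper: each link check is now a single hash lookup
# instead of a fresh parse of the XML file.
sub new_location_for {
    my ($href) = @_;
    return unless defined $href && exists $xml_hash{$href};
    return $xml_hash{$href};
}

print new_location_for('link1.html'), "\n";    # prints "linka.html"
```

In the real script `$data` would come from a single `XMLin($xmlfile)` call made before `find()` is started.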

    For changing the HTML I would consider a parser. There are many and the one I frequently use is HTML::TokeParser::Simple. Have a look and get back to us if you need a hand.

    update: added example of using a parser.

    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my %xml_hash = (
        'link1.html' => 'linka.html',
        'link2.html' => 'linkb.html',
    );

    my $html_file = 'links.html';
    my $p = HTML::TokeParser::Simple->new($html_file)
        or die "couldn't parse $html_file";

    my $new_html;
    while (my $t = $p->get_token){
        if ($t->is_start_tag('a')){
            my $href = $t->get_attr('href');
            # skip <a> tags without an href (e.g. named anchors)
            if (defined $href and exists $xml_hash{$href}){
                $t->set_attr('href', $xml_hash{$href});
            }
        }
        $new_html .= $t->as_is;
    }
    print "$new_html\n";
    input html:
    <html>
    <head>
    <title>links</title>
    </head>
    <body>
    <p>links</p>
    <a href="link1.html">link1</a>
    <a href="link2.html">link2</a>
    </body>
    </html>
    output:
    <html>
    <head>
    <title>links</title>
    </head>
    <body>
    <p>links</p>
    <a href="linka.html">link1</a>
    <a href="linkb.html">link2</a>
    </body>
    </html>
      I should have locked it up.

      Outputting the results to STDOUT was a breeze compared with making the changes to the text files.

      For some reason the first URL replacement listed in the XML file 'xmlfile/urlchange.xml' (below) is applied to all HTML files, and all other URL replacements are ignored.

      It's very frustrating because the output to STDOUT looks OK. After making the changes, why don't they save to file? I can't even begin to debug this thing.

      If you have the time to look:

      The XML file is placed at 'xmlfile/urlchange.xml' relative to the script.

      The values of the <OriginPage> elements refer to HTML files, and the paths are relative to the Perl script once the value of $dir is prefixed onto the string.

      my $dir + <OriginPage>/meeting/index.asp</OriginPage>
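
Going the other way (from the full path File::Find reports back to the `<OriginPage>` value) can be sketched as below. `origin_key` is a hypothetical helper name, and the example assumes `$File::Find::name` starts with exactly the string passed to `find()`.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a full path (as in $File::Find::name) into the
# key used by <OriginPage>, by removing the leading $dir.
sub origin_key {
    my ($dir, $full_path) = @_;
    # \Q...\E stops characters such as '.' or '+' in $dir from acting as
    # regex metacharacters; ^ anchors the match to the start of the path.
    (my $rel = $full_path) =~ s/^\Q$dir\E//;
    return $rel;
}

print origin_key('files/html', 'files/html/meeting/index.asp'), "\n";
# prints "/meeting/index.asp"
```

The same `\Q...\E` quoting is worth considering wherever a URL from the XML file is interpolated into a pattern, since URLs routinely contain `.` and `?`.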

      <?xml version='1.0'?>
      <urls>
        <Sheet1>
          <OriginPage>/meeting/index.asp</OriginPage>
          <LinkToPage>http://quicklinkurl1/index.asp</LinkToPage>
          <LinkToPageStatus>na</LinkToPageStatus>
          <LinkToPageTitle>na</LinkToPageTitle>
          <New_location>xxxxxxx</New_location>
        </Sheet1>
        <Sheet1>
          <OriginPage>/meeting/series/index.asp</OriginPage>
          <LinkToPage>http://quicklinkurl1/index.asp</LinkToPage>
          <LinkToPageStatus>na</LinkToPageStatus>
          <LinkToPageTitle>na</LinkToPageTitle>
          <New_location>new url</New_location>
        </Sheet1>
        <Sheet1>
          <OriginPage>/meeting/lunchtime-meeting/index.asp</OriginPage>
          <LinkToPage>http://quicklinkurl2/index.asp</LinkToPage>
          <LinkToPageStatus>na</LinkToPageStatus>
          <LinkToPageTitle>na</LinkToPageTitle>
          <New_location>changed url</New_location>
        </Sheet1>
        <Sheet1>
          <OriginPage>/meeting/lunchtime-meeting/index.asp</OriginPage>
          <LinkToPage>http://quicklinkurl2/index.asp</LinkToPage>
          <LinkToPageStatus>na</LinkToPageStatus>
          <LinkToPageTitle>na</LinkToPageTitle>
          <New_location>another changed url</New_location>
        </Sheet1>
      </urls>

      My perl code is here

        Hi richill!

        This is what I had in mind. Parse the XML file once, building a hash keyed on file name with subkeys giving the before and after links (see the dump of the hash below).

        The File::Find callback sub parses each html file that 'exists' in the lookup and changes the link as appropriate.

        The html is written to STDOUT. The file still needs to be written to its final destination.

        Hope that helps and good luck :-)

        #!/usr/local/bin/perl
        use strict;
        use warnings;
        use XML::Simple;
        use File::Find;
        use HTML::TokeParser::Simple;
        use Data::Dumper;

        my $dir     = "C:/perm/monks/files/html";
        my $xmlfile = "C:/perm/monks/xmlfile/urlchange.xml";

        my %lookup = get_lookup($xmlfile);

        find(\&wanted, $dir);

        sub wanted {
            my $file = $_;
            return unless -f $file;

            # path relative to $dir, as used in <OriginPage>
            my $rel = $File::Find::name;
            $rel =~ s/^\Q$dir\E//;
            return unless exists $lookup{$rel};

            my $p = HTML::TokeParser::Simple->new($file)
                or die "couldn't parse $file";

            my $new_html;
            while (my $t = $p->get_token){
                if ($t->is_start_tag('a')){
                    my $href = $t->get_attr('href');
                    # compare with eq; a single = would assign and always be true
                    if (defined $href and $lookup{$rel}{from} eq $href){
                        $t->set_attr('href', $lookup{$rel}{to});
                    }
                }
                $new_html .= $t->as_is;
            }
            if (1){
                print "$File::Find::name\n";
                print "$new_html\n";
                print '-' x 20, "\n";
            }
            # todo
            # write new_html
        }

        sub get_lookup{
            my ($xmlfile) = @_;
            my $xml  = new XML::Simple();
            my $data = $xml->XMLin($xmlfile);
            my %lookup;
            for my $sheet (keys %{$data}){
                my @records = @{$data->{$sheet}};
                for my $record (@records){
                    # a later record for the same page overwrites an earlier one
                    $lookup{$record->{OriginPage}} = {
                        from => $record->{LinkToPage},
                        to   => $record->{New_location},
                    }
                }
            }
            return %lookup;
        }
        Dump of %lookup:
        Sheet1
        $VAR1 = {
            '/meeting/series/index.asp' => {
                'to' => 'new url',
                'from' => 'http://quicklinkurl1/index.asp'
            },
            '/meeting/lunchtime-meeting/index.asp' => {
                'to' => 'another changed url',
                'from' => 'http://quicklinkurl2/index.asp'
            },
            '/meeting/index.asp' => {
                'to' => 'xxxxxxx',
                'from' => 'http://quicklinkurl1/index.asp'
            }
        };
        I created some dummy html files (using file names and links from the xml file) e.g.:
        <html>
        <head>
        <title>index.asp</title>
        </head>
        <body>
        <a href="http://quicklinkurl1/index.asp">link</a>
        </body>
        </html>
        output showing that the links have been changed:
        C:/perm/monks/files/html/meeting/index.asp
        <html>
        <head>
        <title>index.asp</title>
        </head>
        <body>
        <a href="xxxxxxx">link</a>
        </body>
        </html>
        --------------------
        C:/perm/monks/files/html/meeting/lunchtime-meeting/index.asp
        <html>
        <head>
        <title>index.asp</title>
        </head>
        <body>
        <a href="another changed url">link</a>
        </body>
        </html>
        --------------------
        C:/perm/monks/files/html/meeting/series/index.asp
        <html>
        <head>
        <title>index.asp</title>
        </head>
        <body>
        <a href="new url">link</a>
        </body>
        </html>
        --------------------
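
For the remaining "write new_html" step, one possible approach (a sketch, not the only way) is to write the new text to a temporary file in the same directory and then rename it over the original, so a failed write never leaves a half-written page. `write_html` is a hypothetical helper name; the demo runs on a scratch file rather than the real pages.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Temp qw(tempfile tempdir);

# Hypothetical helper: replace the contents of $path with $new_html.
sub write_html {
    my ($path, $new_html) = @_;
    # Create the temporary file next to the target so that rename()
    # stays on one filesystem.
    my ($fh, $tmp) = tempfile(DIR => dirname($path));
    print {$fh} $new_html or die "Cannot write $tmp: $!";
    close $fh             or die "Cannot close $tmp: $!";
    rename $tmp, $path    or die "Cannot rename $tmp to $path: $!";
    return;
}

# Demo on a scratch directory that is removed when the script exits.
my $scratch = tempdir(CLEANUP => 1);
my $page    = "$scratch/index.asp";
write_html($page, qq{<a href="new url">link</a>\n});
print "wrote $page\n";
```

Inside the File::Find callback this would be called as `write_html($file, $new_html)`. Note that File::Temp creates its files readable by the owner only, so the page may need a `chmod` afterwards, and it is still wise to keep a backup of the originals.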
        update:
        fixed some badly formed html in the dummy files.
        update 2:
        Note that the last two records in your XML refer to the same page and link but with a different 'new location'. Because the hash is keyed on the page, the fourth record overwrote the third, so the third record was effectively ignored.
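
If a page can carry more than one link to change, or if clashes like the one just described should be reported rather than silently overwritten, the lookup could be keyed on page and then on the old link. This is only a sketch: the records are copied from the posted XML, while keeping the first mapping on a clash is an arbitrary choice made for the example.

```perl
use strict;
use warnings;

# Records in the order they appear in the posted XML file.
my @records = (
    { OriginPage   => '/meeting/index.asp',
      LinkToPage   => 'http://quicklinkurl1/index.asp',
      New_location => 'xxxxxxx' },
    { OriginPage   => '/meeting/series/index.asp',
      LinkToPage   => 'http://quicklinkurl1/index.asp',
      New_location => 'new url' },
    { OriginPage   => '/meeting/lunchtime-meeting/index.asp',
      LinkToPage   => 'http://quicklinkurl2/index.asp',
      New_location => 'changed url' },
    { OriginPage   => '/meeting/lunchtime-meeting/index.asp',
      LinkToPage   => 'http://quicklinkurl2/index.asp',
      New_location => 'another changed url' },
);

my %lookup;     # page => { old link => new link }
my @clashes;    # human-readable notes about conflicting records
for my $r (@records) {
    my ($page, $from, $to) = @{$r}{qw(OriginPage LinkToPage New_location)};
    if (exists $lookup{$page}{$from}) {
        push @clashes, sprintf "%s: %s has both '%s' and '%s'",
            $page, $from, $lookup{$page}{$from}, $to;
        next;    # keep the first mapping (arbitrary choice for this sketch)
    }
    $lookup{$page}{$from} = $to;
}

print "$_\n" for @clashes;
```

In the parser loop the check would then become `exists $lookup{$rel}{$href}`, with `$lookup{$rel}{$href}` as the replacement.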