Re^3: using xml and perl to perform a search and replace on html files

Hi richill!

This is what I had in mind. Parse the xml file once, building a hash keyed on file name, with sub keys giving the before and after links (see the dump of the hash below).

The File::Find callback sub parses each html file that 'exists' in the lookup and changes the link as appropriate.

The html is written to STDOUT. The file still needs to be written to its final destination.

Hope that helps and good luck :-)

#!/usr/local/bin/perl

use strict;
use warnings;

use XML::Simple;
use File::Find;
use HTML::TokeParser::Simple;
use Data::Dumper;

my $dir = "C:/perm/monks/files/html";

my $xmlfile = "C:/perm/monks/xmlfile/urlchange.xml"; 
my %lookup = get_lookup($xmlfile);

find(\&wanted, $dir);

sub wanted {

  my $file = $_;
  return unless -f $file;
    
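  # strip the root dir to get the site-relative path used as the lookup key
  # (anchored and quoted so regex metachars in $dir are taken literally)
  my $rel = $File::Find::name;
  $rel =~ s/^\Q$dir\E//;
  return unless exists $lookup{$rel};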
  
  my $p = HTML::TokeParser::Simple->new($file)
    or die "couldn't parse $file";
  
  my $new_html = '';
  while (my $t = $p->get_token){
    if ($t->is_start_tag('a')){
      my $href = $t->get_attr('href');
      # only rewrite links whose href matches the 'from' url for this page
      if (defined $href and $href eq $lookup{$rel}{from}){
        $t->set_attr('href', $lookup{$rel}{to});
      }
    }
    $new_html .= $t->as_is;
  }
  
  if (1){
    print "$File::Find::name\n";
    print "$new_html\n";
    print '-' x 20, "\n";
  }
  # todo
  # write new_html
  
}    

sub get_lookup{
  
  my ($xmlfile) = @_;
  my $xml = XML::Simple->new;
  my $data = $xml->XMLin($xmlfile);
  
  my %lookup;
  for my $sheet (keys %{$data}){
    my @records = @{$data->{$sheet}};
    for my $record (@records){
      $lookup{$record->{OriginPage}} = {
        from => $record->{LinkToPage},
        to   => $record->{New_location},
      }
    }
  }
  
  return %lookup;
}
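The script leaves writing the changed html as a todo. Here is a minimal sketch of one way to do it, using only core modules. `write_html` is a hypothetical helper, not part of the original script. It assumes `$html` holds the content exactly as it should appear on disk, and that replacing the file in place is acceptable (keep a backup of the html tree first). The temp file is created with File::Temp's default permissions, so the replaced file may end up with different permissions than the original.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Temp qw(tempfile);

# Hypothetical helper: replace the file at $path with $html.
# The content goes to a temp file in the same directory first and is then
# renamed over the original, so a failed write does not leave a half-written page.
sub write_html {
    my ($path, $html) = @_;

    my ($fh, $tmp) = tempfile(DIR => dirname($path), UNLINK => 0);
    binmode $fh;    # no newline translation on output
    print {$fh} $html or die "couldn't write $tmp: $!";
    close $fh         or die "couldn't close $tmp: $!";

    rename $tmp, $path or die "couldn't rename $tmp to $path: $!";
    return;
}
```

Inside the `wanted` sub this would be called as `write_html($file, $new_html)` in place of the todo comment. File::Find has already changed into the file's directory at that point, so the bare file name is enough.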

Dump of %lookup:

Sheet1
$VAR1 = {
  '/meeting/series/index.asp' => {
   'to' => 'new url',
   'from' => 'http://quicklinkurl1/index.asp'
  },
  '/meeting/lunchtime-meeting/index.asp' => {
    'to' => 'another changed url',
    'from' => 'http://quicklinkurl2/index.asp'
  },
  '/meeting/index.asp' => {
    'to' => 'xxxxxxx',
    'from' => 'http://quicklinkurl1/index.asp'
  }
};

I created some dummy html files (using file names and links from the xml file) e.g.:

<html>
<head>
<title>index.asp</title>
</head>
<body>
<a href="http://quicklinkurl1/index.asp">link</a>
</body>
</html>

Output showing that the links have been changed:

C:/perm/monks/files/html/meeting/index.asp
<html>
<head>
<title>index.asp</title>
</head>
<body>
<a href="xxxxxxx">link</a>
</body>
</html>
--------------------
C:/perm/monks/files/html/meeting/lunchtime-meeting/index.asp
<html>
<head>
<title>index.asp</title>
</head>
<body>
<a href="another changed url">link</a>
</body>
</html>
--------------------
C:/perm/monks/files/html/meeting/series/index.asp
<html>
<head>
<title>index.asp</title>
</head>
<body>
<a href="new url">link</a>
</body>
</html>
--------------------

update:
fixed some badly formed html in the dummy files.
update 2:
Note that the last two records in your xml refer to the same page and link but with a different 'new location'. The third record was therefore ignored.
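If you would rather have clashing records reported than silently dropped, a sketch along these lines might help. It is not what the script above does: `build_lookup` is a hypothetical helper, the records are made-up values using the same field names, and the lookup is keyed page => old link => new link so that one page can carry several link changes. The first record for a given page and link wins; later ones with a different new location are collected as conflicts.

```perl
use strict;
use warnings;

# Made-up records; only the field names come from the script above.
my @records = (
    { OriginPage => '/a.asp', LinkToPage => 'http://old1/', New_location => 'new1' },
    { OriginPage => '/b.asp', LinkToPage => 'http://old2/', New_location => 'new2' },
    { OriginPage => '/b.asp', LinkToPage => 'http://old2/', New_location => 'new3' },
    { OriginPage => '/b.asp', LinkToPage => 'http://old4/', New_location => 'new4' },
);

# Hypothetical helper: returns (\%lookup, \@conflicts), where
# %lookup maps page => { old link => new link }.
sub build_lookup {
    my @recs = @_;
    my (%lookup, @conflicts);
    for my $r (@recs) {
        my ($page, $from, $to) = @{$r}{qw(OriginPage LinkToPage New_location)};
        # same page and link seen before with a different target: keep the first
        if (exists $lookup{$page}{$from} and $lookup{$page}{$from} ne $to) {
            push @conflicts, "$page: $from -> $to (kept $lookup{$page}{$from})";
            next;
        }
        $lookup{$page}{$from} = $to;
    }
    return (\%lookup, \@conflicts);
}

my ($lookup, $conflicts) = build_lookup(@records);
warn "conflicting record ignored: $_\n" for @$conflicts;
```

With a lookup shaped like this, the test in `wanted` would become `exists $lookup{$rel}{$href}` rather than a comparison against a single 'from' value.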

Re^4: using xml and perl to perform a search and replace on html files
by richill (Monk) on Mar 11, 2007 at 16:33 UTC

Thanks very much WFSP. It looks more elegant than my approach and it's easier to read. Are there other benefits?

At the time of writing I believed that:

TokeParser is more useful for parsing html because it's more robust than a simple regex. However, there is a trade-off in speed.

I was not sure if %lookup would be needed, because XML::Simple sticks everything in an anon. hash anyway. I could use that as the look-up.

But your approach was different and worked, so what was I wrong about?

Thanks again for your help.
