Using Tokeparser to replace image paths on retrieved remote html

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Mr and/or Miss Monks,

I'll try to get this right, I read the "How NOT to post" primer, I RTFM'd the manual, but it did no good because I have remarkably little experience with Modules and OOP, and most of what I read went right over my head. I have been working on this script (about 5 versions now) for 2 weeks and I'm not getting it any better then it was at the start, so I figured now was a good time to ask the guru's.

I apologize profusely for the length of this post and I'll try to keep it as short as I can..

I have a Script that I wrote, (hacked together from others bits of code and some of my own.). The job of this script is to get around a problem I have with SSL, my main server is all SSL, but the client servers at each end of this process are not. People keep bailing when they get a "you are about to leave a secure connection" message and as a result the logs and so forth are not completed.. (since they are done by the clients scripts and the direction to them is what causes the "you are about to leave a secure connection" message and that is causing people to say no to continuing...

So I wrote this script to retrieve a copy of the client servers scripts output, save it locally on my SSL server and display it from there,, since my server is fetching the clients scripts response, and passing along a query string on the URL it is fetching, it is triggering the client script to write the logs and send emails etc... but since the HTML displayed in the SSL is a local copy my server retrieved, it doesn't display the "you are about to leave a secure connection" message. does that make sense?

The script in question, retrieves the client scripts HTML output, parses it for image links, saves the HTML and any images in it locally to my server, then parses the HTML and changes any image paths to the location of the locally downloaded versions, the result being that the entire page can then be displayed entirely locally on my server with all of it served under my SSL cert. and while it is parsing the HTML, it removes any references to embed, applet and object tags (because I don't want any sus HTML saved to my server)

Anyway, due to my Perl ineptitude, the current script (I call it "PageGrabber.cgi") is a total hack, and it is limited in the HTML it works with, currently it can only pick up image tags in a standard format (I'm using regex), it can only pick up tags that are all on one line.. and it has numerous shortcomings as well.. the main one being that its very slow.

Anyway, I started working on a new version for those reasons, to see if I could make a script that uses parsing modules to achieve much of the work.

anyway, I am using LWP::Simple to retrieve the page and I save it to file:

######### The Page you want the script to fetch.######
my $return_URL = "http://123.123.123.123/cgi-bin/index.pl";
 # The location of the file the HTML is saved as.
my $file = 'D:/Inetpub/Scripts/fetchtest/images/url.html';
 # The URL to the local directory the images are saved in.
my $images_url = 'https://my-secure-domain/cgi-bin/fetchtest/images/';
 # The local server path to the directory the images are saved in.
my $images_path = 'D:/Inetpub/Scripts/fetchtest/images';
 # The local server path to the directory this script is in.
my $cgi_path = 'D:/Inetpub/Scripts/fetchtest';
 # The Local URL of the .HTML file.
my $html_file = 'http://192.168.0.4/cgi-bin/fetchtest/images/url.html'
+;
use strict;
use LWP::Simple;
use HTML::TokeParser;
use URI;
my $query_string = $ENV{"QUERY_STRING"};
my $return_query = "$return_URL?$query_string";

$return_query = "http://$return_query" unless $return_query =~ m{^http
+://};
my $url = URI->new($return_query);

# Get requested page and save it locally
#print "Retrieving $return_query...\n";
my $html = get($return_query) || '';

open(OUTPUT_FILE, ">$file") || die "Unable to open $file: $!";
flock(OUTPUT_FILE,2) or die "cannot lock file: $!";
print OUTPUT_FILE $html;
close(OUTPUT_FILE);
[download]

Then I am parsing the html file, and looking for the images, and grabbing them, to be saved locally. (if they don't already exist).

# Parse the Web page to identify images
my %imagefiles;                                               
my $parser = HTML::TokeParser->new(\$html);
while (my $img_info = $parser->get_tag('img')) {              
  my $image_name = $img_info->[1]->{'src'};                  
  my $image_url = URI->new($image_name);
  my $image_file = $image_url->abs($url);                   
  $imagefiles{$image_file} = 1;                              
}
  

# Retrieve all the images and save them locally
  foreach my $this_image (keys %imagefiles) {                
  $this_image =~ m{.*/(.*)$};                                 
  my $local_image_name = $1;                                  
  $local_image_name =~ tr/A-Za-z0-9./_/c;                    
  my $local_image_path_name = "$images_path/$local_image_name"; 

  my $image_data = get($this_image) || '';                   
  
     # Get the images after checking if they don't already exist.
      my $local_img = "$images_path/$local_image_name"; 
     unless(-e $local_image_path_name)
     { 
       
        my $image_data = get($this_image) || '';
        # Save copy of image locally
      open(OUTPUT_FILE, ">$local_image_path_name") || die "Unable to o
+pen $local_image_name: $!";
      binmode(OUTPUT_FILE);
      print OUTPUT_FILE $image_data;
      close(OUTPUT_FILE);
      #print "saved: $local_image_path_name<br>\n";
     }# end of if image exists.
     #print "local img is: $local_img<br>/n";
}
[download]

So now I have the HTML saved to a file, and all the images are saved locally as well. Now this next bit is one of the main bits I am concerned with. I open the file again, this time slurping the whole file into one line of html and text (removing the newlines). and in this file open, I strip out the bad tags, and use regex to change relative image links to my new local absolute links. (https) problem is that I can't figure out how to change the image links that are already absolute, but not pointing to my local https copies.... the regex I have tried never seem to work. Here is the code I am referring to.

open FILE, "$file"
  or die "Can't open receipt page html file for display $!\n";
  flock(FILE,2) or die "cannot lock file: $!";
  # now make all image paths local and absolute.
  undef $/;    # enable "slurp" mode
  my $line = <FILE>;    # whole file now here
  $line =~ s/\n/ /g;
  
  #Strip out potentially nasty stuff.
  $line =~ s/<embed(.*)?<\/embed>/<!-- removed embed \/\/-->/sig;
  $line =~ s/<applet(.*)?<\/applet>/<!-- removed applet \/\/-->/sig;
                $line =~ s/<object(.*)?<\/object>/<!-- removed object 
+\/\/-->/sig;
 
 $line =~ s/
       (<\s*(?:a|img|area)\b[^>]*(?:href|src)\s*=\s*
       ['"]?)
       ([^'"> ]+)
       (['"]?
       [^>]*>)
       /
       $1.sprintf("%s",URI->new($2)->abs($images_url)).$3
       /segix;
       # now print the lot to the browser.
   print "Content-type: text/html\n\n";
  print $line;
close (FILE);
[download]

Now I started thinking to myself that I was on the right track using LWP to get the page and the images, but I'm again using risky regex to change the image links over.. its not a good idea to do it this way I think.

So I started searching the web for hours and hours and settled on HTML::TokeParser; to read the file parse out the image tags and replace them, but no amount of reading the POD documentation, and the ton of stuff on the net could teach this newbie how to use tokeparser to do this.. (as I said at the start, I'm a newbie and my experiance with OOP is pretty much what I learned when I taught myself Javascript, which is to say not alot of experiance :-) I've only been writting Perl code for about one and a half years, with no other programming experiance other then teaching myself BASIC when I was 12 on a Commodore VIC20.

so anyway, then I thought if I could get the image urls into an array, I could use regex or file::basename to split the image name from the path.. and use a foreach loop on the array to loop on each image path, and replace that path with the new https image path. something like this:

my $https_img_path = 'https://123.123.123.255/images';
foreach my $image_url (@images){

# Split image from original path (relative or absolute)
my ($path, $image) = split.......

# Search the html ($file) for the images and replace them with my new 
+paths.
$file =~ s/$image_url/$https_img_path\/$image/ig;
} # end of foreach.
[download]

I am fairly sure with some playing around, I could get that to work.. but its iffy, slow and I don't know how to get the image paths into the array with LWP or tokeparser.. and I am sure that this script shouldn't be anywhere near as long as it is, and its opening the file too many times, (it opens it again after that last one to empty the file so its blank when not in use. (for safety.) opening the html file 3 times is probably the main reason this script is so slow to execute.. (that and its currently using Carp, strict, taint, warnings and diagnostics to help me along.)

So I beg the monks indulgence by asking for advice here on how I might accomplish this task in a manner that doesn't use regex if its avoidable.. in other words, I am hoping someone will give me some tips on how to use Tokeparser to swap all image paths to my local https path but keep the image names the same. (and can it also handle background images and stuff?) and possible to reduce the number of file opens.

One last question, can I make LWP pretend its IE? just in case the clients script is using some form of browser detection??? I doubt any browser detection script currently covers LWP::Simple.

Again I am sorry for the length of this post, particularly since its my first post to the monks, and I'd also appreciate any tips on improving my coding style.. Here is the full scripts code. (incomplete in that it doesn't change over pre existing absolute image paths to the new https path.)

(Incidently, the test server and the real one are both win2000 server SP2 with IIS5 and Activeperl 5.6. and I am using the file extension .plt which I setup to run as taint mode.

I am incrediably grateful to anyone that can help me in the right direction here, or improve my coding style (or lack thereof) or offer any other constructive critisim that results in my learning something.

kindest regards

Franki

The full code:


#!/usr/bin/perl 

#########################################################
####              CONFIGURATION PARAMETERS      ####
#########################################################
######### The Page you want the script to fetch.
my $return_URL = "http://203.59.39.226/crabbait/index.html";
#my $return_URL = "http://203.59.39.226/cgi-bin/test.pl";
 # The location of the file the html is saved as.
my $file = 'D:/Inetpub/Scripts/fetchtest/images/url.html';
 # The url to the local directory the images are saved in.
my $images_url = 'http://192.168.0.4/cgi-bin/fetchtest/images/';
 # The local server path to the directory the images are saved in.
my $images_path = 'D:/Inetpub/Scripts/fetchtest/images';
 # The local server path to the directory this script is in.
my $cgi_path = 'D:/Inetpub/Scripts/fetchtest';
 # The Local URL of the .html file.
my $html_file = 'http://192.168.0.4/cgi-bin/fetchtest/images/url.html'
+;
#########################################################
use strict;
use warnings;
use diagnostics;
use LWP::Simple;
use HTML::TokeParser;
use URI;
my $query_string = $ENV{"QUERY_STRING"};
my $return_query = "$return_URL?$query_string";

$return_query = "http://$return_query" unless $return_query =~ m{^http
+://};
my $url = URI->new($return_query);

# Get requested page and save it locally
my $html = get($return_query) || '';

open(OUTPUT_FILE, ">$file") || die "Unable to open $file: $!";
flock(OUTPUT_FILE,2) or die "cannot lock file: $!";
print OUTPUT_FILE $html;
close(OUTPUT_FILE);

# Parse the Web page to identify images
my %imagefiles;                                               
my $parser = HTML::TokeParser->new(\$html);
while (my $img_info = $parser->get_tag('img')) {              
  my $image_name = $img_info->[1]->{'src'};                  
  my $image_url = URI->new($image_name);
  my $image_file = $image_url->abs($url);                   
  $imagefiles{$image_file} = 1;                              
}
  

# Retrieve all the images and save them locally
  foreach my $this_image (keys %imagefiles) {                
  $this_image =~ m{.*/(.*)$};                                 
  my $local_image_name = $1;                                  
  $local_image_name =~ tr/A-Za-z0-9./_/c;                    
  my $local_image_path_name = "$images_path/$local_image_name"; 

  my $image_data = get($this_image) || '';                   
  
    # Get the images after checking if they don't already exist.
    my $local_img = "$images_path/$local_image_name"; 
    unless(-e $local_image_path_name)
     { 
     # Retrieve the page  
     my $image_data = get($this_image) || ''; 
     # Save copy of image locally
   open(OUTPUT_FILE, ">$local_image_path_name") || die "Unable to open
+ $local_image_name: $!";
   binmode(OUTPUT_FILE);
   print OUTPUT_FILE $image_data;
   close(OUTPUT_FILE);
   #print "saved: $local_image_path_name<br>\n";
     }# end of if image exists.
     #print "local img is: $local_img<br>/n";
}

open FILE, "$file"
  or die "Can't open receipt page html file for display $!\n";
  flock(FILE,2) or die "cannot lock file: $!";
  # now make all image paths local and absolute.
  undef $/;    # enable "slurp" mode
  my $line = <FILE>;    # whole file now here
  $line =~ s/\n/ /g;
  
  #Strip out potentially nasty stuff.
  $line =~ s/<embed(.*)?<\/embed>/<!-- removed embed \/\/-->/sig;
  $line =~ s/<applet(.*)?<\/applet>/<!-- removed applet \/\/-->/sig;
                $line =~ s/<object(.*)?<\/object>/<!-- removed object 
+\/\/-->/sig;
 
 $line =~ s/
       (<\s*(?:a|img|area)\b[^>]*(?:href|src)\s*=\s*
       ['"]?)
       ([^'"> ]+)
       (['"]?
       [^>]*>)
       /
       $1.sprintf("%s",URI->new($2)->abs($images_url)).$3
       /segix;
   print "Content-type: text/html\n\n";
  print $line;
close (FILE);

      # This empties the url file after the display is successful
      # Since the final read doesn't alter the actual file, its 
      # important to clear out the file after use so it can't
      # be accessed directly, instead of only via the script.
      open (FILE, ">$file");    
      flock(FILE,2) or die "cannot lock file: $!";
      close (FILE);
exit(0);
[download]

edited: Wed Aug 14 19:52:08 2002 by jeffa - added readmore tag

Comment on Using Tokeparser to replace image paths on retrieved remote html Select or Download Code