Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Is there a way to read through about 900 pages of .html files and change every link on each page from link.html to link.asp? I was thinking that if I looked for "a href", then chopped off the .html and added .asp, it would work. I'm just not sure how to look for the "a href". Thanks.

Replies are listed 'Best First'.
(Ovid - parsing HTML) Re: Changing .html to .asp
by Ovid (Cardinal) on Apr 27, 2001 at 22:31 UTC
    Definitely use some form of parser. Do not use a regex for this. Here's some sample HTML from a document that is routinely sent to us:
    <A HREF=" 010410zed2.pdf " target="_blank"><B><font size="-2" face="arial">view cutting </b></A>
    Guess what? Browsers have no problem with that. It's annoying as heck, but it works. I found it difficult to follow even reading it, much less trying to write a regex that would parse stuff like that accurately. Play it safe. Use a parser.
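
    To make the point concrete, here is a minimal sketch run against the sample line above. Both patterns are hypothetical stand-ins, not anything from the original document: the first is the kind of regex one might write on a first attempt, the second is a looser one shown only for contrast. Even the looser one still assumes double-quoted attribute values and an href that comes first in the tag, which is exactly the sort of assumption a parser saves you from.

```perl
use strict;
use warnings;

# The sample line from the post above.
my $sample = '<A HREF=" 010410zed2.pdf " target="_blank"><B><font size="-2" face="arial">view cutting </b></A>';

# Hypothetical first attempt: lowercase tag, no padding inside the quotes.
my $naive = qr/<a href="(\S+?)"/;

# Hypothetical looser pattern: ignores case and tolerates padding.
my $looser = qr/<a\s+href\s*=\s*"\s*([^"]*?)\s*"/i;

my $naive_hit  = $sample =~ $naive  ? $1 : undef;
my $looser_hit = $sample =~ $looser ? $1 : undef;

print "naive:  ", ( defined $naive_hit  ? $naive_hit  : "(no match)" ), "\n";
print "looser: ", ( defined $looser_hit ? $looser_hit : "(no match)" ), "\n";
# prints:
# naive:  (no match)
# looser: 010410zed2.pdf
```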

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: Changing .html to .asp
by little (Curate) on Apr 27, 2001 at 22:04 UTC
    Use HTML::TokeParser: get all the links, get each link's href attribute, check whether that string contains htm, html or shtml, and change that to the appropriate extension.
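
    A minimal sketch of that approach, assuming HTML::TokeParser (from the HTML-Parser distribution on CPAN) is installed. The helper name rewrite_links and the sample page are made up for illustration. The parser is used to find the anchor start tags, and the substitution is then applied only inside each such tag, so text elsewhere on the page that happens to mention a .html file should be left alone. Note that it rewrites every matching href, including links to other sites, so filter on the URL if that matters.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # CPAN module, part of the HTML-Parser distribution

# Hypothetical helper: returns a copy of $html with link extensions changed.
sub rewrite_links {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html ) or die "Cannot parse input\n";
    my $out = '';

    while ( my $token = $p->get_token ) {
        my $type = $token->[0];

        # The original source text is the last element of every token
        # except text tokens ("T"), where it is element 1.
        my $text = $type eq 'T' ? $token->[1] : $token->[-1];

        # Start tag of an anchor that carries an href attribute.
        # Tag and attribute names are lowercased by the parser.
        if ( $type eq 'S' && $token->[1] eq 'a' && defined $token->[2]{href} ) {
            # Edit only inside this one tag.
            $text =~ s/(\bhref\s*=\s*["']?[^"'>]*?)\.s?html?(?=[\s"'>#?])/$1.asp/i;
        }
        $out .= $text;
    }
    return $out;
}

my $page = qq{<p>See index.html</p>\n<A HREF=" page.html " target="_blank">next</A>\n};
print rewrite_links($page);
```

    To run it over real files you would read each file into a string, pass it through the helper, and write the result to a new file rather than over the original.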

    Have a nice day
    All decision is left to your taste
Re: Changing .html to .asp
by snafu (Chaplain) on Apr 27, 2001 at 22:28 UTC
    Well, this could range from an extremely simple task (in Perl or shell) to a very difficult one, depending on just how complex your pages are. At any rate, before doing anything you should make a copy of all the files and tar/gzip them up somewhere in case something goes wrong. So, without further ado, and assuming you did tar/gzip, I will start with shell since that is simplest for me at this point in my Perl knowledge (or lack thereof), but I will try to put a Perl version down too. Also note that I am assuming the simplest form of your situation, meaning that there is nothing odd about your href tags and that they are pretty standard. I am also assuming (lots of assumptions going on, eh? :) that you will be working in the same directory where these files reside, with no sub-directories, and that they are all there. I am assuming they are all called *.html or *.asp.

    user@yourbox /home/httpd/html ] $ ls *.html *.asp | while read file; do newfile="$file.fixed"; sed 's/\.html/.asp/g' "$file" > "$newfile"; done

    This little one-liner will change EVERYTHING that is .html to .asp in every $file and create a new file called $file.fixed. In theory, this means you will not have lost the originals and you can check them against the new files. Changing every occurrence might be undesirable, but since I don't know the specifics it is the best I can do at this point with a shell one-liner.

    Now for Perl. (I am fully open to any assistance as I am still learning Perl.)

    #!/usr/local/bin/perl -w
    use strict;
    use cwd;

    # open our current directory
    opendir(DD,".") or die("Cannot open ". getcwd ."\n");

    # read our files getting only files named *.html
    # and *.asp
    my @FILES = grep(/(\.html|\.asp),readdir(FD)) or die ("Could not get directory listing: $!\n");

    # close our directory descriptor
    close(DD);

    # walk our array that has the file names in it we should go through.
    while ( @FILES ) {
        my $file = $_ ;
        my $newfile = "$file.fixed";

        # open two descriptors
        # FD for our original files
        open(FD,$file) or die("Cannot open $file\n";
        # NEW for our newly created (hopefully fixed) files.
        open(NEW,">>$newfile") or die ("Cannot open $newfile for writing\n");

        # walk each file
        while ( <FD> ) {
            chomp;
            # replace anything in a line with <a href in it
            # that has .html to .asp
            if ( grep(/<a href/) ) {
                my $newline =~ s/\.html/\.asp/;
                # append that to the new file.
                print NEW "$newline\n";
            } else {
                # otherwise just append the old line into
                # the file.
                print NEW "$_\n";
            }
        }
        # Close the descriptor for these files before
        # the next loop iteration
        close(NEW);
        close(FD);
    }

    This is untested code as I just wrote it here in this window. So go through it first. I think the perl one would be way more thorough than the shell line. Anyway, I was rushed too. I gotta get back to work. If something is wrong I apologize. Good luck.

    ----------
    - Jim

      Snafu, thanks a lot for your help, I think you are right on track. The only problem is that when I run the code I get an error saying that there are not enough arguments for grep on the line where grep is used. I have very limited knowledge of Perl and would appreciate any suggestions you have to fix this. All of your assumptions you made were correct. Thanks
        Here is the fixed copy with some print()'s in it. I tested this copy so it should work. There were a few problems with the code. The grep was missing a terminating parenthesis. It (the script) was also missing a few other things. Hope this helps.

        #!/usr/local/bin/perl -w
        # written By Jim Conner 4-27-2001
        use strict;
        use Cwd;

        $| = 1;
        my $changed = 0;
        my $fcount  = 0;
        my $lcount  = 0;

        # open our current directory
        opendir(DD, ".") or die("Cannot open " . getcwd() . ": $!\n");

        # read our files getting only files named *.html
        # and *.asp (anchored, so earlier *-fixed output is skipped)
        my @FILES = grep(/\.(html|asp)$/, readdir(DD));

        if ( @FILES == 0 ) {
            print "Sorry, there were no files in " . cwd() . " that matched *.html or *.asp\n";
            exit(1);
        } else {
            print "There are " . scalar(@FILES) . " files in " . cwd() . " that match *.html or *.asp\n";
            for ( @FILES ) {
                print "$_\n";
            }
            print "Processing continuing...\n\n";
        }

        # close our directory descriptor
        closedir(DD);

        # walk our array that has the file names in it we should go through.
        for my $file ( @FILES ) {
            print "Processing file: $file\n";
            my $newfile = "$file-fixed";
            print "Output file will be: $newfile\n";

            # open two descriptors
            # FD for our original files
            open(FD, "<", $file) or die("Cannot open $file: $!\n");
            # NEW for our newly created (hopefully fixed) files.
            # Opened with ">" so a second run overwrites instead of appending.
            open(NEW, ">", $newfile) or die("Cannot open $newfile for writing: $!\n");

            # walk each file
            while ( my $line = <FD> ) {
                chomp $line;
                # replace anything in a line with <a href in it
                # that has .html to .asp
                if ( $line =~ /a href.*\.html/ ) {
                    #print "Found a match -> $line <-...converting\n";
                    ( my $newline = $line ) =~ s/\.html/.asp/g;
                    #print "Substitution written was: [[ $newline ]]\n";
                    $changed = 1;
                    $lcount++;
                    # append that to the new file.
                    print NEW "$newline\n";
                } else {
                    # otherwise just append the old line into
                    # the file (the chomped copy, to avoid doubled newlines).
                    print NEW "$line\n";
                }
            }

            # Close the descriptor for these files before
            # the next loop iteration
            close(NEW);
            close(FD);
            print "Finished with $file...\n\n";
            if ( $changed ) {
                $fcount++;
                $changed = 0;
            }
        }

        print "There were " . scalar(@FILES) . " total files\n";
        print "There were $fcount files affected\n";
        print "There were $lcount lines changed\n";
        print "Have a nice day :) \n\n";

        Bear in mind that this only matches "a href" in lower case. If you have some anchor tags written in caps (A HREF), this won't catch them without some tweaking.

        ----------
        - Jim

Re: Changing .html to .asp
by THRAK (Monk) on Apr 27, 2001 at 22:08 UTC
    If all you need/want to do is change every HREF from a ".html" to an ".asp" extension, that should be fairly easy. Build a list of all of the files you need to change and then read that list. For each entry, open the file and parse it line by line looking for HREFs. When you find an HREF, do a s/\.html/.asp/ on that line. Depending on how clean your HTML is, this may or may not work, as there could be multiple .html references on each line if it's not laid out well. If it's clean, this sort of thing should work. If you have ugly HTML that is not formatted well, you will have to refine things to make sure you only match what you want to change.

    -THRAK
    www.polarlava.com
      THRAK, just to avoid the risk of changing the document's content as well, better make sure in your substitution that you don't match a > in between HREF and .(s|p)?htm(l)?, and ignore the case. :-)
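
      A rough sketch of such a substitution, using core Perl only. The helper name fix_line and the sample lines are made up for illustration. The character class after href= excludes quotes, whitespace and >, so the match cannot run past the end of the attribute value into the page text, and the /i flag handles HREF as well as href. It is still a regex, so oddly written tags can defeat it, and it rewrites any href attribute, not just those on anchor tags.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite link extensions on one line of HTML.
sub fix_line {
    my ($line) = @_;
    $line =~ s{
        ( \bhref \s* = \s* ["']? \s* [^"'>\s]*? )  # $1: href= and the URL up to the extension
        \. [sp]? html?                             # .htm, .html, .shtml, .phtml
        (?= ["'\s>\#?] )                           # the extension must end the file name
    }{$1.asp}gix;
    return $line;
}

my @samples = (
    '<A HREF="page.HTML">x</A>',
    '<a href="a.html">one</a> <a href="b.shtml#top">two</a>',
    '<a href="x.pdf">see y.html</a>',
    'See notes.html for details',
);
print fix_line($_), "\n" for @samples;
# prints:
# <A HREF="page.asp">x</A>
# <a href="a.asp">one</a> <a href="b.asp#top">two</a>
# <a href="x.pdf">see y.html</a>
# See notes.html for details
```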

      Have a nice day
      All decision is left to your taste