in reply to Changing .html to .asp

Well, this could range from an extremely simple task (in Perl or shell) to a very difficult one, depending on just how complex your pages are. At any rate, before doing anything you should make a copy of all the files and tar/gzip them up somewhere in case something goes wrong. So, without further ado, and assuming you did tar/gzip, I will start with shell, since that is simplest for me at this stage of my Perl knowledge (or lack thereof... but I will try to put a Perl one down too, which could be just a one-liner). Also note that I am assuming the simplest form of your situation, meaning that there is nothing odd about your href tags and that they are pretty standard. I am also assuming (lots of assumptions going on, eh? :) that you will be working in a single directory with no sub-directories, that all of the files reside there, and that they are all called *.html or *.asp.

user@yourbox /home/httpd/html ] $ ls *.html *.asp | while read file; do newfile="$file.fixed"; sed 's/\.html/.asp/g' "$file" > "$newfile"; done

This lil one-liner will change EVERYTHING that is .html to .asp in every $file and create a new file called $file.fixed. Theoretically, this means you will not have lost the originals and you can check the originals against the new files. Changing every occurrence might be undesirable, but since I don't know the specifics it is the best I can do at this point with a shell one-liner.
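For comparison, the same blanket replacement can be done from Perl. The following is a minimal sketch, not code from this thread: rewrite_in_place is a hypothetical helper name, and unlike the shell line it edits each file in place and keeps the original as file.bak (using Perl's $^I variable) instead of writing $file.fixed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption, not from the thread):
# replaces every ".html" with ".asp" in the given files, in place,
# leaving the untouched original next to each one as "<name>.bak".
sub rewrite_in_place {
    my @files = @_;
    return unless @files;      # with an empty @ARGV, <> would read STDIN

    local $^I   = '.bak';      # turn on in-place editing with a backup suffix
    local @ARGV = @files;      # <> iterates over the files named in @ARGV

    while (<>) {
        s/\.html/.asp/g;       # every ".html" on the line, not just the first
        print;                 # output goes to the file being rewritten
    }
}

rewrite_in_place(@ARGV);
```

Saved under a hypothetical name such as fixlinks.pl, it would be run as "perl fixlinks.pl *.html *.asp". The command-line form of the same idea is perl -pi.bak -e 's/\.html/.asp/g' *.html *.asp. Like the sed version, it changes every .html string in the file, not only the ones inside links.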

Now for Perl. (I am fully open to any assistance, as I am still learning Perl.)

#!/usr/local/bin/perl -w

use strict;
use cwd;

# open our current directory
opendir(DD,".") or die("Cannot open ". getcwd ."\n");

# read our files getting only files named *.html
# and *.asp
my @FILES = grep(/(\.html|\.asp),readdir(FD)) or die ("Could not get directory listing: $!\n");

# close our directory descriptor
close(DD);

# walk our array that has the file names in it we should go through.
while ( @FILES )
{
    my $file = $_ ;
    my $newfile = "$file.fixed";

    # open two descriptors
    # FD for our original files
    open(FD,$file) or die("Cannot open $file\n";
    # NEW for our newly created (hopefully fixed) files.
    open(NEW,">>$newfile") or die ("Cannot open $newfile for writing\n");

    # walk each file
    while ( <FD> )
    {
        chomp;
        # replace anything in a line with <a href in it
        # that has .html to .asp
        if ( grep(/<a href/) )
        {
            my $newline =~ s/\.html/\.asp/;
            # append that to the new file.
            print NEW "$newline\n";
        }
        else
        {
            # otherwise just append the old line into
            # the file.
            print NEW "$_\n";
        }
    }
    # Close the descriptor for these files before
    # the next loop iteration
    close(NEW);
    close(FD);
}

This is untested code, as I just wrote it here in this window, so go through it first. I think the Perl one would be way more thorough than the shell line. Anyway, I was rushed too; I gotta get back to work. If something is wrong, I apologize. Good luck.

----------
- Jim

Re: Re: Changing .html to .asp
by Anonymous Monk on Apr 27, 2001 at 22:58 UTC
    Snafu, thanks a lot for your help, I think you are right on track. The only problem is that when I run the code I get an error saying that there are not enough arguments for grep on the line where grep is used. I have very limited knowledge of Perl and would appreciate any suggestions you have to fix this. All of the assumptions you made were correct. Thanks
      Here is the fixed copy with some print()'s in it. I tested this copy so it should work. There were a few problems with the code. The grep was missing a terminating parenthesis. It (the script) was also missing a few other things. Hope this helps.

      #!/usr/local/bin/perl -w
      # written By Jim Conner 4-27-2001

      use strict;
      use vars qw(@FILES);
      use Cwd;

      $| = 1;

      my $changed = 0;
      my $fcount  = 0;
      my $lcount  = 0;

      # open our current directory
      opendir(DD,".") or die("Cannot open ". getcwd() .": $!\n");

      # read our files getting only files named *.html
      # and *.asp (anchored with $ so that the *-fixed output
      # files from an earlier run are not picked up again)
      @FILES = grep(/\.(html|asp)$/,readdir(DD));

      if ( @FILES == 0 )
      {
          print "Sorry, there were no files in ". cwd() ." that matched *.html or *.asp\n";
          exit(1);
      }
      else
      {
          print "There are ". scalar(@FILES) ." files in ". cwd() ." that match *.html or *.asp\n";
          for ( @FILES )
          {
              print "$_\n";
          }
          print "Processing continuing...\n\n";
      }

      # close our directory descriptor
      closedir(DD);

      # walk our array that has the file names in it we should go through.
      for ( @FILES )
      {
          my $file = $_ ;
          print "Processing file: $file\n";
          my $newfile = "$file-fixed";
          print "Output file will be: $newfile\n";

          # open two descriptors
          # FD for our original files
          open(FD,$file) or die("Cannot open $file: $!\n");
          # NEW for our newly created (hopefully fixed) files.
          # ">" rather than ">>" so that a second run does not append
          # a duplicate copy to an existing output file.
          open(NEW,">$newfile") or die ("Cannot open $newfile for writing: $!\n");

          # walk each file
          while ( <FD> )
          {
              my $line = $_ ;
              chomp $line;
              # replace anything in a line with <a href in it
              # that has .html to .asp
              if ( grep(/a href.*\.html/,$line) )
              {
                  #print "Found a match -> ";
                  #print "$line";
                  #print " <-...converting\n";
                  ( my $newline = $line ) =~ s/\.html/\.asp/g;
                  #print "Substitution written was: [[ ";
                  #print "$newline";
                  #print " ]]\n";
                  $changed = 1;
                  $lcount++;
                  # append that to the new file.
                  print NEW "$newline\n";
              }
              else
              {
                  # otherwise just append the old line into
                  # the file ($line is already chomped, so this
                  # does not add an extra blank line).
                  print NEW "$line\n";
              }
          }
          # Close the descriptor for these files before
          # the next loop iteration
          close(NEW);
          close(FD);
          print "Finished with $file...\n\n";
          if ( $changed == 1 )
          {
              $fcount++;
              $changed = 0;
          }
      }

      print "There were ". scalar(@FILES) ." total files\n";
      print "There were $fcount files affected\n";
      print "There were $lcount lines changed\n";
      print "Have a nice day :) \n\n";

      Bear in mind that this only matches "a href" in lower case. If some of your anchor tags are written in caps (A HREF), this won't catch them without some tweaking.

      ----------
      - Jim

        Jim, thanks a lot for your help. What I am doing is content migration from .html files to .asp files. I'm only grabbing everything that's within the body tags and copying it over to the new .asp file in the same directory/folder. Within the body tag content I want to change any link that has .html to .asp. Your code does that for me, but it prints the line out twice: once with the .html extension and once with the .asp. I only want it to come back with the .asp. Do you know how I can get rid of the duplicate .html line? Thanks

        open (FILE, $filename);
        while(<FILE>){
            # walk each file
            my $line = $_ ;
            chomp $line;
            #grabbing and printing everything between the body tags
            if (/<body.*?>/i ... /<\/body.*?>/i){
                # this is a body line
                # extract the body
                ##changing .html to .asp in the links
                if ( grep(/a href.*\.html/,$line) ){
                    (my $newline = $line ) =~ s/\.html/\.asp/g;
                    print OUTFILE $newline . "\n";
                }
                $body_temp = $_;
                $body_temp =~ s/(.*?)\<body\>(.*?)\<\/body\>/$2/i;
                chomp($body_temp);
                $body = "$body_temp" ;
                # Write the body to the output file
                print OUTFILE $body . "\n";
            }
        }
        close(FILE);
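The doubled line in the loop above comes from its two print statements. A line containing a link is printed once as $newline (with .asp), and then every body line is printed again from $body_temp, which is copied from the unmodified $_ (still .html). One possible fix, shown here as a sketch, is to do the substitution on the line itself and print only once. The helper name copy_body and the use of lexical filehandles are assumptions for illustration; the flip-flop test and the body-stripping regex are kept from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: copies the <body> ... </body> lines from $in to $out,
# rewriting .html to .asp on link lines, and printing each line exactly once.
sub copy_body {
    my ($in, $out) = @_;

    while (my $line = <$in>) {
        chomp $line;

        # same flip-flop as above: true from the <body ...> line
        # through the </body> line, false everywhere else
        next unless $line =~ /<body.*?>/i ... $line =~ /<\/body.*?>/i;

        # change the line in place instead of printing a second copy
        $line =~ s/\.html/.asp/g if $line =~ /a href.*\.html/;

        # same stripping as above, for <body>...</body> on a single line
        $line =~ s/(.*?)<body>(.*?)<\/body>/$2/i;

        print {$out} "$line\n";
    }
}
```

With the filehandles from the code above this would be called as copy_body(\*FILE, \*OUTFILE). Note that, as in the original loop, the lines holding the body tags themselves are still written out unless both tags sit on the same line.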

        Edit by tye

        Can you explain to me how to catch hrefs that are in capital letters? I know it has something to do with /i, but I'm not sure where to put that. Thanks
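The /i modifier goes after the closing slash of the regex, alongside any other modifiers such as /g. A minimal sketch of the idea follows; html_links_to_asp is a hypothetical helper name, and the regexes are the ones used earlier in this thread with /i added.

```perl
use strict;
use warnings;

# Hypothetical helper showing where the /i modifier goes.
sub html_links_to_asp {
    my ($line) = @_;

    # /i on the match: "a href", "A HREF" and "A Href" all count
    if ( $line =~ /a href.*\.html/i ) {
        # /i on the substitution as well, so ".HTML" is rewritten too;
        # /g still means every occurrence on the line
        $line =~ s/\.html/.asp/gi;
    }
    return $line;
}
```

In the scripts above, the equivalent change would be grep(/a href.*\.html/i,$line) for the test and s/\.html/\.asp/gi for the substitution. The replacement text is always written as lower-case .asp, whatever the case of the original extension.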