in reply to Re: Re: Changing .html to .asp
in thread Changing .html to .asp

Here is the fixed copy with some print()'s in it. I tested this copy so it should work. There were a few problems with the code. The grep was missing a terminating parenthesis. It (the script) was also missing a few other things. Hope this helps.

#!/usr/local/bin/perl -w
# written By Jim Conner 4-27-2001

use strict;
use vars qw(@FILES);
use Cwd;

$| = 1;

my $changed = "0";
my $fcount  = "0";
my $lcount  = "0";

# open our current directory
opendir(DD,".") or die("Cannot open ". getcwd() .": $!\n");

# read our files getting only files named *.html
# and *.asp (anchored so that *.html-fixed output files are skipped)
@FILES = grep(/\.(html|asp)$/,readdir(DD));

# @FILES in numeric context is the number of files;
# $#FILES would be the last index, which is one less.
if ( @FILES == 0 ) {
    print "Sorry, there were no files in ". cwd() ." that matched *.html or *.asp\n";
    exit(1);
} else {
    print "There are ". scalar(@FILES) ." files in ". cwd() ." that match *.html or *.asp\n";
    for ( @FILES ) {
        print "$_\n";
    }
    print "Processing continuing...\n\n";
}

# close our directory descriptor
closedir(DD);

# walk our array that has the file names in it we should go through.
for ( @FILES ) {
    my $file = $_ ;
    print "Processing file: $file\n";

    my $newfile = "$file-fixed";
    print "Output file will be: $newfile\n";

    # open two descriptors
    # FD for our original files
    open(FD,$file) or die("Cannot open $file: $!\n");

    # NEW for our newly created (hopefully fixed) files.
    open(NEW,">>$newfile") or die ("Cannot open $newfile for writing: $!\n");

    # walk each file
    while ( <FD> ) {
        my $line = $_ ;
        chomp $line;

        # replace anything in a line with <a href in it
        # that has .html to .asp
        if ( grep(/a href.*\.html/,$line) ) {
            #print "Found a match -> ";
            #print "$line";
            #print " <-...converting\n";

            ( my $newline = $line ) =~ s/\.html/\.asp/g;

            #print "Substitution written was: [[ ";
            #print "$newline";
            #print " ]]\n";

            $changed = "1";
            $lcount++;

            # append that to the new file.
            print NEW "$newline\n";
        } else {
            # otherwise just append the old line into
            # the file.
            print NEW "$_\n";
        }
    }

    # Close the descriptor for these files before
    # the next loop iteration
    close(NEW);
    close(FD);

    print "Finished with $file...\n\n";
    if ( $changed == "1" ) {
        $fcount++;
        $changed = "0";
    }
}

print "There were ". scalar(@FILES) ." total files\n";
print "There were $fcount files affected\n";
print "There were $lcount lines changed\n";
print "Have a nice day :) \n\n";

Bear in mind that this only matches "a href" in lower case. If you have some address hypertext references in caps, this won't catch them without some tweaking.

----------
- Jim

Re: Re: Re: Re: Changing .html to .asp
by Anonymous Monk on Apr 30, 2001 at 22:48 UTC
    Jim, thanks a lot for your help. What I am doing is content migration from .html files to .asp files. I'm only grabbing everything that's within the body tags and copying it over to the new .asp file in the same directory/folder. Within the body tag content I want to change any link that has .html to .asp. Your code does that for me, but it prints the line out twice: once with the .html extension and once with the .asp. I only want it to come back with the .asp. Do you know how I can get rid of the duplicate .html? Thanks

    open (FILE, $filename);
    while(<FILE>){
        # walk each file
        my $line = $_ ;
        chomp $line;

        #grabbing and printing everything between the body tags
        if (/<body.*?>/i ... /<\/body.*?>/i){
            # this is a body line
            # extract the body

            ##changing .html to .asp in the links
            if ( grep(/a href.*\.html/,$line) ){
                (my $newline = $line ) =~ s/\.html/\.asp/g;
                print OUTFILE $newline . "\n";
            }

            $body_temp = $_;
            $body_temp =~ s/(.*?)\<body\>(.*?)\<\/body\>/$2/i;
            chomp($body_temp);

            $body = "$body_temp" ;

            # Write the body to the output file
            print OUTFILE $body . "\n";
        }
    }
    close(FILE);


      Hmm. Odd, I went ahead and double checked it (my code) on my box and I can't seem to reproduce what you are seeing. However, I did notice a lil annoyance that I went ahead and fixed. Before getting to that though I will show you what I see when I run the script on my box:

      Do you mind pasting a screenshot of what you see with your output?

      Now, for the annoyance. I noticed unneeded linefeeds getting in the fixed files. Easy to fix. Go to the lines where it does this:

      
                  ( my $newline = $line ) =~ s/\.html/\.asp/g;

                  #print "Substitution written was: [[ ";
                  #print "$newline";
                  #print " ]]\n";

                  $changed = "1";
                  $lcount++;

                  # append that to the new file.
                  print NEW "$newline\n";
              } else {
                  # otherwise just append the old line into
                  # the file.
                  print NEW "$_\n";
              }
      

      Remove the '\n' from the print statement in the else branch. That line prints $_, which was never chomped, so it already ends in a newline. The print of $newline has to keep its '\n', because $newline is copied from the chomped $line. That fixed that lil problem for me. As for the code you are writing to supplement what I have done, unfortunately I don't have a lot of time to look at it right now, being that I am at work and should be...ahem...well, working =P. I will check it closer tonight (later for me...gotta spend time with my family. I usually wait till my wife goes to sleep to play).
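
To see where the extra linefeeds come from, here is a minimal sketch. The `render` helper and the sample lines are made up for illustration and are not part of the script; the helper just mimics the two print branches and returns what would be written for one input line.

```perl
use strict;
use warnings;

# Hypothetical helper: mimics the two print branches of the script.
# $raw plays the role of $_ (newline still attached).
# $fixed selects the corrected else branch (no extra "\n").
sub render {
    my ($raw, $fixed) = @_;
    my $line = $raw;
    chomp $line;                          # $line has no trailing newline now
    if ( $line =~ /a href.*\.html/ ) {
        ( my $newline = $line ) =~ s/\.html/.asp/g;
        return "$newline\n";              # chomped copy, so "\n" is needed
    }
    # $raw was never chomped, so appending "\n" doubles the newline
    return $fixed ? $raw : "$raw\n";
}

print render(qq{<a href="a.html">a</a>\n}, 1);  # changed line
print render("<p>plain</p>\n", 0);              # old else branch
print render("<p>plain</p>\n", 1);              # corrected else branch
```

Only the old else branch produces a blank line after its output; the changed line comes out with exactly one newline either way.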

      Good luck. If you want, we can talk more about it in real time in irc /server irc.openprojects.net #perl or we can continue to do this.

      ----------
      - Jim

      Ok. I found your problem. Here, let me show ya.
              ##changing .html to .asp in the links
              if ( grep(/a href.*\.html/,$line) ){
                  (my $newline = $line ) =~ s/\.html/\.asp/g;
                  print OUTFILE $newline . "\n";
              }
      
      Ok, that is fine. Your problem, however, is after this. Bear in mind that you are going through these files line by line. Therefore, the line that you are replacing the old line with must be placed in the new file instead of the old line, right? So, keeping your above code in mind, you have just found a line that matches what you are looking for and have changed it. You have also printed that line to the newfile. But what do you do next?
               $body_temp = $_;
               $body_temp =~ s/(.*?)\<body\>(.*?)\<\/body\>/$2/i;
               chomp($body_temp);
      
               $body = "$body_temp" ;
              
               # Write the body to the output file
               print OUTFILE $body . "\n";
              }
      
      ...among all the other stuff you did with the html body tag, you printed the line again, because you never left that line before running it through the code after your if statement. You won't leave that line until the end of the loop iteration. Therefore, you should add an 'else' block to your 'if' statement.

      e.g.

      while ( <FILE> ) {
          if ( this line matches this regex ) {
              change the line;
              # If you need to do something to this line
              # do it here
              print it to the new file;
          } else {
              # this line obviously does not match my regex
              # so ignore it (or do some more stuff to it) 
              # and move on to the next line.
              Stuff to do to the line I didn't have to change...
              print the old line to the new file
          }
      }
      
      See what I'm doing? :)
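
As a runnable version of the outline above, here is a minimal sketch. The `convert_line` helper and the sample input are assumptions made for illustration; they are not code from this thread. The point is only that exactly one branch runs for each line, so each line is written once.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text to write to the new file
# for one input line. Only one branch runs per line.
sub convert_line {
    my ($line) = @_;
    chomp $line;
    if ( $line =~ /a href.*\.html/ ) {
        # link line: rewrite it and emit only the rewritten copy
        ( my $newline = $line ) =~ s/\.html/.asp/g;
        return "$newline\n";
    }
    else {
        # no link on this line: pass it through unchanged
        return "$line\n";
    }
}

my @input = (
    qq{<p>See <a href="about.html">about</a></p>\n},
    qq{<p>No links on this line.</p>\n},
);
print convert_line($_) for @input;
```

Any extra per-line work, such as stripping the body tags, would go inside whichever branch it applies to, before the return.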

      ----------
      - Jim

Re: Re: Re: Re: Changing .html to .asp
by Anonymous Monk on May 01, 2001 at 22:28 UTC
    Can you explain to me how to catch hrefs that are in capital letters? I know it has something to do with /i, but I'm not sure where to put that. Thanks
      According to the book I have here it says to use the regex:

      /regex/i

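      So, in the grep that checks each line for a link (not the one that reads the directory), place an i at the end of the regex like this:

      if ( grep(/a href.*\.html/i,$line) ) {

      If the .html extension itself might be in caps, put the i on the substitution too: s/\.html/\.asp/gi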

      Try that. Let me know if it works. Again, I am at work so I can't test it till later.
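
A small sketch of where the /i goes, assuming the same line-matching regex as the script earlier in this thread. The `fix_links` helper and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrites .html links regardless of letter case.
sub fix_links {
    my ($line) = @_;
    # /i on the match catches "A HREF", "a href", "A Href", ...
    if ( $line =~ /a href.*\.html/i ) {
        # /i on the substitution also catches ".HTML" and ".Html"
        $line =~ s/\.html/.asp/gi;
    }
    return $line;
}

print fix_links(qq{<A HREF="INDEX.HTML">home</A>\n});
print fix_links(qq{<a href="news.html">news</a>\n});
```

Note that the replacement text is always the lowercase .asp, whatever case the original extension used.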

      ----------
      - Jim