Changing .html to .asp

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

(Ovid - parsing HTML) Re: Changing .html to .asp by Ovid (Cardinal) on Apr 27, 2001 at 22:31 UTC
Definitely use some form of parsing system. Do not use a regex for this. Here's some sample HTML from a document that is routinely sent to us: `<A HREF=" 010410zed2.pdf " target="_blank"><B><font size="-2" face="arial">view cutting </b></A>` Guess what? Browsers have no problem with that. It's annoying as heck, but it works. I found it difficult to follow even when reading it, much less when trying to write a regex that would parse stuff like that accurately. Play it safe. Use a parser. Cheers, Ovid Join the Perlmonks Setiathome Group or just click on the link and check out our stats.
Re: Changing .html to .asp by little (Curate) on Apr 27, 2001 at 22:04 UTC
Use HTML::TokeParser: get all the links, get each link's href attribute, check whether that string ends in htm, html or shtml, and change that to the appropriate extension. Have a nice day All decision is left to your taste
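A minimal sketch of the steps little describes, in case it helps. It assumes HTML::TokeParser and HTML::Entities are installed from CPAN (neither is in the core distribution), and `rewrite_links` is a made-up helper name, not something from this thread. Only the `href` of `<a>` start tags is touched; every other token is copied through from its original source text, so visible text that happens to mention a `.html` file name is left alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;                    # CPAN, part of HTML-Parser (not core)
use HTML::Entities qw(encode_entities);  # ships in the same distribution

# rewrite_links() is a hypothetical helper name, not something from the thread.
# It returns a copy of $html in which <a href="..."> targets ending in
# .htm, .html or .shtml point at .asp instead.
sub rewrite_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot parse document\n";
    my $out = '';

    while ( my $token = $parser->get_token ) {
        my $type = $token->[0];

        if ( $type eq 'S' && $token->[1] eq 'a' && defined $token->[2]{href} ) {
            my ( $tag, $attr, $attrseq ) = @{$token}[ 1 .. 3 ];

            # Swap the extension only at the end of the path, allowing for
            # stray whitespace or a trailing ?query / #fragment.
            $attr->{href} =~ s/\.s?html?(?=\s*$|[?#])/.asp/i;

            # Rebuild the start tag from its parsed attributes, in order.
            $out .= "<$tag";
            for my $name (@$attrseq) {
                $out .= sprintf ' %s="%s"', $name,
                    encode_entities( $attr->{$name}, '<>&"' );
            }
            $out .= '>';
        }
        elsif ( $type eq 'T' ) {
            $out .= $token->[1];     # text tokens carry their source text second
        }
        else {
            $out .= $token->[-1];    # every other token carries it last
        }
    }
    return $out;
}

print rewrite_links(
    qq{<A HREF=" page.html " target="_blank">see index.html</A>\n}
);
```

One side effect of rebuilding the tag is that the parser hands back tag and attribute names in lower case and the values get re-quoted, so rewritten anchors will not be byte-for-byte identical to the originals apart from the extension. Diff a few files before trusting it on the whole site.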
Re: Changing .html to .asp by snafu (Chaplain) on Apr 27, 2001 at 22:28 UTC
Well, this could range from an extremely simple task (in Perl or shell) to a very difficult one, depending on just how complex your pages are. At any rate, before doing anything you should make a copy of all the files and tar/gzip them up somewhere in case something goes wrong.

So, without further ado, and assuming you did tar/gzip, I will start with shell, since that is simplest for me at this point in my Perl knowledge (or lack thereof... but I will try to put a Perl one down too; the shell one is just a one-liner). Also note that I am assuming the simplest form of your situation, meaning that there is nothing odd about your href tags and that they are pretty standard. I am also assuming (lots of assumptions going on, eh? :) that you will be working in a single directory, with no sub-directories, where all of these files reside. I am assuming they are all called *.html or *.asp.

`user@yourbox /home/httpd/html ] $ ls *.html *.asp | while read file; do newfile="$file.fixed"; sed 's/\.html/.asp/g' "$file" > "$newfile"; done`

This lil one-liner will change EVERYTHING that is .html to .asp in every $file and create a new file called $file.fixed. Theoretically, this means you will not have lost the originals and you can check the originals against the new files. Changing every occurrence might be undesirable, but since I don't know the specifics it is the best I can do at this point with a shell one-liner.

Now for Perl. (I am fully open to any assistance as I am still learning Perl.)

    #!/usr/local/bin/perl -w

    use strict;
    use cwd;

    # open our current directory
    opendir(DD,".") or die("Cannot open ". getcwd ."\n");

    # read our files getting only files named .html
    # and .asp
    my @FILES = grep(/(\.html|\.asp),readdir(FD)) or die ("Could not get directory listing: $!\n");

    # close our directory descriptor
    close(DD);

    # walk our array that has the file names in it we should go through.
    while ( @FILES ) {
        my $file = $_ ;
        my $newfile = "$file.fixed";

        # open two descriptors
        # FD for our original files
        open(FD,$file) or die("Cannot open $file\n";
        # NEW for our newly created (hopefully fixed) files.
        open(NEW,">>$newfile") or die ("Cannot open $newfile for writing\n");

        # walk each file
        while ( <FD> ) {
            chomp;
            # replace anything in a line with <a href in it
            # that has .html to .asp
            if ( grep(/<a href/) ) {
                my $newline =~ s/\.html/\.asp/;
                # append that to the new file.
                print NEW "$newline\n";
            } else {
                # otherwise just append the old line into
                # the file.
                print NEW "$_\n";
            }
        }
        # Close the descriptor for these files before
        # the next loop iteration
        close(NEW);
        close(FD);
    }

This is untested code, as I just wrote it here in this window, so go through it first. I think the Perl one would be way more thorough than the shell line. Anyway, I was rushed too. I gotta get back to work. If something is wrong I apologize. Good luck. ---------- - Jim
Re: Re: Changing .html to .asp by Anonymous Monk on Apr 27, 2001 at 22:58 UTC
Snafu, thanks a lot for your help, I think you are right on track. The only problem is that when I run the code I get an error saying that there are not enough arguments for grep, on the line where grep is used. I have very limited knowledge of Perl and would appreciate any suggestions you have to fix this. All of the assumptions you made were correct. Thanks
Re: Re: Re: Changing .html to .asp by snafu (Chaplain) on Apr 28, 2001 at 04:45 UTC
Here is the fixed copy with some print()s in it. I tested this copy, so it should work. There were a few problems with the code. The grep was missing a terminating parenthesis. It (the script) was also missing a few other things. Hope this helps.

    #!/usr/local/bin/perl -w
    # written By Jim Conner 4-27-2001

    use strict;
    use vars qw(@FILES);
    use Cwd;

    $| = 1;

    my $changed = 0;
    my $fcount  = 0;
    my $lcount  = 0;

    # open our current directory
    opendir(DD,".") or die("Cannot open ". getcwd() .": $!\n");

    # read our files getting only files named .html
    # and .asp (anchored, so earlier -fixed output is skipped)
    @FILES = grep(/\.(html|asp)$/,readdir(DD));

    # @FILES in numeric context is the file count ($#FILES is the last index)
    if ( @FILES == 0 ) {
        print "Sorry, there were no files in ". cwd() ." that matched .html or .asp\n";
        exit(1);
    } else {
        print "There are ". scalar(@FILES) ." files in ". cwd() ." that match .html or .asp\n";
        for ( @FILES ) {
            print "$_\n";
        }
        print "Processing continuing...\n\n";
    }

    # close our directory descriptor
    closedir(DD);

    # walk our array that has the file names in it we should go through.
    for ( @FILES ) {
        my $file = $_ ;
        print "Processing file: $file\n";
        my $newfile = "$file-fixed";
        print "Output file will be: $newfile\n";

        # open two descriptors
        # FD for our original files
        open(FD,$file) or die("Cannot open $file: $!\n");
        # NEW for our newly created (hopefully fixed) files.
        # ">" rather than ">>" so a second run does not append duplicates.
        open(NEW,">$newfile") or die ("Cannot open $newfile for writing: $!\n");

        # walk each file
        while ( <FD> ) {
            my $line = $_ ;
            chomp $line;
            # replace anything in a line with <a href in it
            # that has .html to .asp
            if ( $line =~ /a href.*\.html/ ) {
                #print "Found a match -> ";
                #print "$line";
                #print " <-...converting\n";
                ( my $newline = $line ) =~ s/\.html/\.asp/g;
                #print "Substitution written was: [[ ";
                #print "$newline";
                #print " ]]\n";
                $changed = 1;
                $lcount++;
                # append that to the new file.
                print NEW "$newline\n";
            } else {
                # otherwise just append the old line into
                # the file ($line is chomped; $_ still has its newline).
                print NEW "$line\n";
            }
        }
        # Close the descriptor for these files before
        # the next loop iteration
        close(NEW);
        close(FD);
        print "Finished with $file...\n\n";
        if ( $changed == 1 ) {
            $fcount++;
            $changed = 0;
        }
    }

    print "There were ". scalar(@FILES) ." total files\n";
    print "There were $fcount files affected\n";
    print "There were $lcount lines changed\n";
    print "Have a nice day :) \n\n";

Bear in mind that this only matches "a href" in lower case. If some of your tags are written in caps (A HREF), this won't catch them without some tweaking. ---------- - Jim
Re: Re: Re: Re: Changing .html to .asp by Anonymous Monk on Apr 30, 2001 at 22:48 UTC
Re: Re: Re: Re: Re: Changing .html to .asp by snafu (Chaplain) on May 01, 2001 at 00:49 UTC
Re: Re: Re: Re: Re: Changing .html to .asp by snafu (Chaplain) on May 01, 2001 at 10:45 UTC
Re: Re: Re: Re: Changing .html to .asp by Anonymous Monk on May 01, 2001 at 22:28 UTC
Re: Re: Re: Re: Re: Changing .html to .asp by snafu (Chaplain) on May 02, 2001 at 00:37 UTC
Re: Re: Re: Re: Changing .html to .asp by Anonymous Monk on May 01, 2001 at 23:08 UTC
Re: Changing .html to .asp by THRAK (Monk) on Apr 27, 2001 at 22:08 UTC
If all you need/want to do is change every HREF from a ".html" to a ".asp" extension, that should be fairly easy. Build a list of all of the files you need to change and then read through that list. For each entry, open the file and parse it line by line looking for HREFs. When you find an HREF, do a `s/\.html/.asp/` on that line. Depending on how clean your HTML is, this may or may not work, as there could be multiple .html references on each line if it's not laid out well. If it's clean, this sort of thing should work. If you have ugly HTML that is not formatted well, you will have to refine things to make sure you only match what you want to change. -THRAK www.polarlava.com
Re: Re: Changing .html to .asp by little (Curate) on Apr 27, 2001 at 22:31 UTC
THRAK, just to avoid the risk of changing the document's content as well, you had better make sure that your substitution doesn't match a > in between HREF and .(s|p)?htm(l)?, and that it ignores case. :-) Have a nice day All decision is left to your taste
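For what it's worth, little's refinement could be sketched roughly like this in plain Perl, with no modules needed. The name `fix_hrefs` is made up for the example, and the pattern is only one reading of the suggestion: it refuses to cross a `>` between `href` and the extension, and it ignores case.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# fix_hrefs() is a hypothetical helper name used only for this sketch.
sub fix_hrefs {
    my ($line) = @_;
    $line =~ s{
        ( \bhref \s* = [^>]*? )   # href= plus whatever follows, stopping at the tag end
        \. [sp]? html? \b         # .htm, .html, .shtml, .phtml and friends
    }{$1.asp}gix;
    return $line;
}

print fix_hrefs(qq{<A HREF="page.html">read page.html</A>\n});
# prints: <A HREF="page.asp">read page.html</A>
```

This still has the weaknesses Ovid points out: a tag split across two lines will be missed when the file is read line by line, and a later attribute in the same tag (say a title that mentions another .html file) could be the thing that gets rewritten. It is a step up from a bare `s/\.html/.asp/`, not a replacement for a parser.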