Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm doing content migration (pushing the content of one set of web pages in .html format to another set of pages in .asp format), and while doing that I need to change certain links from the old site from having a .html extension to having a .asp extension. Some links point to external sites, and I do not want those changed from .html to .asp. The only thing these links have in common is that they have http:// at the beginning and .html at the end. So I need to search for those two keys, and when I find them, I just need to leave the line alone. Here is the code I am using now.
open (FILE, $filename);
while(<FILE>){    # walk each file
    my $line = $_ ;
    chomp $line;
    #grabbing and printing everything between the body tags
    if (/<body.*?>/i ... /<\/body.*?>/i){
        # this is a body line
        # extract the body
        ###########################
        ###### this is where i need to put an if else statement that does
        ###### nothing if http:// and .html are found in the same line

        ##Changing .html to .asp at the end of every link that do not
        ##belong to an external link
        s/(href=.+?\.)html/$1asp/gi;
        $body_temp = $_;
        $body_temp =~ s/(.*?)\<body\>(.*?)\<\/body\>/$2/i;
        chomp($body_temp);
        $body = "$body_temp" ;
        # Write the body to the output file
        print OUTFILE $body . "\n";
    }
}
close(FILE);

Replies are listed 'Best First'.
(tye)Re: Changing .html to .asp on some links but not others
by tye (Sage) on May 02, 2001 at 01:00 UTC

    First, you really should fix your web pages to put quotes around your hrefs:

        s/href=([^"\s>]+)/href="$1"/g

    because your current code will fail for cases like:

        <a href="http://www.google.com/">Google</a> uses .html files.

    to name just one of many problem cases.
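To make that problem case concrete, here is a minimal sketch that runs the substitution from the question on tye's example line, then applies the quoting fix to an unquoted link. The helper names `original_rewrite` and `quote_hrefs` are made up for illustration; only the two regexes come from the thread.

```perl
use strict;
use warnings;

# The substitution from the question, wrapped in a hypothetical helper.
sub original_rewrite {
    my ($line) = @_;
    $line =~ s/(href=.+?\.)html/$1asp/gi;
    return $line;
}

# The quoting fix suggested above, also wrapped for illustration.
sub quote_hrefs {
    my ($line) = @_;
    $line =~ s/href=([^"\s>]+)/href="$1"/g;
    return $line;
}

# The lazy .+? runs past the end of the link and rewrites the prose.
print original_rewrite('<a href="http://www.google.com/">Google</a> uses .html files.'), "\n";

# An unquoted href gains quotes; an already quoted one is left alone.
print quote_hrefs('<a href=page.html>Page</a>'), "\n";
```

The first line comes back with "uses .asp files." even though the link itself contains no .html at all, which is the failure being described. Once every href is quoted, the later patterns can use `[^"]*` to stay inside a single attribute.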

    Easy way:

    s/(href="[^"]*)\.html"/$1.asp"/g;
    s{(href="http://[^"]*)\.asp"}{$1.html"}g;
    but that breaks if you already have some external links that end in ".asp".
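A small sketch of that failure mode, assuming every href is already double-quoted. The helper name `two_pass` and the example.com URLs are invented for the demonstration, and both passes use s{}{} delimiters so the slashes in http:// need no escaping.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two-pass "easy way".
sub two_pass {
    my ($html) = @_;
    $html =~ s{(href="[^"]*)\.html"}{$1.asp"}g;         # pass 1: every link to .asp
    $html =~ s{(href="http://[^"]*)\.asp"}{$1.html"}g;  # pass 2: external links back
    return $html;
}

# An external .html link survives the round trip.
print two_pass('<a href="http://example.com/a.html">ok</a>'), "\n";

# An external link that really is .asp gets wrongly turned into .html.
print two_pass('<a href="http://example.com/b.asp">broken</a>'), "\n";
```

The second pass cannot tell a link it just converted from one that ended in .asp all along, so b.asp comes out as b.html.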

    Better:

    s{(href="[^"]*\.html")}{
        my $s= $1;
        $s =~ s#\.html"$#.asp"# unless $s =~ m#^href="http:#;
        $s
    }ge;
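Here is a runnable sketch of the per-link idea, again assuming double-quoted hrefs. The helper name `rewrite_links` and the sample URLs are hypothetical. Since the captured text begins with `href="`, the external-link test has to look for `href="http://` rather than a bare `http:` at the start of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: decide link by link whether to change the extension.
# Each captured link starts with href=", so the guard must include it.
sub rewrite_links {
    my ($html) = @_;
    $html =~ s{(href="[^"]*\.html")}{
        my $link = $1;
        $link =~ s/\.html"$/.asp"/i unless $link =~ m!^href="http://!i;
        $link;
    }gei;
    return $html;
}

my $sample = '<a href="about.html">About</a> '
           . '<a href="http://example.com/x.html">Ext</a>';
print rewrite_links($sample), "\n";
```

On the sample, about.html becomes about.asp and the example.com link is left as it was. External links that already end in .asp are never touched, because only hrefs ending in .html are captured in the first place.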

    Oh, and I hope you don't have href=x outside of <a> tags in any of your web pages. (:

            - tye (but my friends call me "Tye")
Re: Changing .html to .asp on some links but not others
by Anonymous Monk on May 02, 2001 at 01:18 UTC
    Learn to use HTML::Filter. It might take a bit of work to figure it out, but it's worth it.
Re: Changing .html to .asp on some links but not others
by how do i know if the string is regular expression (Initiate) on May 02, 2001 at 01:07 UTC
Re: Changing .html to .asp on some links but not others
by astanley (Beadle) on May 02, 2001 at 00:50 UTC
    You need to add a regex check in the spot where you already determined it should be. I recommend it look like this:
    if ($_ !~ m/href=\"http:\/\/.+\.html/) { s/(href=.+?\.)html/$1asp/gi }
    Warning: the regex is untested, but I believe it should work provided the links have quotation marks around the URLs.

    -Adam Stanley
    Nethosters, Inc.
      Given that he's reading one line at a time, this solution says:
      If the current line does not contain the string href="http://[anything].html then replace every instance (on this line) of href=[anything].html with href=[anything].asp
      If there are multiple links on one line, and any one of them is external, this solution will skip all links on that line. If a link spans more than one line, it will not be handled properly. After looking at this same problem posted here what, 4 times now, I'm thinking maybe the best approach is to use HTML::TokeParser, analyze each link (using the convenient get_tag method), and rewrite the HTML with CGI or something so it's nice and pretty.
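The skipped-line behaviour is easy to reproduce. The sketch below compares the line-level check with a per-link pattern that uses a negative lookahead. The helper names `per_line` and `per_link` are hypothetical, the lookahead version is only one possible regex-based alternative (it is not the HTML::TokeParser approach), and it assumes double-quoted hrefs.

```perl
use strict;
use warnings;

# Line-level check as proposed above: one external .html link
# anywhere on the line suppresses the whole substitution.
sub per_line {
    my ($line) = @_;
    if ($line !~ m/href=\"http:\/\/.+\.html/) {
        $line =~ s/(href=.+?\.)html/$1asp/gi;
    }
    return $line;
}

# Per-link alternative (an assumption for illustration): the lookahead
# (?!http://) rejects external links one at a time.
sub per_link {
    my ($text) = @_;
    $text =~ s{(href="(?!http://)[^"]*)\.html"}{$1.asp"}gi;
    return $text;
}

my $line = '<a href="a.html">A</a> <a href="http://example.com/b.html">B</a>';
print per_line($line), "\n";   # a.html is left unchanged
print per_link($line), "\n";   # a.html becomes a.asp, b.html stays
```

Because `[^"]*` also matches newlines, `per_link` can be applied to a whole file read into one string, which sidesteps the one-line-at-a-time limitation for quoted links. A real parser is still the sturdier choice for messy HTML.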

      --isotope
      http://www.skylab.org/~isotope/