Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

What I am doing is content migration from .html files to .asp files. I'm only grabbing everything that's within the body tags and copying it over to the new .asp file in the same directory. Within the body content I want to change any link that has .html to .asp. The code shown below does that for me, but it prints the link line out twice: once with the .html extension and once with the .asp. I only want the .asp version. Do you know how I can get rid of the duplicate .html line? Thanks
open (FILE, $filename);
while(<FILE>){ # walk each file
    my $line = $_ ;
    chomp $line;
    #grabbing and printing everything between the body tags
    if (/<body.*?>/i ... /<\/body.*?>/i){
        # this is a body line
        # extract the body
        ##changing .html to .asp in the links
        if ( grep(/a href.*\.html/,$line) ){
            (my $newline = $line ) =~ s/\.html/\.asp/g;
            print OUTFILE $newline . "\n";
        }
        $body_temp = $_;
        $body_temp =~ s/(.*?)\<body\>(.*?)\<\/body\>/$2/i;
        chomp($body_temp);
        $body = "$body_temp" ;
        # Write the body to the output file
        print OUTFILE $body . "\n";
    }
}
close(FILE);

Replies are listed 'Best First'.
Re: Removing duplicate line
by chromatic (Archbishop) on Apr 30, 2001 at 23:12 UTC
    Yes, don't print $body if you print $newline.

    I'd rewrite your logic slightly, so you modify the line in place, using $_ instead of several unnecessary temporary variables. I could write the code for you, but you have to understand the logic first.

    Update: Here's a crack at better code:

    open (FILE, $filename) or die "Can't open $filename: $!";
    while(<FILE>){
        s/(a href=.+?\.)html/$1asp/g;
        s/(.*?)\<body\>(.*?)\<\/body\>/$2/i;
        print OUTFILE $_;
    }
    close(FILE);
    There's no need to chomp or to carry extra things around. I'm not thrilled with the second regex, but I figure it's there for a reason. :)

      Your second regex assumes the <body> and </body> tags are on the same line, and this is not always true. Check this for a better(?) solution.


      He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

      Chady | http://chady.net/
      I just started learning Perl last week, so it's basically all new to me. I think I understand what you are saying about modifying the line in place, and I definitely know that I'm not supposed to print twice, but I'm just not sure how else to do it. Any help would be really appreciated.
Re: Removing duplicate line
by astanley (Beadle) on Apr 30, 2001 at 23:21 UTC
    The reason you are getting the duplicates is because you print the $newline to the OUTFILE once you make the modification and then you print $_ to the OUTFILE regardless of whether there was a match or not. There are 2 courses of action I'd consider. One:

    In place of the grep for the href and .html, use a regex and s/// combo such as:
    if (m/a\shref/i) { s/\.html/.asp/gi }

    Your other option if you prefer to keep the grep that you are using now is to use a 'continue' at the end of the if block. IE:
    if ( grep(/a href.*\.html/,$line) ){
        (my $newline = $line ) =~ s/\.html/\.asp/g;
        print OUTFILE $newline . "\n";
        continue;
    }
    WARNING: Solution 2 is untested but I don't see anything standing immediately in the way of it working.

    UPDATE: In solution 2 change the 'continue' function to the 'next' function - don't know what I was thinking at the time. I apologize.
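
    For reference, here is a minimal sketch of how solution 2 might look with 'next' in place. It collects the output lines in an array instead of printing to OUTFILE so it can be run on its own; the sub name body_lines_with_next is made up for this example and is not from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of one file and returns the lines
# that the original loop would have written to OUTFILE.
sub body_lines_with_next {
    my @out;
    for (@_) {
        my $line = $_;
        chomp $line;
        # only lines from the opening body tag through the closing one
        if ($line =~ /<body.*?>/i ... $line =~ /<\/body.*?>/i) {
            if ($line =~ /a href.*\.html/) {
                (my $newline = $line) =~ s/\.html/.asp/g;
                push @out, $newline;
                next;    # skip the second write below, so no duplicate
            }
            push @out, $line;
        }
    }
    return @out;
}
```

    Like the original loop, this keeps the lines that hold the body tags themselves; the only change in behaviour is that a link line is written once.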

    -Adam Stanley
    Nethosters, Inc.
      thanks. Using "next" worked fine.
Re: Removing duplicate line
by rchiav (Deacon) on Apr 30, 2001 at 23:24 UTC
    I have a couple questions for you.. and they might help you see why it's not doing things the way you want.

    1) Why are you using 4 different variables to refer to the same thing? You use $_, $line, $newline and $body_temp all to refer to a line from the file. Why not just use one? I'd suggest either doing

    while ($line = <FILE>) { chomp $line; ...
    or
    while (<FILE>) { chomp; ...
    and then continue to use $_.

    2) Now look at where you check to see if it's a link. At the end of that, you write your new line to OUTFILE. What do you do after that? You continue to run through the code and write output again. I'd suggest either using a next, or just modifying $line or $_ (whichever you choose to use), or writing if..elsif statements.
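
    To make both points concrete, here is a rough sketch that uses a single variable and modifies it in place, so there is only one write per line. It returns the lines rather than printing to OUTFILE so it can be tried in isolation; the sub name body_lines_in_place is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the body lines, with links rewritten in place.
sub body_lines_in_place {
    my @lines = @_;    # work on a copy so the caller's data is left alone
    my @out;
    for my $line (@lines) {
        chomp $line;
        # skip everything outside the opening-to-closing body tag range
        next unless $line =~ /<body.*?>/i ... $line =~ /<\/body.*?>/i;
        # rewrite only lines that contain a link; other lines pass through
        $line =~ s/\.html/.asp/g if $line =~ /a href.*\.html/;
        push @out, $line;    # the one and only write per line
    }
    return @out;
}
```

    Because the line is changed before the single write, there is nothing left to skip, so neither a next after the link check nor an if..elsif is needed.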

    Hope this helps..
    Rich

Re: Removing duplicate line
by Chady (Priest) on Apr 30, 2001 at 23:44 UTC

    I would have done it like this:

    undef $/;    # to absorb the whole file in a single read.
    open FILE, $filename;
    $all = <FILE>;
    $all =~ s/\.html/\.asp/g;    # this will substitute the .html to .asp globally
    # this takes the part in between the body tags...
    $all =~ s{^.*?\<body.*?\>(.*)\</body.*?\>.*?$}
             {$1}isx;
    print OUTFILE $all;
    close FILE;
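
    One caveat with the code above: undef $/ stays in effect for the rest of the program, so any later line-by-line read would also slurp. A possible variation, sketched here under the assumption that the slurp lives in its own sub, is to use local $/ instead. The names slurp_file and extract_body are made up for this example, and the body is pulled out with a match rather than the substitution above.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/;    # undefined only until this sub returns
    my $all = <$fh>;
    close $fh;
    return $all;
}

# Hypothetical helper: rewrite .html to .asp everywhere, then return
# whatever sits between the body tags (empty string if there are none).
sub extract_body {
    my ($all) = @_;
    $all =~ s/\.html/.asp/g;
    return $all =~ m{<body.*?>(.*)</body.*?>}is ? $1 : '';
}
```

    It would be used along the lines of print OUTFILE extract_body(slurp_file($filename)); note that, as in the code above, every .html in the file is rewritten, not only those inside links.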

    He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

    Chady | http://chady.net/