Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm doing content migration (pushing the content of one set of web pages in .html format to another set of pages in .asp format), and while doing that I need to remove some old include files while keeping others. The only thing the includes to be deleted have in common is that they all start with <!--#include virtual="/global_nav_(sometimes another specific word is here).ssi"-->. Here is the code I am using now.
open (FILE, $filename);
while(<FILE>){   # walk each file
    my $line = $_ ;
    chomp $line;
    #grabbing and printing everything between the body tags
    if (/<body.*?>/i ... /<\/body.*?>/i){
        # this is a body line
        # extract the body
        ##Changing .html to .asp at the end of every link that does not belong to an external link
        if ($_ !~ m/href=\"http:\/\/.+\.html/) { s/(href=.+?\.)html/$1asp/gi }
        ###this is where i need to find and delete those certain include files
        $body_temp = $_;
        $body_temp =~ s/(.*?)\<body\>(.*?)\<\/body\>/$2/i;
        chomp($body_temp);
        $body = "$body_temp" ;
        # Write the body to the output file
        print OUTFILE $body . "\n";
    }
}
close(FILE);

Re: Removing include files
by AidanLee (Chaplain) on May 02, 2001 at 20:31 UTC

    I'm going to suggest this regex:

    s|<!--#include virtual="/global_nav_[\w]+\.ssi"-->||g

    which essentially replaces the matched value with nothing throughout the file. (I used '|' as the regex delimiter to save ourselves from having to escape all the slashes inside the regex. Also, append 'i' after the regex if you want your case-insensitive matching again.) This also assumes there is ALWAYS another word at the end of "global_nav_"; otherwise the [\w]+ should be a [\w]*.

      The poster mentioned that the word at the end *may* be there, and it may not be; so probably you want that quantifier to be a * (zero or more) not a + (one or more).

      Also, underscores count as word characters (in the default locale on my systems (US English) at least), so that regex to get rid of those includes should be:

      s|<!--#include virtual="/global_nav\w*\.ssi"-->||g;

      If you're not *sure* it's all *word* characters, then perhaps change that \w* to a [^.]*?, but that's probably unnecessary.
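To make the difference between the quantifiers concrete, here is a minimal sketch that runs the three variants discussed above against a few sample lines. The sample lines and the helper name `survivors` are made up for illustration; only the include syntax comes from the question.

```perl
use strict;
use warnings;

# Hypothetical sample lines; only the include syntax comes from the thread.
my @lines = (
    '<!--#include virtual="/global_nav_top.ssi"-->',   # trailing word present
    '<!--#include virtual="/global_nav_.ssi"-->',      # no trailing word
    '<!--#include virtual="/global_nav.ssi"-->',       # no trailing underscore either
    '<!--#include virtual="/footer.ssi"-->',           # unrelated include, must survive
);

# Return the lines that are still non-empty after deleting matches of $pattern.
sub survivors {
    my ($pattern) = @_;
    my @kept;
    for my $line (@lines) {
        (my $copy = $line) =~ s/$pattern//g;
        push @kept, $copy if length $copy;
    }
    return @kept;
}

my @plus  = survivors(qr|<!--#include virtual="/global_nav_\w+\.ssi"-->|);
my @star  = survivors(qr|<!--#include virtual="/global_nav_\w*\.ssi"-->|);
my @loose = survivors(qr|<!--#include virtual="/global_nav\w*\.ssi"-->|);

# Number of lines each variant leaves behind.
print scalar(@plus), " ", scalar(@star), " ", scalar(@loose), "\n";
```

With these sample lines, the `+` version only removes the include that has a trailing word, the `_\w*` version also removes the bare `global_nav_.ssi`, and the `\w*` version (which lets `\w` absorb the underscore) removes all three `global_nav` includes while leaving `footer.ssi` alone.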

      However, I think those are the ones the poster wants to keep; all others are to be deleted. (UPDATE oh, that's just wrong. I reread the post. Sorry AidanLee ... but my question still stands).

      I don't have a neat solution for this. I tried

      s|<!--#include virtual="(?!/global_nav\w*\.ssi" -->||g;

      (using zero-width negative lookahead, i.e. "anything that doesn't match this pattern") but it didn't work, and I can't figure out how to fix it. Something's not working right in my brain today, but a two-step thing like this might suffice:

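      foreach (@body_lines) {    # quasi-pseudocode; @body_lines holds the lines of the body
          if (m|(<!--#include virtual="([^"]+)"-->)|) {
              # $1 contains the whole match,
              # $2 what's in between the quotes
              my ($include, $path) = ($1, $2);
              # so delete the include unless the path matches the pattern
              # (\Q...\E keeps the dots in the include from acting as wildcards)
              s/\Q$include\E// unless $path =~ m|/global_nav\w*\.ssi|;
          }
      }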

      HTH ... and if anybody can tell me where my thinking's leading me down with my first stab, do tell!
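One possible reading of why the lookahead attempt fails: `(?!...)` is zero-width, so after the assertion the pattern still has to consume the path and the closing quote. The attempt as posted also lacks the closing parenthesis of the lookahead and expects a space before `-->`. The sketch below is an untested-in-production illustration of that idea, with made-up input. Note that it deletes everything except the `global_nav` includes, which is the opposite of what the original question asks for.

```perl
use strict;
use warnings;

# Hypothetical input; only the include syntax is taken from the thread.
my $html = join "\n",
    '<!--#include virtual="/global_nav_top.ssi"-->',
    '<!--#include virtual="/sidebar.ssi"-->',
    '<p>content</p>';

# Delete every include EXCEPT the global_nav ones.
# (?!...) only asserts that the path is not a global_nav include;
# [^"]+ then consumes the path that was just checked.
(my $kept_nav = $html) =~
    s|<!--#include virtual="(?!/global_nav\w*\.ssi")[^"]+"-->||g;

print "$kept_nav\n";
```

The closing quote inside the lookahead anchors the check to the whole path, so a path that merely begins with `/global_nav` but ends in something other than `.ssi` is still deleted.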

Re: Removing include files
by converter (Priest) on May 02, 2001 at 22:53 UTC
    If I understand what you're trying to do in this code, I think the following should work, with the standard disclaimer that it is only lightly tested and I may have totally misunderstood your intent.

    This code looks at each line of the input:

    • The substitution pattern at [3] is probably not the best solution, but I gave it some light testing and it seems to work OK.
    • The lexical variable $line is omitted because it isn't used.
    • [1] If the input line matches an SSI include of the type that we don't want, we do nothing with the line, which omits it from the output.
    • [2] If the line is part of the body section, fix any anchors which refer to on-site .html files.
    • Finally, we print to the output.

    # always check the results of an open()
    open (FILE, $filename)
        or die "unable to open $filename for input: $!";

    while (<FILE>) {
        # [1] omit line from output if it matches:
        # (make `[^".]*' more specific if only certain words should match)
        next if m~<!--\s*#include virtual\s*=\s*"/global_nav_[^".]*\.ssi"\s*-->~;

        chomp;

        # [2]
        if (m!<body[^>]*>!i ... m!</body[^>]*>!i) {
            # modify body lines only
            # [3] replace .html with .asp in anchors in this line
            s~(href\s*=\s*")   # capture beginning of anchor (group 1)
              (?!http://)      # _not_ followed by http://
              ([^"]+)          # capture base of filename (group 2)
              \.html           # match .html
             ~${1}${2}.asp~gix;   # make substitution, global
        }
        print OUTFILE "$_\n";   # chomp removed the newline, so put it back
    }

    UPDATE: minor changes to substitution pattern. Thanks to japhy my regex hero.
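For anyone who wants to try the substitution at [3] in isolation, here is a small self-contained sketch. The anchors are hypothetical examples, and the pattern is written on one line without the /x comments.

```perl
use strict;
use warnings;

# Hypothetical anchors for illustration only.
my @anchors = (
    '<a href="about.html">About</a>',
    '<a href="http://example.com/page.html">External</a>',
    '<a href = "docs/index.html">Docs</a>',
);

# Rewrite .html to .asp unless the href starts with http://
for (@anchors) {
    s~(href\s*=\s*")(?!http://)([^"]+)\.html~${1}${2}.asp~gi;
}

print "$_\n" for @anchors;
```

The first and third anchors end up pointing at `.asp` files, while the `http://` link is left untouched. The lookahead only excludes `http://`, so an `https://` or protocol-relative link would still be rewritten; extend the lookahead if such links appear in the pages being migrated.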