voyager has asked for the wisdom of the Perl Monks concerning the following question:

I pulled the contents of a web page using LWP::Simple. Now I want to extract a portion and I need help with the matching.
stuff i don't want <!---CURCON--> stuff i do want <!---END CURCON--> more stuff i don't want
My attempts:
$current =~ s/<!---CURCON-->(.*)<!---END CURCON-->/$1/s;

$start = '<!---CURCON-->';
$end = '<!---END CURCON-->';
$current =~ s/$start(.*)$end/$1/s;
To no avail. I suspect I am not escaping some characters properly.

(jeffa) Re: regex in html
by jeffa (Bishop) on Apr 01, 2001 at 20:27 UTC
    You are close. First, you have to undefine the input record separator ($/) if you want to slurp the whole input in one read; otherwise you will only get data up to the first newline encountered:
    undef $/;
    $current = <DATA>;
    $start = '<!---CURCON-->';
    $end = '<!---END CURCON-->';
    my ($match) = $current =~ m/$start(.*)$end/s;
    print $match;
    __DATA__
    stuff i don't want
    <!---CURCON-->
    stuff i do want
    <!---END CURCON-->
    more stuff i don't want
    You are correctly using the 's' modifier on your regex, but instead of using s///, use m// and capture $1 in another variable. The trick is that you have to evaluate the match in list context:
    my ($match) = $current =~ m/$start(.*)$end/s; # note the parens around $match
    Otherwise $match will only hold the true/false result of the match (1 on success), not the captured text.
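
    A minimal sketch of the list-versus-scalar difference, assuming sample text shaped like the question's (the names $captured and $flag are only for illustration):

    ```perl
    use strict;
    use warnings;

    # Sample text modelled on the question; not the real page.
    my $current = "stuff i don't want\n<!---CURCON-->\nstuff i do want\n<!---END CURCON-->\nmore stuff i don't want\n";
    my $start = '<!---CURCON-->';
    my $end   = '<!---END CURCON-->';

    # List context (parens on the left): the match returns its captures.
    my ($captured) = $current =~ m/$start(.*)$end/s;

    # Scalar context (no parens): the match returns only true or false.
    my $flag = $current =~ m/$start(.*)$end/s;

    print "captured: [$captured]\n";   # the text between the markers
    print "flag: [$flag]\n";           # 1, because the match succeeded
    ```

    The same pattern is used both times; only the parentheses around the variable on the left change what comes back.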

    Now $match will contain a newline at the beginning as well as one at the end. You can delete them (along with any other newlines inside the text) like this:

    $match =~ tr/\n//d;
    # or
    $match =~ s/\n//g;
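
    Both forms above delete every newline, including ones inside the captured text. If the text spans several lines and you only want the ends trimmed, something like this sketch may fit better (the sample string and variable names are made up for illustration):

    ```perl
    use strict;
    use warnings;

    # Hypothetical multi-line capture with a newline at each end.
    my $match = "\nline one\nline two\n";

    # Delete every newline, as tr/\n//d does.
    (my $all_removed = $match) =~ tr/\n//d;

    # Trim newlines only at the very start and very end of the string.
    (my $ends_trimmed = $match) =~ s/\A\n+|\n+\z//g;

    print "[$all_removed]\n";    # [line oneline two]
    print "[$ends_trimmed]\n";   # the two lines, still separated by a newline
    ```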
    Jeff

    R-R-R--R-R-R--R-R-R--R-R-R--R-R-R--
    L-L--L-L--L-L--L-L--L-L--L-L--L-L--
    
      TIMTOWTDI...

      But also, I think Jeff misunderstands how your data is coming in. I'm assuming you're opening a file before the code you listed, and not using the __DATA__ token in your script.

      I don't like redefining $/, especially the way Jeff shows it, because the change is not local and may cause problems later in your program. If you insist, use:

      # assuming the DATA filehandle is opened for reading...
      # declare
      my $current;
      # begin local code block
      {
          # locally define $/
          local $/ = undef;
          # slurp
          $current = <DATA>;
      # end local code block
      }
      For more on $/, see '6.7. Reading Records with a Pattern Separator' in The Perl Cookbook.
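
      To see that the change really is confined to the block, here is a small sketch that reads from an in-memory filehandle instead of a real file (the $page string and the $fh and $restored names are stand-ins, not part of the original code):

      ```perl
      use strict;
      use warnings;

      # Stand-in for the page contents; a real script would open a file here.
      my $page = "line 1\nline 2\nline 3\n";
      open(my $fh, '<', \$page) or die "Can't open in-memory handle: $!";

      my $current;
      {
          local $/ = undef;     # slurp mode, only inside this block
          $current = <$fh>;     # one read returns everything
      }
      close($fh);

      # Outside the block, $/ has its default value again.
      my $restored = defined $/ && $/ eq "\n";

      print "slurped ", length($current), " characters\n";
      print "\$/ is back to newline\n" if $restored;
      ```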

      But I'd do it this way, anyway...

      # open
      open (DATA,"/path/to/webpage.htm") || die "Can't open page - $!";
      # slurp
      $current = join '', (<DATA>);
      # close
      close(DATA);
      # match (non-greedy, dropping the newline next to each marker)
      $current =~ /<!---CURCON-->\n(.*?)\n<!---END CURCON-->/s;
      # store
      my $match = $1;
      Jeff's match also grabs an extra \n at beginning and end which you may not need (small point :)

      hope this makes sense.

      cLive ;-)

        I think Jeff misunderstands how your data is coming in

        Nope. You said it right the first time: TIMTOWTDI ;)
        I mentioned the extra newlines but did not address them, because I did not know EXACTLY how the data will look EVERY time. What if there are multiple blank lines?

        my ($match) = $current =~ m/$start\s*(.*?)\s*$end/s; # non-greedy, so the trailing \s* can take the whitespace
        But thanks for sharing comments and criticisms, don't get me wrong, ++cLive ;-) :)
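
        One detail worth checking in a pattern of this shape: a greedy (.*) takes the trailing whitespace for itself before the final \s* gets a chance, while a non-greedy (.*?) leaves it to \s*. A rough sketch, with invented sample text and variable names:

        ```perl
        use strict;
        use warnings;

        my $start = '<!---CURCON-->';
        my $end   = '<!---END CURCON-->';

        # Made-up input with several blank lines around the wanted text.
        my $current = "junk $start\n\n  stuff i do want \n\n$end junk";

        # Greedy capture: trailing whitespace ends up inside the capture.
        my ($greedy) = $current =~ m/$start\s*(.*)\s*$end/s;

        # Non-greedy capture: the final \s* consumes the trailing whitespace.
        my ($lazy) = $current =~ m/$start\s*(.*?)\s*$end/s;

        print "greedy: [$greedy]\n";
        print "lazy:   [$lazy]\n";    # [stuff i do want]
        ```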

        Jeff

        
Re: regex in html
by Trimbach (Curate) on Apr 01, 2001 at 20:30 UTC
    Don't use substitution when you really just want to match:
    ($current) = $everything =~ m/<!---CURCON-->(.*)<!---END CURCON-->/s;
    ...should work just fine. $current will now contain everything between the comments. If you want to insert the contents of $current somewhere else, there's no need to use another regex:
    $new = $start . $current . $end;
    ...which will sandwich $current between $start and $end, which is what it looks like you want.
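
    To illustrate the first point, here is a sketch of why the substitution in the question appears not to work: s/// swaps the markers-plus-content for the content alone, but everything outside the markers is still in the string. The sample text is the one from the question; the variable names are illustrative only.

    ```perl
    use strict;
    use warnings;

    my $everything = "stuff i don't want <!---CURCON--> stuff i do want <!---END CURCON--> more stuff i don't want";

    # Substitution on a copy: only the two markers disappear.
    (my $substituted = $everything) =~ s/<!---CURCON-->(.*)<!---END CURCON-->/$1/s;

    # Match in list context: only the text between the markers comes back.
    my ($current) = $everything =~ m/<!---CURCON-->(.*)<!---END CURCON-->/s;

    print "substituted: [$substituted]\n";
    print "matched:     [$current]\n";    # [ stuff i do want ]
    ```

    Nothing in the markers needs escaping here; the pattern matched all along, it just was not the right operation for the job.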

    Gary Blackburn
    Trained Killer

Re: regex in html + from ... to
by bjelli (Pilgrim) on Apr 02, 2001 at 13:39 UTC

    If you are processing big files you might want to avoid slurping the whole thing at once. Here the range operator ... comes in handy. When used in scalar context it returns a boolean and does just what you need here:

    while (<DATA>) {
        if (/$start/.../$end/) {
            print;
        }
    }

    I'll try to explain what happens in detail:

    The magic is in the three dots. When the first line is processed, the three dots are in the "false" state. They take the expression on the left (/$start/) and evaluate it. If the expression returns false, nothing changes: the three dots return false. If the expression returns true, the three dots return true and go into the "true" state.

    The next time we come to the three dots, the expression on the right (/$end/) is evaluated. If it returns false, everything stays the same: the three dots continue to return true. If it returns true, the three dots return true one last time and then go back into the "false" state.

    But once you've grokked all that, you just think of the whole while + if + ... construct as "from /$start/ to /$end/"
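
    A self-contained sketch of the same loop, reading from an in-memory filehandle so it runs without a data file (the sample lines and the @kept array are assumptions for illustration). Note that the lines holding the markers themselves are part of the range:

    ```perl
    use strict;
    use warnings;

    my $start = '<!---CURCON-->';
    my $end   = '<!---END CURCON-->';

    # Sample input, one item per line, modelled on the question.
    my $page = join "\n",
        "stuff i don't want", $start, "stuff i do want", $end,
        "more stuff i don't want", "";

    open(my $fh, '<', \$page) or die "Can't open in-memory handle: $!";

    my @kept;
    while (my $line = <$fh>) {
        # True from the line matching $start through the line matching $end.
        push @kept, $line if $line =~ /$start/ ... $line =~ /$end/;
    }
    close($fh);

    print @kept;   # the start marker line, the wanted line, the end marker line
    ```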

    --
    Brigitte    'I never met a chocolate I didnt like'    Jellinek
    http://www.horus.com/~bjelli/         http://perlwelt.horus.at

      The Camel, 2nd Ed., states that the range operator is two dots '..', not three.

      ?

        It can be either .. or ... and the two have subtly different effects.

        With two dots, it's possible for both the start and end checks to be true on the same line. This means that the operator goes from false to true and back to false again on one evaluation. With three dots, if the start check is true, then the end check isn't checked until the next evaluation - thus forcing at least one iteration with the operator returning true.
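
        A small sketch of that difference, assuming both markers sit on the same line (the sample lines and array names are invented for the example):

        ```perl
        use strict;
        use warnings;

        my $start = '<!---CURCON-->';
        my $end   = '<!---END CURCON-->';

        # Both markers appear on the second line; no later line has an end marker.
        my @lines = (
            "before\n",
            "$start one-line section $end\n",
            "after\n",
            "trailing\n",
        );

        my (@two_dots, @three_dots);

        for my $line (@lines) {
            # Two dots: the end check runs on the same line that turned the range on.
            push @two_dots, $line if $line =~ /$start/ .. $line =~ /$end/;
        }

        for my $line (@lines) {
            # Three dots: the end check waits until the next line.
            push @three_dots, $line if $line =~ /$start/ ... $line =~ /$end/;
        }

        print "two dots kept ", scalar(@two_dots), " line(s)\n";      # 1
        print "three dots kept ", scalar(@three_dots), " line(s)\n";  # 3
        ```

        With three dots the range never sees an end marker after the line that opened it, so it stays true to the end of the input. Which behaviour you want depends on whether the markers can share a line.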

        --
        <http://www.dave.org.uk>

        "Perl makes the fun jobs fun
        and the boring jobs bearable" - me