chantstophacking has asked for the wisdom of the Perl Monks concerning the following question:

I have no choice but to use a regex for this problem. I'm using someone else's content management tool and I don't have access to the code to make changes in how it works. (Well, I do have access to the code, but I'm not going to change it with a new version due out any time now.)

I'm trying to clip out a section of an HTML stream, and I'm allowed to supply a regex that will be used to select the portion I want to keep. I'm not getting access to the real match operator (m//) so I cannot supply any options; I have to use a straight-up regex.

To make it worse, I want to use no greediness at the beginning, and then use greediness after that. But I have to supply one intact expression to do the job.

So consider this content stream...

bad stuff bad stuff <!--begin node--> good stuff <!--end node--> <!--begin node--> good stuff <!--end node--> bad stuff bad stuff

My job is to discard the stuff before the first occurrence of "begin node" and claim everything up to the last occurrence of "end node" (where I do not really want the comment lines, but I can accept them if necessary).

The methodology allows me to phrase a regex, and use parentheses to mark the section I want to keep.

The problem is that if I use a ".*" at the beginning as in:

 .*begin node...(.*)....end node

then the greediness grabs me only the last node in the sequence. And there is nothing characteristic just prior to the first "begin node" marker that allows me to anchor it further.
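
For anyone who wants to reproduce the symptom, here is a minimal sketch in plain Perl, outside the content management tool. The sample stream is the one from the question with "1" and "2" added to tell the nodes apart, and the full comment markers are written out in place of the dots in the pattern above; both are assumptions made for illustration.

```perl
use strict;
use warnings;

# Sample stream from the question; the numbers are added only to tell the nodes apart.
my $stream = 'bad stuff bad stuff <!--begin node--> good stuff 1 <!--end node--> '
           . '<!--begin node--> good stuff 2 <!--end node--> bad stuff bad stuff';

# The greedy leading .* swallows the whole line and backs off only as far as it
# must, so the "begin node" it settles on is the LAST one in the stream.
my ($kept) = $stream =~ /.*<!--begin node-->(.*)<!--end node-->/;

print "[$kept]\n";    # prints "[ good stuff 2 ]"
```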

Just so you know, I tried omitting the initial ".*" and that simply resulted in grabbing the entire document for me.

There may not be a good answer for this, but I knew that if there is an answer at all, it would be mentioned to me here. (At least that's what I've been told :-))

Replies are listed 'Best First'.
Re: Must use regex, how to clip...
by bart (Canon) on Jan 08, 2003 at 10:45 UTC
    I'm not getting access to the real match operator (m//) so I cannot supply any options
    Oh yes you can. There's a way to provide options inside the regex, even for only part of the regex, by putting them between the question mark and the colon in a (?:PAT) (non-capturing parens) construct. For example, these are equivalent:
    /.*/si
    /(?si:.*)/
    For automatic help on the syntax, try printing out a qr// pattern with whatever options you like.
    print qr/.*/si
    (?si-xm:.*)
    As you can see, there are even more possibilities using this syntax: you can locally disable certain options, even if they're globally set for the whole regex, by putting a "-" in front of the list of options you want disabled. See also perlre.
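
A small sketch of the inline-option idea applied to the node markers from the question. It assumes the tool simply interpolates the supplied string into a match, which the thread does not confirm, and it uses a multi-line version of the sample stream invented here for illustration.

```perl
use strict;
use warnings;

# Multi-line variant of the sample stream (made up for this example).
my $stream = "bad stuff\n<!--begin node-->\ngood stuff 1\n<!--end node-->\n"
           . "<!--begin node-->\ngood stuff 2\n<!--end node-->\nbad stuff\n";

# Patterns held as plain strings, the way a tool might receive them (assumption).
my $plain  = '<!--begin node-->(.*)<!--end node-->';
my $inline = '(?s:<!--begin node-->(.*)<!--end node-->)';

# Without /s, "." stops at the newline after the marker, so nothing matches.
my ($without) = $stream =~ /$plain/;

# With the inline (?s:...), "." crosses newlines and the capture spans both nodes.
my ($with) = $stream =~ /$inline/;

print defined $without ? "plain: matched\n" : "plain: no match\n";
print "inline kept: [$with]\n";
```

The capturing parentheses still count as group 1 even though they sit inside the non-capturing (?s:...) wrapper.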
      ...thank you. *This* is exactly what I was hoping to discover. I'll review perlre again to see what I missed the last time. Thanks a million bart!
Re: Must use regex, how to clip...
by seattlejohn (Deacon) on Jan 08, 2003 at 05:59 UTC
    Can you use non-greedy matching at the beginning, e.g. something like this:
    .*?<!--begin node-->(.*)<!--end node-->

    Or is .*? not permissible?
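
A quick check of that pattern in plain Perl, as a sketch only: the stream is the sample from the question with "1" and "2" added to distinguish the nodes, and whether the tool accepts .*? is still an open question.

```perl
use strict;
use warnings;

# Sample stream from the question; the numbers are added only to tell the nodes apart.
my $stream = 'bad stuff bad stuff <!--begin node--> good stuff 1 <!--end node--> '
           . '<!--begin node--> good stuff 2 <!--end node--> bad stuff bad stuff';

# Non-greedy .*? stops at the FIRST "begin node"; the greedy (.*) then runs
# as far as it can, so the match ends at the LAST "end node".
my ($kept) = $stream =~ /.*?<!--begin node-->(.*)<!--end node-->/;

print "[$kept]\n";
# prints "[ good stuff 1 <!--end node--> <!--begin node--> good stuff 2 ]"
```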

            $perlmonks{seattlejohn} = 'John Clyman';

Re: Must use regex, how to clip...
by graff (Chancellor) on Jan 08, 2003 at 06:27 UTC
    This is a perl question, right? (I mean, you are actually using perl to run your regex over the data, aren't you?) If not, then I'm sorry I misunderstood...

    Anyway, are you sure you can't use "index()", "rindex()" and "substr()" instead of regexes? (I guess "length()" could be helpful, too.) E.g.:

    $bgn_target = "<!--begin node-->";
    $bgn_offset = index( $_, $bgn_target ) + length( $bgn_target );
    $keep_length = rindex( $_, "<!--end node-->" ) - $bgn_offset;
    $keep_string = substr( $_, $bgn_offset, $keep_length );
    Okay, it's a bit clumsy, and could be done more compactly, but it's one way of doing the job, if it's available to you.
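
One thing the snippet above does not cover is a missing marker, since index() and rindex() return -1 when nothing is found. A possible way to wrap the same idea with those checks is sketched below; clip_between is a hypothetical helper name, not something from the tool or from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the first $begin marker and the
# last $end marker, or undef if a marker is missing or they are out of order.
sub clip_between {
    my ($text, $begin, $end) = @_;

    my $start = index($text, $begin);
    return if $start < 0;              # no begin marker at all
    $start += length $begin;           # skip past the marker itself

    my $stop = rindex($text, $end);
    return if $stop < $start;          # end marker missing (-1) or before the begin

    return substr($text, $start, $stop - $start);
}

my $stream = 'bad <!--begin node--> good 1 <!--end node--> '
           . '<!--begin node--> good 2 <!--end node--> bad';

my $kept = clip_between($stream, '<!--begin node-->', '<!--end node-->');
print "[$kept]\n";
# prints "[ good 1 <!--end node--> <!--begin node--> good 2 ]"
```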

    update: Looking at your post again, I figure the above suggestion is totally off the mark -- oh well.

    Getting back to the regex... it may be that you don't need to worry at all about the stuff that precedes the first "begin node" signal -- just this much ought to match what you want to retain:

    /<!--begin node-->(.*)<!--end node-->/
    (that is, assuming that your regex engine -- whatever it is -- knows about using parens to capture part of a match)

    When you say you can't "supply any options", does this mean you can't use the "s" qualifier on the match (so that "." matches new-lines as well as all other characters)? Or is this not an issue for you?

    (The whole setup as you describe it seems kinda cryptic and warped, like you're working inside a totalitarian regime...)

      This is a perl question, right? (I mean, you are actually using perl to run your regex over the data, aren't you?)

      ...well it is a Perl question in this sense. I have a content management system (TWiki) and it is implemented in Perl. It allows a convention for syndicating other content through the use of a regex. I presume it is simply taking the argument I pass to it (through an %INCLUDE{} directive) and applying it to the incoming HTML stream.

      So I am not using the m// operator directly. I'm simply passing a regex, and I presume that the system puts my regex into the match operator for me.

      The whole thing is incompletely documented, so I have to make some guesses about how it's working.

      (The whole setup as you describe it seems kinda cryptic and warped, like you're working inside a totalitarian regime...)

      In a way, that's true. But I'm only prevented from understanding it more fully by the fact that I'm too lazy to dig through the code right now to see what they're doing to my regex. I know that a new version is coming out any day, and I don't want to patch something that may be totally different in the next release. (On the other hand, if I solve this problem from the user level, that solution may need to change at the next rev anyway.) But I sort of hoped that maybe I'd find a clever way to write a Perl regex that solves my problem.