Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I might have simplified this too much, but:

$html = '<!--% xxxx UNO xxxx %--> <!--% xxxx DOS xxxx %-->';
$html =~ s/<!--%(.*?) DOS (.*?)%-->/GONE/s;
print $html;

What I want is to leave the first comment tag alone while replacing the second with GONE (and yes, I do need to grab the xxxx text). The 'xxxx' in both tags may be anything, and may span several lines. I may not always be looking for 'DOS'; this was just an example. My problem is that the match chews up the first tag too, starting with <!--% from UNO's tag and ending with --> from DOS's tag.

Replies are listed 'Best First'.
Re: UNO DOS
by lhoward (Vicar) on Sep 13, 2000 at 04:13 UTC
    You might be better off using HTML::Parser to parse out HTML tags, then apply your regular expression on a tag-by-tag basis. It will be very difficult to get your regular expression to work properly considering the variety and complexity that can occur in an HTML document.
      I would agree, unless this is the ONLY thing he wants to do.
Re: UNO DOS
by jreades (Friar) on Sep 13, 2000 at 03:29 UTC

    Your problem is the .*? -- you rightly tried to limit the number of don't-care characters with '?', but the regex engine still starts at the leftmost '<!--%' it can find, and from there grabs the smallest number of don't-care characters that gets it to ' DOS '... which just happens to include another '%-->' and '<!--%', the pieces it isn't supposed to grab.
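A small sketch to make the over-match visible, using the sample string from the question. The variable names ($start, $before, $after) are only for illustration. It runs the original pattern as a plain match and keeps what each capture actually grabbed:

```perl
use strict;
use warnings;

my $html = '<!--% xxxx UNO xxxx %--> <!--% xxxx DOS xxxx %-->';

my ($start, $before, $after);
if ($html =~ /<!--%(.*?) DOS (.*?)%-->/s) {
    $start  = $-[0];        # offset where the overall match began
    ($before, $after) = ($1, $2);

    print "match starts at offset $start\n";
    print "first capture:  [$before]\n";
    print "second capture: [$after]\n";
}
```

The match begins at offset 0, and the first capture swallows the whole UNO tag plus the opening of the DOS tag. The '?' only makes .* stop at the first ' DOS ' it reaches; it does not move the starting point to the right.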

    You'll need a few baselines to come up with a workable regexp:

    • Can xxxx ever include '<!--' or '-->'? (We'd better hope not)
    • Can xxxx contain only word-like characters (\w)?
    • Or can it include space characters as well (\s)?

    These would help you optimize your regexp...

    But the key point is that you need to limit your regexp to a single comment group containing 'DOS'.

    I'd suggest using:

    $html = '<!--% xxxx UNO xxxx %--> <!--% xxxx DOS xxxx %-->';
    $html =~ s/<!--%([^->]+?) DOS ([^->]+)%-->/GONE/s;
    print $html;

    It's a little ugly, and notice that it assumes that your 'xxxx' can't contain '->', which may or may not be the case.

    YMMV

      Good call on the .* There is a node around here called Death to Dot Star! which explores this further. But your regex still needs work. The elements inside the brackets are not a group; they are individual characters. That means [^->] stops at any '-' or any '>' on its own, not just at '-->', so it trips over any HTML item inside xxxx. OK?
      How about:
      $html =~ s/<!--%((?:(?!%-->).)*?) DOS (.*?)%-->/GONE/s;
      I'm not sure about that regex, I've never used a zero-width negative look-ahead assertion, but I think that's the right direction.
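A negative look-ahead is zero-width, so on its own it consumes nothing. To walk through the tag body it has to be paired with a '.' and repeated, so that the check is made again before every character. Here is a minimal sketch of that idea, assuming (as the poster states further down) that xxxx never contains %-->. The sample input and the $body name are made up for illustration:

```perl
use strict;
use warnings;

# Two tags; the first contains an ordinary HTML comment, the second spans lines.
my $html = "<!--% one UNO <!-- plain --> %-->\n<!--% two\n DOS three %-->";

# Step forward one character at a time, but never across a closing %-->.
my $body = qr/(?:(?!%-->).)*?/s;

my ($pre, $post);
(my $out = $html) =~ s/<!--%($body) DOS ($body)%-->/GONE/
    and ($pre, $post) = ($1, $2);

print $out, "\n";
```

Because the first group can no longer cross a %-->, an attempt that starts at the UNO tag fails outright and the engine moves on to the tag that really contains ' DOS '. The ordinary <!-- plain --> comment is left alone, since only %--> acts as a stop.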

        I've been sweating buckets about this one ever since I left the office... <visions of minus 30XP dancing in my head > which is, of course, just the time to realize that you screwed up the regexp. :^P

Re: UNO DOS
by Anonymous Monk on Sep 13, 2000 at 03:41 UTC
    Yeah, that's part of the problem.. the xxxx can contain HTML code, which may contain regular comments (<!-- -->). However, xxxx will never contain <!--% or %-->; those are only used as braces. And to mirod: another thing is that there could be any number of these tags before/after the tag we're intending to grab..
RE: UNO DOS, HTML
by runrig (Abbot) on Sep 13, 2000 at 04:02 UTC
    My attempt:

    $html =~ s/(<!--%(.*?)%-->)/($2=~m|DOS|)? 'GONE' : $1/esg;
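A sketch of how the /e approach plays out on a longer input. The sample text and the $wanted variable are invented here, not taken from the thread. The outer pattern matches one tag at a time, so .*? can never run past a %-->. The replacement is evaluated as Perl code that either returns 'GONE' or puts the tag back unchanged:

```perl
use strict;
use warnings;

my $html = join "\n",
    '<!--% a UNO b %-->',
    '<!--% c <!-- plain comment --> DOS',
    '   d %-->',
    '<!--% e TRES f %-->';

my $wanted = 'DOS';    # hypothetical: the keyword to look for

# Copy $1 and $2 into lexicals first, because a successful inner match
# would overwrite the capture variables.
(my $out = $html) =~ s{(<!--%(.*?)%-->)}{
    my ($tag, $body) = ($1, $2);
    $body =~ /\Q$wanted\E/ ? 'GONE' : $tag;
}esg;

print $out, "\n";
```

With /g, every tag whose body contains the keyword is replaced, not just the first one. The body is available inside the replacement code, so the xxxx text can be saved there if it is needed.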
RE: UNO DOS, HTML
by mirod (Canon) on Sep 13, 2000 at 03:14 UTC

    Just grab the first comment:

    $html = '<!--% xxxx UNO xxxx %--> <!--% xxxx DOS xxxx %-->';
    $html =~ s/(<!--.*?-->\s*)<!--%(.*?) DOS (.*?)%-->/$1GONE/s;
    print $html;

    Or make sure there is a comment beforehand:

    $html = '<!--% xxxx UNO xxxx %--> <!--% xxxx DOS xxxx %-->';
    $html =~ s/-->\s*<!--%(.*?) DOS (.*?)%-->/--> GONE/s;
    print $html;

    There is probably a cleaner way to do this without capturing the first comment at all, or by using a g modifier, skipping the first comment and replacing the second.
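One possible reading of the "g modifier" idea, offered as a sketch rather than as what the poster had in mind: scan the tags one at a time with //g, skip those whose body lacks the keyword, and splice in the replacement at the first hit. The sample input and variable names are assumptions for illustration:

```perl
use strict;
use warnings;

my $html = '<!--% a UNO b %--> <!--% c TRES d %--> <!--% e DOS f %--> <!--% g DOS h %-->';
my ($pre, $post);

while ($html =~ /<!--%(.*?)%-->/sg) {
    my ($start, $end) = ($-[0], $+[0]);    # offsets of the whole tag
    my $body = $1;

    # Split the body around the keyword; skip tags that do not contain it.
    ($pre, $post) = $body =~ /\A(.*?) DOS (.*)\z/s
        or next;

    substr($html, $start, $end - $start, 'GONE');    # replace just this tag
    last;                                            # stop after the first hit
}

print $html, "\n";
```

This copes with any number of tags before or after the target and keeps the two xxxx pieces in $pre and $post. Unlike a substitution with /g, it replaces only the first matching tag.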

Re: UNO DOS
by Anonymous Monk on Sep 13, 2000 at 04:35 UTC
    well, sounds like I have to parse out each tag individually..
    yucky
    I guess I'll try to figure out some slob fix right now and rework the design of the templates..
      Instead of reworking "the design of the templates" why not use one of the many text/html templating modules already in place?
      My answer DOES parse each tag individually, it uses a regex inside a regex, and seems to work.
        yeah I got it.. I was trying to avoid that, but I guess I can't.. (?)
        thanks tho