gryphon has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow monks...

This is a fairly silly regex question, but I'm losing my mind over it. Please help. :) Right now I have:

/(\W|\b)($worda)(\W+)($wordb)(\W|\b)/

It's the center (\W+) that's the problem. Right now this matches most of what I need. However, it misses when the text I'm searching has something like "worda <U>wordb".
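
A minimal sketch of the failure described above, assuming the hypothetical words 'foo' and 'bar' stand in for $worda and $wordb. The "U" inside the tag is a word character, so \W+ cannot span it:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for $worda and $wordb.
my ($worda, $wordb) = ('foo', 'bar');

# The pattern from the question, unchanged.
my $re = qr/(\W|\b)($worda)(\W+)($wordb)(\W|\b)/;

my %matched;
for my $text ('foo, bar', 'foo <U>bar', 'foo </U> bar') {
    # \W+ cannot cross the "U" inside the tag, so tagged text fails.
    $matched{$text} = $text =~ $re ? 1 : 0;
    printf "%-14s %s\n", $text, $matched{$text} ? 'match' : 'no match';
}
```

Only the first sample matches; both tagged samples are missed.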

What I'd like to do is have something like:

(\W+|<.?>|<.?>\W+|\W+<.?>|<.?>\W+<.?>|...)

In other words, find me \W+ but ignore all HTML tags (but include them in $3). See, I told you this was a silly and probably obvious question. But I'm stuck. Please help. Thanks. :)

-Gryphon.

Replies are listed 'Best First'.
(Ovid) Re: Silly Regex question...
by Ovid (Cardinal) on Jan 11, 2001 at 04:10 UTC
    Actually, it's not an obvious question until you get more familiar with regular expressions. In a nutshell: "don't use them to parse HTML." HTML is simply too complex and arbitrary for a reasonable regex to handle. You need to try something like HTML::Parser instead.

    If you go down the misguided path of trying to write regexes to parse HTML, you'll eventually wind up with monstrosities like this:

    $data =~ s/
        (                    # Capture to $1
            <a\s             # <a and a space character
            (?:              # Non-capturing parens
                [^>](?!href) # All non > not followed by href
            )*               # zero or more of them
            href\s*          # href followed by zero or more space characters
        )
        (                    # Capture to $2
            &#61;\s*         # = plus zero or more spaces
            (                # Capture to $3
                &[^;]+;      # some HTML character code (probably " or ')
            )?               # which might not exist
            (?:              # Non-grouping parens
                .(?!\3)      # any character not followed by $3
            )+               # one or more of them
            (?:
                \3           # $3
            )?               # (which may not exist)
        )
        (                    # Capture to $4
            [^>]+            # Everything up to final >
            >                # Final >
        )
    /$1 . decode_entities($2) . $4/gsexi;
    Which I was trying to debug in this node.

    Sorry for the bad news.

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

      This answer would be more relevant if the original poster were actually asking how to parse HTML. But this is not what the original poster asked; using HTML::Parser to perform a straightforward substitution like this, on HTML that is known ahead of time, would be overkill. I think the fact that my regex met the original poster's needs is sufficient proof of this.
Re: Silly Regex question...
by chipmunk (Parson) on Jan 11, 2001 at 04:15 UTC
    How about this?

    /(\W|\b)($worda)([^\w<]*(?:<[^>]*>[^\w<]*)*)($wordb)(\W|\b)/

    Match some non-word non-< characters, then zero or more occurrences of an HTML tag, each followed by some non-word non-< characters.
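
A quick sketch of the suggested pattern in use, again assuming the hypothetical words 'foo' and 'bar' for $worda and $wordb. It records what ends up in $3 for a few sample strings:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for $worda and $wordb.
my ($worda, $wordb) = ('foo', 'bar');

# The suggested pattern: non-word, non-"<" characters, then any number of
# tags, each followed by more non-word, non-"<" characters.
my $re = qr/(\W|\b)($worda)([^\w<]*(?:<[^>]*>[^\w<]*)*)($wordb)(\W|\b)/;

my %sep;
for my $text ('foo, bar', 'foo <U>bar', 'foo</U> <U>bar') {
    # $3 holds everything between the two words, tags included.
    $sep{$text} = $text =~ $re ? $3 : undef;
    printf "%-16s => %s\n", $text,
        defined $sep{$text} ? "[$sep{$text}]" : 'no match';
}
```

Note that both halves of the separator use *, so this pattern also accepts the two words with nothing between them, which the original (\W+) did not.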
      Nope. Try that with this:
      <input type="text" name="something" value="-> input stuff here <-">
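
A minimal sketch of why this input trips the pattern, assuming the tag sits between the hypothetical words 'foo' and 'bar'. The sub-pattern <[^>]*> stops at the first ">", which here is inside the value attribute rather than at the end of the tag:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for $worda and $wordb.
my ($worda, $wordb) = ('foo', 'bar');
my $re = qr/(\W|\b)($worda)([^\w<]*(?:<[^>]*>[^\w<]*)*)($wordb)(\W|\b)/;

# The tag from the reply, placed between the two words for illustration.
my $tag  = '<input type="text" name="something" value="-> input stuff here <-">';
my $text = "foo $tag bar";

# <[^>]*> ends at the first ">", which sits inside the value attribute.
my ($first_tag) = $tag =~ /(<[^>]*>)/;
my $matched = $text =~ $re ? 1 : 0;

print "tag as seen by <[^>]*>: $first_tag\n";
print $matched ? "match\n" : "no match\n";
```

After the truncated "tag", the next thing in the text is the word "input", which is neither another tag nor $wordb, so the overall match fails.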

      Cheers,
      Ovid

      Update: In light of gryphon's clarification, this seems to be an instance where chipmunk's regex would be fine. It depends upon how the data is being read in and what will be done with this in the future as to whether or not it's the best solution.


        The actual text being matched against didn't contain any occurrences like the example you give above... And you know what? That's exactly what I expected.

        How often do you have HTML that contains angle brackets within attribute values?

      I just tested this out, and it seems to work in the test-cases that I provided. Thanks for your help!

Re: Silly Regex question...
by gryphon (Abbot) on Jan 11, 2001 at 04:22 UTC

    I should have been more specific. The "HTML" that this will be parsing isn't really HTML. It's just one tag: <U> or </U>

    So there's going to be a very limited number of cases where the above is important. Does this help at all?
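
Given that only <U> and </U> can appear, one possible variant is to name those two tags explicitly instead of accepting any <...>. This is a sketch under assumptions, not a tested replacement for the pattern above: 'foo' and 'bar' are hypothetical stand-ins for $worda and $wordb, and the + is a deliberate choice to keep requiring at least one separator, as the original (\W+) did.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for $worda and $wordb.
my ($worda, $wordb) = ('foo', 'bar');

# Separator: one or more items, each either a non-word, non-"<" character
# or a literal <U> / </U> tag. Any other "<" makes the match fail.
my $re = qr{(\W|\b)($worda)((?:[^\w<]|</?U>)+)($wordb)(\W|\b)};

my %sep;
for my $text ('foo <U>bar', 'foo</U> bar', 'foo <B>bar', 'foobar') {
    # $3 still holds the whole separator, tags included.
    $sep{$text} = $text =~ $re ? $3 : undef;
    printf "%-12s => %s\n", $text,
        defined $sep{$text} ? "[$sep{$text}]" : 'no match';
}
```

Because the two alternatives can never match at the same position, the repetition does not introduce ambiguous backtracking.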