Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I need a regex for the lines below, but I can only think of multiple solutions, which seems wasteful.

In a file I may have:

gif"></td><td>XXX</td></tr>
gif"></td><td>XXX</td></tr>
gif"></td><td><b>XXX</td><td><b>yyy</td></tr>

I need to match the XXX section, which may contain the characters A-Z0-9()[]. yyy only exists on the lines with <b> and will have the same character types as XXX, but I want to ignore it.

I have the following:

m/gif\S+<td><b>(\S+)<\/td><td><b>[A-Z]/;
m/gif\S+<td>(\S+)<\/td><td><b>[A-Z]/;
m/gif\S+<td>(\S+)<\/td><\/tr>$/;

Thanks

Edited by planetscape - fixed bold, code, etc. markup

Re: Pattern Match Problem
by McDarren (Abbot) on Mar 20, 2006 at 12:08 UTC
    Maybe something like this:
    #!/usr/bin/perl -wl
    use strict;

    while (<DATA>) {
        my $wanted;
        if (($wanted) = $_ =~ m/^gif"><\/td><td>(?:<b>)?([A-Z0-9\(\)\[\]]+?)<\/td>/) {
            print $wanted;
        }
    }
    __DATA__
    gif"></td><td>XXX</td></tr>
    gif"></td><td>XXX</td></tr>
    gif"></td><td><b>XXX</td><td><b>yyy</td></tr>
    Notes:

    (?:<b>)?
    - makes the <b> optional, and doesn't capture it

    ([A-Z0-9\(\)\[\]]+?)
    - is where you capture your wanted string. Note the trailing +? - this makes it non-greedy, so that it only captures up until the first </td>
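
As a small check of the note above: because the character class cannot match <, the greedy and non-greedy quantifiers should capture the same text on these sample lines. This is only a sketch; the helper name capture_both is hypothetical and not part of the original reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Allowed characters for the wanted cell. Inside a character class
# the parentheses are literal, so they need no backslash.
my $class = qr/[A-Z0-9()\[\]]/;

# capture_both() is a hypothetical helper: it returns what the lazy (+?)
# and the greedy (+) version of the pattern capture from one line.
sub capture_both {
    my ($line) = @_;
    my ($lazy)   = $line =~ m/^gif"><\/td><td>(?:<b>)?($class+?)<\/td>/;
    my ($greedy) = $line =~ m/^gif"><\/td><td>(?:<b>)?($class+)<\/td>/;
    return ($lazy, $greedy);
}

for my $line (
    'gif"></td><td>XXX</td></tr>',
    'gif"></td><td><b>XXX</td><td><b>yyy</td></tr>',
) {
    my ($lazy, $greedy) = capture_both($line);
    print "lazy=$lazy greedy=$greedy\n";
}
```

Both lines print lazy=XXX greedy=XXX, so the +? does no harm here, but it is the character class that stops the capture at </td>.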

    Having said the above, be aware that there are several CPAN modules available for parsing HTML.

    Hope this helps,
    Darren :)

      Thanks for that; it introduced me to a few new regex tricks.

        Please also pay attention to the reply which avoids embedded escaped delimiters.
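
One way to avoid escaped delimiters, sketched under assumptions: with m{...} instead of m/.../, the forward slashes in </td> need no backslashes. The helper name extract_cell and the sample lines are made up for illustration and are not taken from any reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_cell() is a hypothetical helper name used only for this sketch.
sub extract_cell {
    my ($line) = @_;
    # m{...} lets "/" appear literally, so </td> is written as-is.
    # Inside [...] the parentheses are literal; only the brackets are escaped.
    my ($wanted) = $line =~ m{^gif"></td><td>(?:<b>)?([A-Z0-9()\[\]]+)</td>};
    return $wanted;
}

print extract_cell($_), "\n" for (
    'gif"></td><td>XXX</td></tr>',
    'gif"></td><td><b>XXX</td><td><b>yyy</td></tr>',
);
```

Both sample lines print XXX. Any balanced or otherwise unused delimiter works the same way; braces are just a common choice when the pattern is full of slashes.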

Re: Pattern Match Problem (OT Observation)
by ww (Archbishop) on Mar 20, 2006 at 14:28 UTC
    While understanding that you may be stuck with that .html, it has a couple of problems:
    • One, like it or not, the current "wisdom" is to use CSS instead of presentation/rendering instructions like <b>.
    • Second, and perhaps more critical to the perl side of this, if the .html does use <b>, then it should (and may in the future) also use a </b>.
      Otherwise, it won't validate, and (more important) most modern browsers will render everything that follows as bold, in contrast to the behavior of older ones which -- in effect -- assumed that if they encountered a tag required to be balanced that was NOT balanced, they would close that tag at a </p> or </td> or </li> (etc).

    On the other hand, v5 or 6-generation browsers behave more in the manner of the Recent CB Messages, where unbalanced rendering tags foobar everything which follows, until someone fixes it ("...lends the CB a </whatever>" tag).

    And the relevance to your perl? Well, if the source .html becomes compliant, then your solution may need to deal with more capable parsing of that .html than (simple) regexen accommodate... at which point your solution may come to be dependent upon using an appropriate module -- be that one of the .html parsers or something from the Balanced:: family.

Re: Pattern Match Problem
by TedPride (Priest) on Mar 20, 2006 at 18:15 UTC
    while (<DATA>) {
        print "$1\n" if m/gif"><\/td><td>(?:<b>)?([A-Z0-9\(\)\[\]]+)(?:<\/b>)?<\/td>(?:<td>(?:<b>)?([A-Z0-9\(\)\[\]]+)(?:<\/b>)?<\/td>)?<\/tr>/;
    }
    __DATA__
    gif"></td><td>XXX</td></tr>
    gif"></td><td><b>XXX</td></tr>
    gif"></td><td><b>XXX</b></td></tr>
    gif"></td><td><b>XXX</td><td><b>YYY</td></tr>
    This should do what you want, and it takes care of all the <b> and </b> cases. Don't ask me to explain it, though.
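
For readers who do want an explanation, the same pattern can be sketched with the /x modifier, which allows whitespace and comments inside the regex. The names $cell, $row and first_cell are assumptions made for this illustration; the logic is intended to mirror the pattern above rather than replace it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Characters allowed in a wanted cell (same class as in the pattern above).
my $cell = qr/[A-Z0-9()\[\]]+/;

# The whole row, spelled out piece by piece. Under /x, whitespace and
# comments inside the pattern are ignored by the regex engine.
my $row = qr{
    gif"> </td>                                   # tail of the image cell
    <td> (?:<b>)? ($cell) (?:</b>)? </td>         # first cell, capture group 1
    (?: <td> (?:<b>)? ($cell) (?:</b>)? </td> )?  # optional second cell, group 2
    </tr>                                         # end of the row
}x;

# first_cell() is a hypothetical helper returning group 1, or undef on no match.
sub first_cell {
    my ($line) = @_;
    return $line =~ $row ? $1 : undef;
}

print first_cell($_), "\n" for (
    'gif"></td><td>XXX</td></tr>',
    'gif"></td><td><b>XXX</b></td></tr>',
    'gif"></td><td><b>XXX</td><td><b>YYY</td></tr>',
);
```

Each sample line prints XXX. One caveat when commenting a regex this way: variables are still interpolated inside the comments, so it is safest to keep $ and @ out of them.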
Re: Pattern Match Problem
by Anonymous Monk on Mar 20, 2006 at 23:32 UTC
    $ perl -le'
    @x = (
        q[gif"></td><td>XXX</td></tr>],
        q[gif"></td><td>XXX</td></tr>],
        q[gif"></td><td><b>XXX</td><td><b>yyy</td></tr>]
    );
    for ( @x ) {
        print $1 if /gif">(?:<[^>]*>)*([^<]*)/
    }
    '
    XXX
    XXX
    XXX