httptech has asked for the wisdom of the Perl Monks concerning the following question:

Ok, so I'm working on another node-grabbing routine. This time I want to parse my personal nodes to extract the titles of nodes I've added that are "original nodes". By original I mean not in reply to another node.

So I am testing using a local file and a loop that looks like this:

    while (<>) {
        my $node;
        $node = $1 if /^<TR bgcolor=.*>([^<]*)<\/a>/io;
        next if $node =~ /^re:/io;
        print "Found original: $node\n" if $node;
    }

This works just fine. However I am wondering, as I often do, is there Another Way To Do This?

Basically I need to match all titles that don't start with "RE:". According to page 230 of Mastering Regular Expressions, this concept is called lookbehind, and it also tells me I Can't Do That In Perl.

So I find myself wondering: WWCABD? (What would chromatic and btrott do? <g>)

Replies are listed 'Best First'.
Re: lookbehind
by mdillon (Priest) on May 07, 2000 at 21:02 UTC
    that particular part of Mastering Regular Expressions is out of date and no longer correct. if i recall correctly, MRE is written against 5.004 and lookbehind was added to 5.005.

    the lookbehind syntax is as follows:

    0-width positive lookbehind assertion: (?<=pattern)
    0-width negative lookbehind assertion: (?<!pattern)
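
A minimal sketch of the two lookbehind forms in action. The sample string and variable names are made up for illustration. Note that classic Perl lookbehind only accepts fixed-width patterns, which is one reason lookahead tends to be the easier tool when the text to exclude follows the current position.

```perl
use strict;
use warnings;

# Hypothetical sample text, not taken from the thread.
my $text = 'foobar bazbar';

# Positive lookbehind: match 'bar' only when 'foo' sits right before it.
my @after_foo;
push @after_foo, $-[0] while $text =~ /(?<=foo)bar/g;

# Negative lookbehind: match 'bar' only when 'foo' does NOT precede it.
my @not_after_foo;
push @not_after_foo, $-[0] while $text =~ /(?<!foo)bar/g;

# $-[0] is the start offset of the most recent successful match.
print "preceded by foo at offset:     @after_foo\n";
print "not preceded by foo at offset: @not_after_foo\n";
```

Neither assertion consumes any characters, so the recorded offsets point at 'bar' itself, not at the text that was looked behind.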

    however, i think what you want in this case is actually a negative lookahead assertion like the following:

    m{^<TR bgcolor=.*>(?!(?:re:\s*)+)([^<]*)</a>}i
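
A rough sketch of how the negative lookahead could be dropped into the original loop. The sample rows are an assumption about what the saved page looks like, and original_title is a hypothetical helper name. The pattern is delimited with m{...} because a bare ! delimiter would collide with the ! inside (?! ... ).

```perl
use strict;
use warnings;

# Hypothetical helper: returns the title when the row holds an original
# node, or undef when the row is a reply or does not match at all.
sub original_title {
    my ($line) = @_;
    return $line =~ m{^<TR bgcolor=.*>(?!(?:re:\s*)+)([^<]*)</a>}i ? $1 : undef;
}

# Assumed row markup; the real page may differ.
my @rows = (
    '<TR bgcolor=ffffff><TD><a href="?node_id=1">lookbehind</a></TD></TR>',
    '<TR bgcolor=ffffff><TD><a href="?node_id=2">Re: lookbehind</a></TD></TR>',
);

for my $row (@rows) {
    my $title = original_title($row);
    print "Found original: $title\n" if defined $title;
}
```

With the lookahead in place, the separate "next if ... /^re:/" test is no longer needed, since replies never match in the first place.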

    also, the 'o' modifier is not very useful in your examples, since there are no variables in the regexp to interpolate.

      I'm a little confused... I thought the "o" modifier was used only when you have no variables to interpolate. Doesn't it tell Perl to compile the regex one time instead of every time it's evaluated, thus speeding up the program? Please clarify the role of "o" because I must have missed something important.
        since a regular expression without variable interpolation doesn't have the potential to change between uses, Perl only ever compiles it once. however, when a regular expression contains a variable to be interpolated, the value of the variable can potentially be different every time, so Perl compiles the regular expression with the current value of the variable every time it is used. the 'o' modifier tells Perl that the values of interpolated variables should be treated as constant, so that Perl will only compile the regular expression 'once' no matter how many times you use it.

        i'm not sure whether or not using 'o' on a regexp without any variable interpolation adversely affects performance, but it is simply unnecessary.
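
A small sketch of the difference described above, using two hypothetical subs that differ only in the 'o' modifier. With 'o', the pattern is fixed at whatever the variable held the first time the match ran, so later values are ignored.

```perl
use strict;
use warnings;

# Hypothetical helpers: identical except for the 'o' modifier.
sub match_with_o    { my ($pat, $str) = @_; return $str =~ /$pat/o ? 1 : 0 }
sub match_without_o { my ($pat, $str) = @_; return $str =~ /$pat/  ? 1 : 0 }

# First call compiles /foo/. With 'o' the second call still uses /foo/,
# so 'bar' fails to match even though $pat is now 'bar'.
my @with    = (match_with_o('foo', 'foo'),    match_with_o('bar', 'bar'));
my @without = (match_without_o('foo', 'foo'), match_without_o('bar', 'bar'));

print "with o:    @with\n";
print "without o: @without\n";
```

This is also why 'o' is a promise rather than a free speedup: it only gives correct results when the interpolated variable really does stay constant.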

Re: lookbehind
by chromatic (Archbishop) on May 09, 2000 at 19:12 UTC
    One thing chromatic might do is strip the HTML down to a manageable chunk with judicious use of split. Assuming you can pull the table out (easy enough to do, "These nodes all have stuff by user" is a good phrase), you can split on the <TR bgcolor= bit, resulting in an array of lines to parse.

    Pull off the HREF bit, up to the closing angle bracket, and you'll have the node title at the start. The regex there checking for re (case insensitive) is exactly what I'd use.

    This requires you to pull in the whole page at once, but it won't be big enough to eat up a lot of memory.
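
The split approach might look roughly like this. It is only a sketch: the sample markup is an assumption about the page, original_titles is a hypothetical name, and the step of cutting the table out of the full page is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the table HTML as one string and returns
# the titles of rows that are not replies.
sub original_titles {
    my ($html) = @_;
    my @titles;
    for my $row (split /<TR bgcolor=/i, $html) {
        # Pull off everything through the closing '>' of the <a href=...> tag.
        # Chunks without a link (such as the leading empty one) are skipped.
        next unless $row =~ s/^.*?<a\s+href=[^>]*>//is;
        # The node title now sits at the start, up to the next tag.
        my ($title) = $row =~ /^([^<]*)/;
        next if $title =~ /^re:/i;    # skip replies
        push @titles, $title;
    }
    return @titles;
}

# Assumed markup; the real page may differ.
my $html = join "\n",
    '<TR bgcolor=ffffff><TD><a href="?node_id=1">lookbehind</a></TD></TR>',
    '<TR bgcolor=eeeeee><TD><a href="?node_id=2">Re: lookbehind</a></TD></TR>',
    '<TR bgcolor=ffffff><TD><a href="?node_id=3">node grabber</a></TD></TR>';

print "Found original: $_\n" for original_titles($html);
```

To run it against a saved page instead of the sample string, the whole file can be read in one go with something like: my $html = do { local $/; <> };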