in reply to Backwards searching with regexps

Another way would be to match the portion before it. For example:
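my $point = 15;
# /s lets . match newlines, so the prefix is found in multi-line strings too
if ($str =~ /^(.{$point})/s && $1 =~ /(regex)$/) {
    print "stuff before pos $point: $1\n";
}
This way, you at least don't have to copy the full string before searching. You copy only the first $point characters, i.e. the length of the first $1.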

If you know the approximate length of the string that will be matched, you can minimize what's copied even further by using substr():
my $point = 15;
my $start = 10;
if (substr($str, $start, $point - $start) =~ /(regex)$/) {
    print "stuff before pos: $1\n";
}
If you only need to know whether or not something matched, and not actually fetch the resulting match, you can use one of Perl 5.005's look-behind assertions:
pos($str) = 15;
print "matched\n" if $str =~ /\G(?<=regex)/g;
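To make the look-behind variant concrete, here is a minimal sketch. The helper name ends_at and the sample string are made up for illustration, and the pattern passed in is assumed to be fixed-width, since that is what look-behind requires in Perls of this era.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text ending exactly at offset $point
# matches $re. $re is assumed to be fixed-width so it can sit in (?<=...).
sub ends_at {
    my ($str, $point, $re) = @_;
    pos($str) = $point;                     # \G will anchor at this offset
    return $str =~ /\G(?<=$re)/g ? 1 : 0;   # zero-width test, no capture
}

my $str = "hello world, goodbye";
# "hello world" is 11 characters long, so "world" ends at offset 11.
print "matched\n" if ends_at($str, 11, qr/world/);
```

Because the match is zero-width, it only answers yes or no; it cannot hand back what was matched.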
There are probably better ways to deal with this, but if you post an example string, a heuristic might arise that eliminates the need to do any of this.

RE: Re: Backwards searching with regexps
by Anonymous Monk on Mar 22, 2000 at 05:50 UTC
    Addressing all comments...

    I am using this for a program which identifies and later removes banner ads from html. The first part identifies an ad using /<\s*a [^>]*href\s*=[^>]*>.*?<\s*\/a\s*>/. From there I wish to suck in tags which surround the ad. For example, consider:

    <center><a href=...ad...><img src=...ad...></a></center>
    
    I want to consider the <center> tag to be part of the ad. In fact, I want to suck in any surrounding tag, including table elements, tables, etc., while ignoring whitespace, comments, and certain other tags (like <p>, <br>, etc.). The closing tag is easy to identify; it's the preceding one that's difficult.
    1. Lookbehind assertions won't work because Perl only implements fixed-width lookbehinds.
    2. Reversing the string would work: do two separate matches, one on the original string and one on the reversed string, keeping track of positions using m//g. But this requires writing regexps backwards, which makes my brain hurt.
    3. I have tried the construction m/.{$pos}/g to mark the position $pos in a string, and discovered that code using this construction is approximately 4 times slower than code using \G. One way to do this would be to use m/^.{$pos}.../g and m/....{$pos}$/g to identify a distance away from the beginning and end of the string, respectively. But this requires the regex engine to examine each and every character up to that position (I think), which is slow. A better way would be something like the \G construction, but with the ability to mark arbitrary positions in the string (i.e. matching against a substring without having to actually extract and copy the substring).
    4. Matching the preceding string first isn't an option because it's the ad that is important. Anything could precede it.
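For the simple case of one directly enclosing tag pair, the two directions can be handled separately: an end-anchored match on the text before the link, and \G on the text after it. The following is only a sketch under assumptions: widen_ad is a made-up name, the link pattern is one reading of the pattern quoted above (with [^>]* character classes), and it handles a single level of surrounding tag without skipping comments, <p> or <br>. It also copies the prefix with substr(), so it trades the speed concern in point 3 for readability.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the (start, end) offsets of the first ad link
# in $html, widened by one enclosing tag pair if one directly surrounds it.
# Returns an empty list if no link is found.
sub widen_ad {
    my ($html) = @_;

    # Forward match for the link itself; @- and @+ hold the match offsets.
    return unless $html =~ /<\s*a\s[^>]*href\s*=[^>]*>.*?<\s*\/a\s*>/is;
    my ($start, $end) = ($-[0], $+[0]);

    # Backwards: does the text before the link end with an opening tag?
    my $before = substr($html, 0, $start);
    if ($before =~ /(<\s*(\w+)[^>]*>\s*)\z/) {
        my ($open, $name) = ($1, $2);

        # Forwards: \G anchors at the end of the link, so nothing is rescanned.
        pos($html) = $end;
        if ($html =~ /\G\s*<\s*\/\s*\Q$name\E\s*>/gi) {
            $start -= length $open;
            $end    = pos($html);
        }
    }
    return ($start, $end);
}

my $html = 'x <center><a href="ad"><img src="ad.gif"></a></center> y';
my ($s, $e) = widen_ad($html);
print substr($html, $s, $e - $s), "\n";
```

Calling it repeatedly on the widened span would be one way to climb outwards through table cells and tables, though that is untested here.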

    Thank you, o wise monks. The program in question is FilterProxy.

      If it were up to me, I would immediately move to HTML::Parser instead of trying to craft a regexp by hand. You'll keep your hair longer. HTML::TokeParser is another good option.

      Pragmatically, trying to match balanced tags (as HTML is supposed to be) with a regular expression is an exercise in futility for any data you didn't generate yourself with another program. There are just too many corner cases, and they pop up surprisingly often.
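As a rough illustration of the tokenizer route, the sketch below walks the token stream from HTML::TokeParser and records which tag was open when each link started. The helper name enclosing_tags and the list of ignored tags are assumptions made for this example, not part of the module, and the module has to be installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: for each <a href=...> in $html, return the name of the
# innermost tag still open when the link started (undef if there is none).
sub enclosing_tags {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot create parser";
    my (@open, @found);
    my %skip = map { $_ => 1 } qw(p br img hr);   # tags we choose not to track

    while (my $t = $p->get_token) {
        my ($type, $tag) = @$t;
        if ($type eq 'S') {                       # start tag
            if ($tag eq 'a' && exists $t->[2]{href}) {
                push @found, (@open ? $open[-1] : undef);
            }
            push @open, $tag unless $skip{$tag};
        }
        elsif ($type eq 'E' && grep { $_ eq $tag } @open) {
            # Pop back to the matching start tag, tolerating sloppy nesting.
            1 while @open && pop(@open) ne $tag;
        }
    }
    return @found;
}

my @tags = enclosing_tags('<table><tr><td><center><a href="x"><img src="y"></a></center></td></tr></table>');
print "link is inside <$tags[0]>\n";
```

The parser deals with quoting, case and comments, so the code only has to decide which surrounding tags count as part of the ad.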