Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I wish to search backwards from a given point in a large scalar using a regular expression. Using the <samp>pos</samp> function, and <samp>m//g</samp> I can set an arbitrary position in a scalar, and use <samp>m/\G.../g</samp> to search from the position forward. I want to search backwards from this point. Ideally, a construction like the following would work:
if($doc =~ /foo\G/g) { print "position preceeded by foo!\n"; }
but perl requires that <samp>\G</samp> be placed only at the beginning of the regex.

One option is to make a substring of the document up to this position, and then just use <samp>$</samp> to mark the end. But this is a large document (all in one scalar) that will be matched this way many times, and string copying is expensive.

Replies are listed 'Best First'.
Re: Backwards searching with regexps
by chromatic (Archbishop) on Mar 21, 2000 at 23:35 UTC
    Would you be willing to copy the string once? Use reverse on the scalar and match that way. (Whether or not this will work depends on the problem domain, but it's the easiest solution I can see, without using the very expensive $' and $` operators.)
Re: Backwards searching with regexps
by jerji (Novice) on Mar 22, 2000 at 00:53 UTC
    Another way would be to match the portion before it. For example:
    my $point = 15; if ($str =~ /^(.{15})/ && $1 =~ /(regex)$/) { print "stuff before pos 15: $1\n"; }
    This way, you at least don't have to copy the full string before searching. You only copy as much as length($1).

    If you know the approximate length of the string that will be matched, you can minimize what's copied even further by using substr():
    my $point = 15; my $start = 10; if (substr($str, $start, $point - $start) =~ /(regex)$/) { print "stuff before pos: $1\n"; }
    If you only need to know whether or not something matched, and not actually fetch the resulting match, you can use one of Perl 5.005's look-behind assertions:
    pos $str = 15; print "matched\n" if $str =~ /\G(?<=regex)/g;
    There are probably better ways to deal with this, but if you post an example string, a heuristic might arise that eliminates the need to do any of this.
      Addressing all comments...

      I am using this for a program which identifies and later removes banner ads from html. The first part identifies an ad using <samp>/<\s*a ^>*href\s*=^>*>.*?<\s*\/a\s*></samp>. From there I wish to suck in tags which surround the ad. For example, consider:

      <center><a href=...ad...><img src=...ad...></a></center>
      
      I want to consider the <center> tag to be part of the ad. In fact, I want to suck in any surrounding tag, including table elements, tables, etc. also ignoring whitespace, comments, and certain other tags (like <p>, <br>, etc). The closing tag is easy to identify, it's the preceeding one that's difficult.
      1. lookbehind assertions won't work because perl only implements fixed-width lookbehinds.
      2. Reversing the string would work, and doing two separate matches, one on the original string, and one on the reversed string, keeping track of positions using m//g. But this requires writing regexp's backwards. This makes my brain hurt.
      3. I have tried the construction m/.{$pos}/g to mark the position $pos in a string, and discovered that code using this construction is approx. 4 times slower than code using \G. One way to do this would be to use m/^.{$pos}.../g and m/....{$pos}/g to identify a distance away from the beginning and end of the string, respectively. But this requires the regex engine to examine each and every character up to that position (I think), which is slow. A better way to do this would be to use something like the \G construction, but with the ability to mark arbitrary positions in the string. (i.e. matching to a substring without having to actually extract and copy the substring)
      4. Matching the preceeding string first isn't an option because it's the ad that is important. anything could preceed it.

      Thank you, o wise monks. The program in question is FilterProxy.

        If it were up to me, I would immediately move to HTML::Parser instead of trying to craft a regexp by hand. You'll keep your hair longer. HTML::TokeParser is another good option.

        Pragmatically, trying to match balanced tags (as HTML is supposed to be) with a regular expression is an exercise in futility for any data which you don't generate yourself, from another program. There are just too many corner cases which pop up surprisingly often.

Re: Backwards searching with regexps
by hexram (Initiate) on Mar 22, 2000 at 00:32 UTC
    I wonder... May I state the problem as: Find any match of this pattern (say patA) that is followed by this other pattern (say patB)? Then, I suggest: Could you find first for patA - in your question it would be the pattern to be matched backwards - and then find next for patB - which you did match first - and:
    1. Possibly avoid issues like "match backwards right-to-left or left-to-right? - i.e. would I tell it to match 'tac' backwards to find previous 'cat'".
    2. Allow you to save this offset from the start of the scalar.
Re: Backwards searching with regexps
by dlowe (Initiate) on Mar 22, 2000 at 08:10 UTC
    If you're trying to slurp up the preceding and proceding tag indiscriminately, can't you just tack this onto the start of your regexp: '(<^<>*>^<>*)?' and this onto the end: '(^<>*</^<>*>)?' Or am I missing something? (My square brackets are being eaten, but hopefully you get the idea...)