RE: Re: Backwards searching with regexps

Addressing all comments...

I am using this for a program which identifies and later removes banner ads from HTML. The first part identifies an ad using <samp>/<\s*a [^>]*href\s*=[^>]*>.*?<\s*\/a\s*>/</samp>. From there I wish to suck in the tags which surround the ad. For example, consider:

<center><a href=...ad...><img src=...ad...></a></center>

I want to consider the <center> tag to be part of the ad. In fact, I want to suck in any surrounding tag, including table elements, tables, etc., while ignoring whitespace, comments, and certain other tags (like <p>, <br>, etc.). The closing tag is easy to identify; it's the preceding one that's difficult.
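To make the "easy" half concrete, here is a minimal sketch of finding the ad and then extending the match forward over closing tags with \G. The sample input, the /i and /s flags, and the short list of wrapper tags (center, td, tr, table) are assumptions for illustration, not what FilterProxy actually does.

```perl
use strict;
use warnings;

# Hypothetical sample input, modelled on the example above.
my $html = '<p>text</p><center><a href="http://ads.example/x">'
         . '<img src="http://ads.example/x.gif"></a></center><p>more</p>';

# The ad pattern from the post; the /i and /s flags are an assumption.
my $ad_re = qr/<\s*a [^>]*href\s*=[^>]*>.*?<\s*\/a\s*>/is;

my ($start, $end);
if ($html =~ /$ad_re/g) {
    # Offsets of the ad itself: start and end of the whole match.
    ($start, $end) = ($-[0], $+[0]);

    # \G anchors at pos($html), which now sits just after the ad.
    # /c keeps pos() where it is when the match finally fails.
    while ($html =~ /\G\s*<\s*\/\s*(?:center|td|tr|table)\s*>/gci) {
        $end = pos($html);
    }
}

# The ad plus any trailing wrapper tags that were absorbed.
my $ad_with_closers = substr($html, $start, $end - $start);
```

After this runs, $start still points at the <a> tag, which is exactly the problem: nothing here can walk backwards from $start to pick up the opening <center>.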

Lookbehind assertions won't work because Perl only implements fixed-width lookbehinds.
Reversing the string would work: do two separate matches, one on the original string and one on the reversed string, keeping track of positions using m//g. But this requires writing regexps backwards, which makes my brain hurt.
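For the record, a rough sketch of what the reversed-string approach might look like. The input, the $ad_start offset, and the tag list are hypothetical, and the backwards pattern only handles opening tags without attributes.

```perl
use strict;
use warnings;

# Hypothetical input; $ad_start would normally come from the ad match.
my $html     = '<p>text</p> <center> <a href="x">ad</a></center>';
my $ad_start = index($html, '<a ');

# Reverse everything before the ad, so "preceding" becomes "following".
my $before = scalar reverse substr($html, 0, $ad_start);

my $start = $ad_start;
# Every pattern has to be spelled backwards: "<center>" reads ">retnec<".
# Only attribute-less opening tags are handled in this sketch.
while ($before =~ /\G\s*>\s*(?:retnec|dt|rt|elbat)\s*</gci) {
    # Translate the position in the reversed prefix back to the original.
    $start = $ad_start - pos($before);
}

# Text from the outermost absorbed opening tag onwards.
my $from_wrapper = substr($html, $start);
```

It works for this input, but note that the prefix is copied and reversed for every ad, and that each tag pattern has to be maintained in mirror image.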
I have tried the construction m/.{$pos}/g to mark the position $pos in a string, and discovered that code using this construction is approximately 4 times slower than code using \G. One way to do this would be to use m/^.{$pos}.../g and m/....{$pos}/g to identify a distance away from the beginning and end of the string, respectively. But this requires the regex engine to examine each and every character up to that position (I think), which is slow. A better way would be something like the \G construction, but with the ability to mark arbitrary positions in the string (i.e., matching against a substring without actually having to extract and copy it).
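One thing that may cover part of this: pos() is assignable, so \G can be pointed at an arbitrary offset before matching. A minimal sketch, with a made-up input and offset; as far as I know a /g match begins at pos() rather than scanning from the start of the string, but I have not benchmarked it against m/.{$pos}/g. It only fixes where the match starts, it does not bound where it may end.

```perl
use strict;
use warnings;

# Hypothetical input and offset, for illustration only.
my $html   = '<table><tr><td><a href="x">ad</a></td></tr></table>';
my $offset = index($html, '<td>');

# pos() is an lvalue: set the position that \G will anchor to.
pos($html) = $offset;

my $tag;
# /g makes the match start at pos(); /c keeps pos() on failure.
if ($html =~ /\G<\s*(\w+)[^>]*>/gc) {
    $tag = $1;    # name of the tag that begins exactly at $offset
}

# pos() now sits just past the matched tag.
my $after = pos($html);
```

No substring is extracted or copied here; the match is made against the original string in place.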
Matching the preceding string first isn't an option because it's the ad that is important. Anything could precede it.

Thank you, o wise monks. The program in question is FilterProxy.

Replies are listed 'Best First'.
RE: Backwards searching with regexps by chromatic (Archbishop) on Mar 22, 2000 at 23:06 UTC
If it were up to me, I would immediately move to HTML::Parser instead of trying to craft a regexp by hand. You'll keep your hair longer. HTML::TokeParser is another good option. Pragmatically, trying to match balanced tags (as HTML is supposed to be) with a regular expression is an exercise in futility for any data which you don't generate yourself from another program. There are just too many corner cases which pop up surprisingly often.
