Addressing all comments...

I am using this for a program that identifies and later removes banner ads from HTML. The first part identifies an ad using <samp>/<\s*a [^>]*href\s*=[^>]*>.*?<\s*\/a\s*></samp>. From there I wish to suck in the tags that surround the ad. For example, consider:

<center><a href=...ad...><img src=...ad...></a></center>
I want to consider the <center> tag to be part of the ad. In fact, I want to suck in any surrounding tag, including table elements, tables, etc., while ignoring whitespace, comments, and certain other tags (like <p>, <br>, etc.). The closing tag is easy to identify; it's the preceding one that's difficult.
  1. Lookbehind assertions won't work because Perl only implements fixed-width lookbehinds.
  2. Reversing the string would work: do two separate matches, one on the original string and one on the reversed string, keeping track of positions using m//g. But this requires writing regexps backwards. This makes my brain hurt.
  3. I have tried the construction m/.{$pos}/g to mark the position $pos in a string, and discovered that code using this construction is approx. 4 times slower than code using \G. One way to do this would be to use m/^.{$pos}.../g and m/....{$pos}$/g to fix a distance from the beginning and the end of the string, respectively. But this requires the regex engine to examine every character up to that position (I think), which is slow. A better way would be something like the \G construction, but with the ability to mark arbitrary positions in the string (i.e. matching against a substring without actually having to extract and copy it).
  4. Matching the preceding string first isn't an option because it's the ad that is important. Anything could precede it.
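
For what it's worth, pos() is assignable in Perl, so \G can be anchored at an arbitrary offset without copying a substring. Below is a minimal sketch of one way to grow an ad match outward with that, using rindex to find the candidate opening tag instead of a lookbehind. This is a swapped-in technique, not the code FilterProxy uses; the sub name ad_with_wrappers and the sample input are made up for illustration, and the ad pattern is a simplified form of the one above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the ad plus any tag pairs that wrap it
# directly, or undef if no ad-like link is found.
sub ad_with_wrappers {
    my ($html) = @_;

    # Locate the ad itself (simplified version of the pattern in the post).
    $html =~ m{<\s*a\s[^>]*href\s*=[^>]*>.*?<\s*/a\s*>}is or return;
    my ($start, $end) = ($-[0], $+[0]);    # offsets of the whole match

    # Grow the span outward, one enclosing tag pair per iteration.
    while ($start > 0) {
        # Candidate opening tag: the last '<' before the current start.
        my $lt = rindex($html, '<', $start - 1);
        last if $lt < 0;

        # Anchor \G at that offset; no substring is extracted.
        pos($html) = $lt;
        last unless $html =~ m{\G<\s*(\w+)[^>]*>\s*}gc;
        my ($tag, $open_end) = ($1, pos($html));
        last unless $open_end == $start;   # tag must end exactly where the span begins

        # The matching closing tag must follow the span directly.
        pos($html) = $end;
        last unless $html =~ m{\G\s*<\s*/\s*\Q$tag\E\s*>}gci;

        ($start, $end) = ($lt, pos($html));
    }
    return substr($html, $start, $end - $start);
}

# Made-up sample input.
my $page = 'text <center><a href="http://ads.example/x"><img src="ad.gif"></a></center> more';
print ad_with_wrappers($page), "\n";
```

The sketch only skips whitespace between a wrapper and the ad. Skipping comments and tags such as <p> or <br> would need extra alternatives in both \G patterns, and a '<' inside an attribute value or a comment would confuse the rindex step.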

Thank you, o wise monks. The program in question is FilterProxy.


In reply to RE: Re: Backwards searching with regexps by Anonymous Monk
in thread Backwards searching with regexps by Anonymous Monk
