Addressing all comments...
I am using this for a program which identifies and later
removes banner ads from html. The first part identifies an
ad using <samp>/<\s*a[^>]*href\s*=[^>]*>.*?<\s*\/a\s*>/</samp>.
From there I wish to suck in tags which surround the ad. For
example, consider:
<center><a href=...ad...><img src=...ad...></a></center>
I want to consider the <center> tag to be part of the
ad. In fact, I want to suck in any surrounding tag,
including table elements, tables, etc., while ignoring whitespace,
comments, and certain other tags (like <p>, <br>, etc.).
The closing tag is easy to identify; it's the preceding one
that's difficult.
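To make the easy half concrete, here is a minimal sketch of matching the ad and then picking up the closing tag that follows it. The sample HTML, the variable names, and the /i and /s modifiers are assumptions made for illustration; the ad pattern is written with [^>]* character classes.

```perl
use strict;
use warnings;

# Hypothetical sample input; the URLs are placeholders.
my $html = '<p>news</p><center><a href="http://ads.example/click">'
         . '<img src="http://ads.example/banner.gif"></a></center><p>more</p>';

# The ad pattern; /i and /s are assumptions (tag case, ads spanning lines).
my $ad_re = qr/<\s*a[^>]*href\s*=[^>]*>.*?<\s*\/a\s*>/is;

$html =~ /$ad_re/g or die "no ad found\n";
my ($ad_start, $ad_end) = ($-[0], $+[0]);   # offsets of the whole match

# Continue from the end of the ad: \G anchors at pos($html).
pos($html) = $ad_end;
my ($close_tag, $close_end) = ('', $ad_end);
if ($html =~ /\G\s*<\s*\/\s*(\w+)\s*>/gc) {
    ($close_tag, $close_end) = (lc $1, pos($html));
}

print "closing tag: $close_tag\n";
print substr($html, $ad_start, $close_end - $ad_start), "\n";
```

Going forward works because \G picks up where the ad match ended. The opening tag in front of the ad has no equally direct anchor, which is the problem described next.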
- Lookbehind assertions won't work because Perl only implements
fixed-width lookbehinds.
- Reversing the string would work: do two separate
matches, one on the original string and one on the reversed
string, keeping track of positions using m//g. But this
requires writing regexps backwards. This makes my
brain hurt.
- I have tried the construction m/.{$pos}/g to mark the position
$pos in a string, and discovered that code using this construction
is approx. 4 times slower than code using \G. One way to do
this would be to use m/^.{$pos}.../g and m/....{$pos}/g to
identify a distance away from the beginning and end of the
string, respectively. But this requires the regex engine
to examine each and every character up to that position (I
think), which is slow. A better way to do this would be
to use something like the \G construction, but with the
ability to mark arbitrary positions in the string. (i.e.
matching to a substring without having to actually extract
and copy the substring)
- Matching the preceding string first isn't an option
because it's the ad that is important. Anything could
precede it.
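One possible direction, offered as a sketch rather than a tested answer: pos() is assignable in Perl, so \G can be made to anchor at any offset without extracting or copying a substring. Combined with rindex to step back to the nearest '<', this finds an opening tag that ends exactly where the ad begins. The helper name preceding_tag, the sample HTML, and the use of index as a stand-in for the ad match offset are all hypothetical.

```perl
use strict;
use warnings;

# Return (start offset, tag name) of an opening tag that ends exactly at
# $end, allowing whitespace in between, or an empty list if there is none.
# preceding_tag() is a hypothetical helper name, not part of FilterProxy.
sub preceding_tag {
    my ($html_ref, $end) = @_;
    return if $end <= 0;
    # rindex scans backwards for the nearest '<' before $end.
    my $lt = rindex($$html_ref, '<', $end - 1);
    return if $lt < 0;
    pos($$html_ref) = $lt;          # \G will anchor here; nothing is copied
    if ($$html_ref =~ /\G<\s*(\w+)[^>]*>\s*/gc && pos($$html_ref) == $end) {
        return ($lt, lc $1);
    }
    return;
}

my $html = "<table><tr><td> <center>\n<a href=x>ad</a></center></td></tr></table>";
my $start = index($html, '<a ');    # stand-in for the ad match offset

# Keep absorbing enclosing opening tags until one does not abut the ad.
my @tags;
while (my ($pos, $name) = preceding_tag(\$html, $start)) {
    push @tags, $name;
    $start = $pos;
}
print "absorbed: @tags, ad region now starts at offset $start\n";
```

This sketch only skips whitespace between tags. Skipping comments and tags such as <p> or <br> would need extra alternatives in the pattern, and a '>' inside an attribute value would confuse it.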
Thank you, o wise monks. The program in question is
FilterProxy.