Does anyone know the best way to extract all links of the following format from an HTML page?
<a href = ...><img src= ...></img></a>
Essentially, I want all links that have an image as the "text" of the link. The crucial thing is that I need to know the src location of that image. All the existing CPAN modules I've tried seem to toss that data away when they "textify" the link (the important img data inside the link is replaced with a useless "[IMG]" tag).
I've experimented with everything from WWW::Mechanize to HTML::LinkExtor to HTML::Tree to HTML::Parser to at least a dozen other things. I can't get anything to work right that doesn't textify first.
I really didn't want to write a homegrown regex, but I'm running out of options.
Does anyone know the best way to do this?
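For illustration, here is a minimal sketch of the extraction described above: walk the parsed tree instead of textifying it, and for every a element that contains an img, record the link's href together with the image's src. It assumes HTML::TreeBuilder (from the HTML::Tree distribution) is installed. The sample markup and the variable names are made up for the example, and this is one possible approach, not necessarily the best one.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input; in practice this would be the fetched page.
my $html = <<'HTML';
<html><body>
<a href="/page1"><img src="/img/one.png"></a>
<a href="/page2">plain text link</a>
<a href="/page3"><img src="/img/three.gif" alt="three"></a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Collect [href, src] pairs for every <a> that contains an <img>.
my @pairs;
for my $link ($tree->look_down(_tag => 'a')) {
    my $href = $link->attr('href');
    next unless defined $href;

    # look_down searches the element's descendants, so nested images are found too.
    for my $img ($link->look_down(_tag => 'img')) {
        my $src = $img->attr('src');
        push @pairs, [$href, $src] if defined $src;
    }
}

print "$_->[0] -> $_->[1]\n" for @pairs;

$tree->delete;    # release the parse tree
```

Links whose content is plain text are skipped because the inner loop finds no img element. Relative URLs are reported as written in the page; resolving them against a base URL would be a separate step.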