I have written some code that uses several modules to extract all image links from a set of URLs (taken from the Google API) and then convert those links to absolute URLs.
The code already exists here at the office in perfect working order, but it is written in Python, and for several reasons it now needs to be ported to Perl.
I have written the Perl version and it works in about 90% of cases; however, in certain cases the absolute URLs come back with folders cut out and the 'www' prefix chopped off.
An example of a page that does not work is:
http://www.red11.org/mufc/images/player/beckham/
The code should bring back an image link like this:
http://www.red11.org/mufc/images/player/beckham/becksh98.jpg
but instead it brings it back like this:
http://red11.org/mufc/becksh98.jpg
The code I am using is:
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;    # provides url()

# Loop through each URL (@urlset is filled elsewhere)
foreach my $row (@urlset) {
    parsedocument($row);
}

sub parsedocument {
    my ($url) = @_;
    print "$url<br>";

    my $ua = LWP::UserAgent->new;
    $ua->env_proxy();

    # Set up a callback that collects link attribute values
    my @imgs = ();
    my $callback = sub {
        my ($tag, %attr) = @_;
        return if $tag ne 'a';    # we only look closer at <a ...> tags
        push(@imgs, values %attr);
    };
    my $p = HTML::LinkExtor->new($callback);

    # Request the document and parse it as it arrives
    my $res = $ua->request(HTTP::Request->new(GET => $url),
                           sub { $p->parse($_[0]) });

    # Expand all image URLs to absolute ones
    my $base = $res->base;
    @imgs = map { url($_, $base)->abs } @imgs;

    # Print only the links that end in jpg
    foreach my $row (@imgs) {
        if ($row =~ /jpg$/) {
            print "$row<BR>";
        }
    }
}
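To narrow things down, here is a minimal sketch, separate from the real code, of how a relative link resolves against two different bases. It assumes the page links to the image with the plain relative href becksh98.jpg, which I have not confirmed, and it uses URI->new_abs rather than the url(...)->abs call above. The second base, http://red11.org/mufc/, is hypothetical: I picked it only because it reproduces the wrong result.

```perl
use strict;
use warnings;
use URI;

# Assumption: the page links to the image with a plain relative href.
my $href = 'becksh98.jpg';

# The URL that was requested, and a hypothetical base that the response
# might report instead (for example via a <base href> tag in the page).
my $requested_base = 'http://www.red11.org/mufc/images/player/beckham/';
my $other_base     = 'http://red11.org/mufc/';    # hypothetical

# Resolve a relative link against a base URL and return it as a string.
sub resolve {
    my ($link, $base) = @_;
    return URI->new_abs($link, $base)->as_string;
}

print resolve($href, $requested_base), "\n";
# prints http://www.red11.org/mufc/images/player/beckham/becksh98.jpg
print resolve($href, $other_base), "\n";
# prints http://red11.org/mufc/becksh98.jpg
```

If something like this is going on, printing $res->base next to $url inside parsedocument should show that the two differ for the failing page.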
Thanks in advance.