costas has asked for the wisdom of the Perl Monks concerning the following question:

I have written some code that uses several modules to extract all image links from a set of URLs (taken from the Google API) and then convert those image links to absolute URLs.

The code already exists here at the office in perfect working order, but it is written in Python and, for several reasons, now needs to be ported to Perl.

I have written the Perl version and it works in about 90% of cases. In certain cases, however, the absolute URLs have had folders cut out and the 'www' prefix chopped off.

An example of a page that does not work is:

http://www.red11.org/mufc/images/player/beckham/

The code should bring back an image link like this:
http://www.red11.org/mufc/images/player/beckham/becksh98.jpg

but instead it brings it back like this:
http://red11.org/mufc/becksh98.jpg

The code I am using is:

# loop through each url
foreach my $row (@urlset) {
    parsedocument($row);
}

sub parsedocument {
    my ($url) = @_;
    print "$url<br>";

    my $ua = LWP::UserAgent->new;
    $ua->env_proxy();

    # Set up a callback that collect image links
    my @imgs = ();
    my $callback = sub {
        my($tag, %attr) = @_;
        return if $tag ne 'a';  # we only look closer at <img ...>
        push(@imgs, values %attr);
    };

    my $p = HTML::LinkExtor->new($callback);

    # Request document and parse it as it arrives
    my $res = $ua->request(HTTP::Request->new(GET => $url),
                           sub {$p->parse($_[0])});

    # Expand all image URLs to absolute ones
    my $base = $res->base;
    @imgs = map { $_ = url($_, $base)->abs; } @imgs;

    foreach my $row (@imgs) {
        if ($row =~/jpg$/) {
            print "$row<BR>";
        }
    }
}

Thanks in advance.

Re: retrieving the absolute url
by jeffenstein (Hermit) on May 21, 2002 at 11:25 UTC

    First, the comments in this block say one thing, but the code says something different:

    # Set up a callback that collect image links
    my @imgs = ();
    my $callback = sub {
        my($tag, %attr) = @_;
        return if $tag ne 'a';  # we only look closer at <img ...>
        push(@imgs, values %attr);
    };

    You probably want to change this to:

    return if $tag ne 'img';
    push(@imgs, $attr{'src'});

    or

    return if $tag ne 'a';
    push(@imgs, $attr{'href'});

    Otherwise, if there are other attributes in the tag (alt=, border=) then you may end up getting their values instead.
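As an illustration of the suggestion above, here is a minimal sketch of a callback that picks out one named attribute per tag instead of pushing every value. The inline HTML and the variable names @img_srcs and @hrefs are made up for the example; it parses a string rather than fetching a live page, and assumes HTML::LinkExtor is installed.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample page, used here instead of a live fetch.
my $html = <<'HTML';
<html><body>
<a href="becksh98.jpg"><img src="thumbs/becksh98.jpg" alt="Beckham" border="0"></a>
<a href="index.html">Back</a>
</body></html>
HTML

my (@img_srcs, @hrefs);

# Collect only the attribute we care about for each tag type.
my $callback = sub {
    my ($tag, %attr) = @_;
    if ($tag eq 'img' && defined $attr{src}) {
        push @img_srcs, $attr{src};
    }
    elsif ($tag eq 'a' && defined $attr{href}) {
        push @hrefs, $attr{href};
    }
};

# No base URL is passed, so the values stay as the raw strings from the page.
my $p = HTML::LinkExtor->new($callback);
$p->parse($html);
$p->eof;

print "img src: $_\n" for @img_srcs;
print "a href:  $_\n" for @hrefs;
```

Keeping the two lists separate makes it easy to decide later whether the images you want are the inline <img> sources or the files the <a> tags point to.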

    Also, the abs() method of URI::URL returns a URI::URL object rather than a plain string, so you'll need to change this line:

    @imgs = map { $_ = url($_, $base)->abs; } @imgs;
    to this:
    @imgs = map { $_ = url($_, $base)->abs->as_string; } @imgs;

    This gives you a plain string that can be used in place of the object. For some reason the original only fails occasionally, so this may be the problem you are running into.
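The difference between the object and the string can be checked without any network access. The following is a small sketch, assuming the URI distribution (which provides URI::URL) is installed; the base URL and file name are the ones from the question, and the variable names are chosen only for the example.

```perl
use strict;
use warnings;
use URI::URL;

# Base URL from the question; note the trailing slash.
my $base = 'http://www.red11.org/mufc/images/player/beckham/';

# abs() returns an object; as_string turns it into a plain scalar.
my $abs = url('becksh98.jpg', $base)->abs;
my $str = $abs->as_string;

print "abs() returned a ", (ref($abs) ? "reference (" . ref($abs) . ")" : "plain string"), "\n";
print "as_string gave: $str\n";
```

With this base, the relative name resolves to the full path the question expects, so it is worth printing $res->base in the real script to see which base the failing page actually reports.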