damian has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow monks, a web grabber is useful when you are grabbing HTML pages that have no images on them. Is it possible to grab the images as well? Thanks.

Re: Image grabber remotely hosted
by gregorovius (Friar) on Sep 04, 2000 at 10:58 UTC
    This is from the HTML::LinkExtor documentation:
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI::URL;

    $url = "http://www.sn.no/";  # for instance
    $ua = new LWP::UserAgent;

    # Set up a callback that collects image links
    my @imgs = ();
    sub callback {
       my($tag, %attr) = @_;
       return if $tag ne 'img';  # we only look closer at <img ...>
       push(@imgs, values %attr);
    }

    # Make the parser.  Unfortunately, we don't know the base yet
    # (it might be different from $url)
    $p = HTML::LinkExtor->new(\&callback);

    # Request document and parse it as it arrives
    $res = $ua->request(HTTP::Request->new(GET => $url),
                        sub {$p->parse($_[0])});

    # Expand all image URLs to absolute ones
    my $base = $res->base;
    @imgs = map { $_ = url($_, $base)->abs; } @imgs;

    # Print them out
    print join("\n", @imgs), "\n";
    Now, instead of printing those URLs, just fetch them using LWP.
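    A minimal sketch of that last step, assuming LWP::Simple is installed. The helper name save_all, the numbered img000-style file names and the target directory are hypothetical choices for illustration, not part of the HTML::LinkExtor documentation. It stores each URL with getstore and returns the URLs whose status code was not a success:

```perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# Hypothetical helper: save each URL into $dir under a numbered name
# (img000, img001, ...) and return the URLs that could not be fetched.
sub save_all {
    my ($dir, @urls) = @_;
    my @failed;
    my $i = '000';    # a string, so ++ keeps the leading zeros
    for my $url (@urls) {
        # getstore returns the HTTP status code of the response
        my $status = getstore($url, "$dir/img$i");
        push @failed, $url unless is_success($status);
        ++$i;
    }
    return @failed;
}

# Example call (not run here), using the @imgs list built above:
#   my @failed = save_all('.', @imgs);
#   warn "could not fetch $_\n" for @failed;
```

    Returning the failures rather than dying on the first one lets the caller decide whether to retry or just report them.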
      What if I don't want to use modules, as in pure Perl? Can this be done?

        Sure it can be done. But why bother when the modules are there to make your life easier?

        LWP::Simple, HTML::Parser (version 2) and HTML::LinkExtor are all pure-Perl modules, so if you want to know how to do it, just look at the source for these modules.

        --
        <http://www.dave.org.uk>

        European Perl Conference - Sept 22/24 2000, ICA, London
        <http://www.yapc.org/Europe/>
Re: Image grabber remotely hosted
by davorg (Chancellor) on Sep 04, 2000 at 10:29 UTC

    Yes. You would grab the original page (using one of the LWP modules), parse the file using HTML::LinkExtor to find the image links, and then send off a separate request for each link found.

      Hi davorg, you mean I will be able to download even the images and save them on my server? Thanks.

        Yep. It would be something like this (untested code):

        use strict;
        use LWP::Simple;
        use HTML::LinkExtor;

        my $url  = 'http://www.example.com/index.html';
        my $file = 'index.html';

        # Save the page locally, then parse the saved copy.
        getstore($url, $file);

        my $p = HTML::LinkExtor->new;
        $p->parse_file($file);

        my $i = '000';
        foreach ($p->links) {
          # Each link is [ $tag, attr => url, ... ]
          next unless $_->[0] eq 'img';
          shift @$_;
          my %attrs = @$_;
          getstore($attrs{src}, "img$i");
          ++$i;
        }

        Actually, thinking about it, that's not quite right as the image URLs that you'll get back in $attrs{src} will be relative to the main page so you might need to munge them a bit to get the absolute URL. Also by parsing the URL you could probably get a better image filename than the 'img00X' names that I'm using.
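        A rough sketch of both fixes, assuming the URI module (which LWP itself depends on) is available. The helper names absolute_url and local_name are made up for illustration, and the example.com URLs are placeholders. The first helper resolves a possibly relative src against the page URL; the second uses the last path segment as the local file name and falls back to a numbered name when there is none:

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve a (possibly relative) src value
# against the URL of the page it was found on.
sub absolute_url {
    my ($src, $page_url) = @_;
    return URI->new_abs($src, $page_url)->as_string;
}

# Hypothetical helper: take the last path segment of the URL as the
# local file name, or $fallback if the path ends in "/" or is empty.
sub local_name {
    my ($abs_url, $fallback) = @_;
    my @segments = URI->new($abs_url)->path_segments;
    my $name = @segments ? "$segments[-1]" : '';
    return length $name ? $name : $fallback;
}

my $page = 'http://www.example.com/docs/index.html';
my $abs  = absolute_url('../images/logo.gif', $page);
print "$abs\n";
print local_name($abs, 'img000'), "\n";
```

        In the loop above you would then call getstore(absolute_url($attrs{src}, $url), local_name(...)) instead of passing $attrs{src} directly. Note that two images with the same base name in different directories would overwrite each other with this naming scheme.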

Re: Image grabber remotely hosted
by merlyn (Sage) on Sep 04, 2000 at 18:06 UTC
    Surprise, surprise: there's a module named Image::Grab that gives you all sorts of flexibility about grabbing images. You're not the first person who has wanted to do this, apparently. {grin}

    -- Randal L. Schwartz, Perl hacker

Re: Image grabber remotely hosted
by SuperCruncher (Pilgrim) on Sep 04, 2000 at 22:30 UTC
    There is a very useful example LWP script in the llama book which shows how to get a list of all the URLs in a page and download each URL. If you want to save them to a file rather than getting them in a variable, you can use getstore (is this in LWP::Simple, anyone?). getstore is trivial to use: just do getstore($url, $local_file). I used the example code in the llama book as the base of one of my scripts; it was nearly worth the price of the book in itself (note to O'Reilly though: please don't jack the prices up any more!).

    What if I don't want to use modules, as in pure Perl? Can this be done?
    As far as I know, LWP is written in Perl, so your script would be "pure Perl" ::grin::.