cafaro has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm writing an automated Perl script that collects images for a school project by downloading them. These images are generated by PHP. My problem is that when I download an image, Linux (Ubuntu) doesn't recognize it as an image (whatever file extension I try) but as a plain text file. I've tried downloading other PHP-generated images, and those downloaded and displayed fine. So I think the cause of my issue is the rather strange URL of the image: it carries an HTTP GET query string but names no script or file after the directory. However, when I download these images manually with Firefox, it works. Another odd thing: the copy saved by Firefox has no file extension, yet when I view the properties of that file, its type is reported as JPEG. The image URL is as follows:
http://www.site.com/images/?id=345435
The Perl modules that I use:
WWW::Mechanize;  # to browse through the site
LWP::UserAgent;  # for downloading the images
The actual Perl code I use for downloading the images:
# load modules
use WWW::Mechanize;
use LWP::UserAgent;

# create new sessions
$mechanize = WWW::Mechanize->new(autocheck => 1); # "autocheck => 1" will show possible errors
$useragent = LWP::UserAgent->new;

# define useragent
$agent = "Mozilla/5.0";
$mechanize->agent($useragent);
$useragent->agent($agent);

# define the url
$mechanize->get("http://www.site.com/images/");

# fetch the content
$content = $mechanize->content();

# get the image url by parsing the content with regular expressions
$content =~ /<tr><td><img src="(.+)" alt="php-image" \/><\/td><\/tr>/; # the url will be extracted to $1

# download the image
$time = time();
$useragent->mirror($1, "/home/cafaro/images/$time.jpg");

# provide output
print "The image ($time.jpg) has been saved.\n";
I hope I gave enough information. Cheers, cafaro

Re: Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules
by dwm042 (Priest) on Aug 27, 2007 at 18:09 UTC
    This is untested code, but I use something similar at home when I scrape images:

    #!/usr/bin/perl
    use warnings;
    use strict;
    #
    # warning: untested code.
    #
    package main;
    use WWW::Mechanize;

    my $mechanize = WWW::Mechanize->new(autocheck => 1);

    # define useragent
    my $agent = "Mozilla/5.0";
    $mechanize->agent($agent);

    # get the url
    $mechanize->get("http://www.site.com/images/");

    # get the images
    my @images = $mechanize->images();
    my $count  = 0;
    foreach my $img (@images) {
        # time() alone repeats within one second, so add a counter
        my $filename = time() . "-" . $count++ . ".jpg";
        my $mech2 = WWW::Mechanize->new(autocheck => 1);
        #
        # Save the images
        # Use the fact that WWW::Mechanize objects are
        # subclassed LWP::UserAgent objects.
        #
        # please see:
        # http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize.pm
        # and look for $mech->get
        #
        $mech2->get($img->url_abs(), ":content_file" => $filename);
    }
      Ok, thanks. $1 showed the correct URL when I printed it. What I forgot to mention is that the file downloaded with Perl has a size of 0 KB. When I open the image (from my hard drive) in my browser, it just shows the path of the file. And what exactly do you mean by "checksum (md5/sha1)"? Thanks, cafaro
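A 0 KB file suggests the request itself may have failed, or that the server sent something other than an image. A rough sketch of checking that, assuming the HTTP::Response object that LWP hands back from a request. The helper name diagnose_response is made up for illustration, and the two responses are built locally as stand-ins for real downloads:

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: returns a short diagnosis for a response object,
# such as the one returned by $ua->get().
sub diagnose_response {
    my ($response) = @_;
    return "request failed: " . $response->status_line
        unless $response->is_success;
    my $type = $response->content_type;
    return "not an image: server sent '$type'"
        unless $type =~ m{^image/};
    return "ok: $type, " . length($response->content) . " bytes";
}

# Locally built responses stand in for real downloads (assumed values).
my $good = HTTP::Response->new(200, "OK",
    [ "Content-Type" => "image/jpeg" ], "\xFF\xD8\xFF\xE0");
my $bad = HTTP::Response->new(404, "Not Found",
    [ "Content-Type" => "text/html" ], "<html></html>");

print diagnose_response($good), "\n";
print diagnose_response($bad),  "\n";
```

In a real script the same helper could be handed the return value of $mechanize->get($url) with autocheck turned off, so that a failure is reported instead of raised.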
        You need the full url for $ua->mirror(). If the urls in the HTML source are relative, your regex won't create a full url to mirror.

        You should probably use WWW::Mechanize's find_image() or find_all_images() method:

        my $n = 0;
        for my $img ($www->find_all_images()) {
            # mirror() needs a target file name as its second argument
            $www->mirror($img->url_abs(), "image" . $n++ . ".jpg");
        }
        update: you probably also don't need both an LWP::UserAgent and a WWW::Mechanize object, since WWW::Mechanize is a subclass of LWP::UserAgent. In fact, chances are, it will work better with just one WWW::Mechanize object.
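If you do stay with a regex, the relative-URL point above can be handled with the URI module. A small sketch, assuming the page URL from the question as the base; the two src values are guesses at what the HTML might contain:

```perl
use strict;
use warnings;
use URI;

# Page the image tag was found on (taken from the question).
my $base = "http://www.site.com/images/";

# Two forms a src attribute might take; both are assumptions.
for my $src ("?id=345435", "/images/?id=345435") {
    # new_abs() resolves a relative reference against the base URL.
    my $abs = URI->new_abs($src, $base);
    print "$src -> $abs\n";
}
```

The resulting absolute URL is what mirror() or get() should be given.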

Re: Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules
by moritz (Cardinal) on Aug 27, 2007 at 18:05 UTC
    My problem is that when I downloaded an image, Linux (Ubuntu distr.) doesn't recognizes it as an image

    Linux is not an image viewer - how exactly did you try to open it? What's the output of file $image.jpg (use a real image name)?
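If the file command isn't handy, roughly the same check can be done from Perl. A minimal sketch that only tests for the JPEG start marker (bytes FF D8 FF); looks_like_jpeg is a hypothetical helper name, and the file to check is passed on the command line:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the file starts with the JPEG marker FF D8 FF.
sub looks_like_jpeg {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return 0;
    my $read = read($fh, my $magic, 3);
    close $fh;
    return defined $read && $read == 3 && $magic eq "\xFF\xD8\xFF";
}

# Usage: perl check.pl downloaded_file
my $file = shift @ARGV;
if (defined $file) {
    print "$file: ", (looks_like_jpeg($file) ? "JPEG" : "not a JPEG"), "\n";
}
```

An empty file or an HTML error page fails this test, which would match the "plain text" symptom.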

    My guess is that your regex doesn't work as you expect it to, $1 contains an invalid URL and you get an error page mirrored.

    Did you try to print the matched URLs to STDOUT?

    And try this regex:

    $content =~ m{<tr><td><img src="([^"]+)" alt="php-image" /></td></tr>}
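To see why the character class matters, here is a small sketch comparing the greedy (.+) from the question with ([^"]+). The markup is made up: two image rows on a single line, which is an assumption about what the page might look like:

```perl
use strict;
use warnings;

# Made-up markup with two image rows on one line.
my $content = '<tr><td><img src="?id=1" alt="php-image" /></td></tr>'
            . '<tr><td><img src="?id=2" alt="php-image" /></td></tr>';

# Greedy .+ runs on to the last closing sequence on the line.
my ($greedy) = $content =~ m{<tr><td><img src="(.+)" alt="php-image" /></td></tr>};

# [^"]+ cannot cross the closing quote of the src attribute.
my ($tight) = $content =~ m{<tr><td><img src="([^"]+)" alt="php-image" /></td></tr>};

print "greedy: $greedy\n";
print "tight:  $tight\n";
```

With the greedy pattern the capture swallows markup from the first row through the second src value, so it is not a usable URL.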

Re: Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules
by EvanCarroll (Chaplain) on Aug 27, 2007 at 18:04 UTC
    Don't parse HTML with regexes

    Make sure the image is in fact an image, and not the garbled binary of some CGI wrapper. Open it with the original browser that you viewed it with; if it works, you should know what to do. If it doesn't, checksum (md5/sha1) a copy you saved from your browser and a copy that you retrieved with Perl.

    My bet is they are off.
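For anyone unsure what checksumming means here: compute a digest of each file and compare the two. A minimal sketch using the core Digest::MD5 module; md5_of_file is a hypothetical helper, and the two file paths are supplied on the command line:

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: hex MD5 digest of a file's raw bytes.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Usage: perl compare.pl copy_from_browser copy_from_perl
if (@ARGV == 2) {
    my ($from_browser, $from_perl) = map { md5_of_file($_) } @ARGV;
    print "browser copy: $from_browser\n";
    print "perl copy:    $from_perl\n";
    print $from_browser eq $from_perl ? "identical\n" : "different\n";
}
```

Identical digests mean both downloads contain the same bytes, so any remaining problem would lie in how the file is opened rather than how it was fetched.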



    Evan Carroll
    www.EvanCarroll.com
Re: Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules
by bigmacbear (Monk) on Aug 27, 2007 at 20:30 UTC

    The URL you cite is only possible if the server is set to provide a script of some sort (you seem to have discovered it's a PHP script already) when the directory http://www.site.com/images/ is requested, and the ?id=345435 part is passed as an argument to the script.

    I don't think it's possible to enumerate the directory in this case; you need to know the ID of every image you want, and the server may be set up so that the ID is different every time that image is requested. It's a sneaky way to attempt to prevent people from grabbing copyrighted pictures off a web site or linking to them from their own sites.

Re: Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules
by mohanprabu2 (Initiate) on Nov 27, 2009 at 06:05 UTC
    thanks..