Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules

cafaro has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm writing a Perl script that uses WWW::Mechanize to collect images for a school project by downloading them. These images are generated by PHP. My problem is that when I download an image, Linux (Ubuntu) doesn't recognize it as an image (regardless of which file extension I give it), but as a plain text file. I've tried downloading other PHP-generated images, and those downloaded and displayed fine. So I think the cause of my issue is the rather strange URL of the image: it consists of a GET query string, but with no file name in the path. However, when I download these images manually with Firefox, it works. Another odd thing is that the image saved by Firefox has no file extension, yet when I view the properties of the file, its type is shown as JPEG. The image URL is as follows:

        http://www.site.com/images/?id=345435

The Perl modules that I use:

    WWW::Mechanize; # to browse through the site
    LWP::UserAgent; # for downloading the images

The actual Perl code I use for downloading the images:

    # load modules
    use WWW::Mechanize;
    use LWP::UserAgent;
    
    # create new sessions
    $mechanize = WWW::Mechanize->new(autocheck => 1); # "autocheck => 1" will show possible errors
    $useragent = LWP::UserAgent->new;
    
    # define useragent
    $agent = "Mozilla/5.0";
    $mechanize->agent($agent);
    $useragent->agent($agent);

    # define the url
    $mechanize->get("http://www.site.com/images/");
    
    # fetch the content
    $content = $mechanize->content();
    
    # get the image url by parsing the content with regular expressions
    $content =~ /<tr><td><img src="(.+)" alt="php-image" \/><\/td><\/tr>/; # the url will be extracted to $1

    # download the image
    $time = time();
    $useragent->mirror($1, "/home/cafaro/images/$time.jpg");
    
    # provide output
    print "The image ($time.jpg) has been saved.\n";

I hope I gave enough information. Cheers, cafaro

Re: Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules by dwm042 (Priest) on Aug 27, 2007 at 18:09 UTC
This is untested code, but I use something similar at home when I scrape images:

    #!/usr/bin/perl
    use warnings;
    use strict;

    #
    # warning: untested code.
    #
    package main;
    use WWW::Mechanize;

    my $mechanize = WWW::Mechanize->new(autocheck => 1);

    # define useragent
    my $agent = "Mozilla/5.0";
    $mechanize->agent($agent);

    # get the url
    $mechanize->get("http://www.site.com/images/");

    # get the images
    my @images = $mechanize->images();
    foreach my $img (@images) {
        my $time     = time();
        my $filename = $time . ".jpg";
        my $mech2    = WWW::Mechanize->new(autocheck => 1);

        # Save the images.
        # Use the fact that WWW::Mechanize is a subclass of LWP::UserAgent.
        #
        # please see:
        # http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize.pm
        # and look for $mech->get
        $mech2->get($img->url(), ":content_file" => $filename);
    }
Re^2: Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules by Anonymous Monk on Aug 27, 2007 at 19:52 UTC
OK, thanks. $1 showed the correct URL when I printed it. What I forgot to mention is that the file downloaded with Perl has a size of 0 KB. When I open the image (on my hard drive) in my browser, it just shows the path of the file. And what exactly do you mean by "checksum (md5/sha1)"? Thanks, cafaro
Re^3: Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules by Joost (Canon) on Aug 27, 2007 at 21:17 UTC
You need the full URL for $ua->mirror(). If the URLs in the HTML source are relative, your regex won't produce a full URL to mirror. You should probably use WWW::Mechanize's find_image() or find_all_images() method:

    my $n = 0;
    for my $img ($www->find_all_images()) {
        # mirror() also needs a local file name; this one is only a placeholder
        $www->mirror($img->url_abs(), "image-" . ++$n . ".jpg");
    }

Update: you probably also don't need both an LWP::UserAgent and a WWW::Mechanize object, since WWW::Mechanize is a subclass of LWP::UserAgent. In fact, chances are it will work better with just one WWW::Mechanize object.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
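To illustrate the relative-URL point, here is a minimal sketch of how a relative src attribute resolves against the page URL. The thread never shows the actual markup, so the value `?id=345435` as the attribute content is an assumption; the URI module used here is one that LWP itself depends on.

```perl
use strict;
use warnings;
use URI;

# Assumed values: the real page markup is not shown in the thread.
my $base = "http://www.site.com/images/";   # the page that was fetched
my $src  = "?id=345435";                    # hypothetical relative src attribute

# Resolve the relative reference against the page URL.
my $abs = URI->new_abs($src, $base)->as_string;

print "relative: $src\n";
print "absolute: $abs\n";
```

Passing the relative form straight to mirror() gives LWP no host to contact, which would be consistent with an empty file on disk. url_abs() on a WWW::Mechanize image object performs this kind of resolution for you.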
Re^4: Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules by Anonymous Monk on Nov 27, 2009 at 05:34 UTC
Re: Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules by moritz (Cardinal) on Aug 27, 2007 at 18:05 UTC
"My problem is that when I downloaded an image, Linux (Ubuntu distr.) doesn't recognizes it as an image"

Linux is not an image viewer. How exactly did you try to open it? What's the output of `file $image.jpg` (use a real image name)?

My guess is that your regex doesn't work as you expect it to, `$1` contains an invalid URL, and you get an error page mirrored. Did you try to print the matched URLs to STDOUT? And try this regex:

    $content =~ m{<tr><td><img src="([^"]+)" alt="php-image" /></td></tr>}
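The difference between the two patterns can be seen on a small sample. This is only a sketch: the markup below is hypothetical (two image cells on one line), since the real page source is not shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical markup: two image cells on a single line.
my $content = '<tr><td><img src="?id=1" alt="php-image" /></td></tr>'
            . '<tr><td><img src="?id=2" alt="php-image" /></td></tr>';

# Greedy .+ extends to the last place where the rest of the pattern still matches.
my ($greedy) = $content =~ m{<tr><td><img src="(.+)" alt="php-image" /></td></tr>};

# [^"]+ cannot cross the closing quote of the src attribute.
my ($strict) = $content =~ m{<tr><td><img src="([^"]+)" alt="php-image" /></td></tr>};

print "greedy: $greedy\n";
print "strict: $strict\n";
```

With this input the greedy capture swallows everything from the first src value through the start of the second image tag, while the `[^"]+` version stops at the first URL. On a page with only one image per line both patterns capture the same text, which may be why the original regex appeared to work.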
Re: Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules by EvanCarroll (Chaplain) on Aug 27, 2007 at 18:04 UTC
Don't parse HTML with regexes. Make sure the image is in fact an image, and not the garbled binary of some CGI wrapper. Open it with the original browser that you viewed it with; if it works, you should know what to do. If it doesn't, checksum (md5/sha1) a copy you saved from your browser, and then a copy that you retrieved with Perl. My bet is they are off.

Evan Carroll www.EvanCarroll.com
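One way to carry out both checks from Perl is sketched below, using the core Digest::MD5 module. The helper names md5_of_file and looks_like_jpeg are made up for this example, and the two file names are expected on the command line (one copy saved from the browser, one fetched by the script).

```perl
use strict;
use warnings;
use Digest::MD5;

# Return the MD5 hex digest of a file, read in binary mode.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# True if the file starts with the JPEG start-of-image bytes FF D8 FF.
sub looks_like_jpeg {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $read = read($fh, my $head, 3);
    close $fh;
    return defined $read && $read == 3 && $head eq "\xFF\xD8\xFF";
}

# Usage: perl compare.pl copy-from-browser copy-from-perl
my ($from_browser, $from_perl) = @ARGV;
if (defined $from_browser && defined $from_perl) {
    for my $file ($from_browser, $from_perl) {
        my $kind = looks_like_jpeg($file) ? "JPEG signature" : "not a JPEG";
        printf "%s  %s  (%s)\n", md5_of_file($file), $file, $kind;
    }
    my $same = md5_of_file($from_browser) eq md5_of_file($from_perl);
    print $same ? "identical\n" : "different\n";
}
```

A zero-byte download fails the signature check immediately, and differing digests confirm that the script did not receive the same bytes as the browser.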
Re: Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules by bigmacbear (Monk) on Aug 27, 2007 at 20:30 UTC
The URL you cite is only possible if the server is set to run a script of some sort (you seem to have discovered it's a PHP script already) when the directory `http://www.site.com/images/` is requested, and the `?id=345435` part is passed as an argument to the script. I don't think it's possible to enumerate the directory in this case; you need to know the ID of every image you want, and the server may be set up so that the ID is different every time that image is requested. It's a sneaky way to attempt to prevent people from grabbing copyrighted pictures off a web site or linking to them from their own sites.
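As a small illustration of how that URL splits into a directory path and a script argument, here is a sketch using the URI module (which LWP depends on). It only takes the URL apart locally and makes no request; how the server actually maps the directory to a script is not visible from the client side.

```perl
use strict;
use warnings;
use URI;

my $uri = URI->new("http://www.site.com/images/?id=345435");

my $path   = $uri->path;          # the part the server maps to a directory or script
my %params = $uri->query_form;    # the key/value pairs handed to that script

print "path: $path\n";
print "id:   $params{id}\n";
```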
Re: Downloading PHP-generated images using the LWP::UserAgent and WWW::Mechanize modules by mohanprabu2 (Initiate) on Nov 27, 2009 at 06:05 UTC
thanks..