I'm looking for testers for a quick script I threw together over the last two days, and I'm not sure how to improve it further. It's a basic downloading tool: nothing fancy, but useful for automating large batches of downloads when you need a quick-and-dirty way to grab tons of files without babysitting them. I built it out of necessity for my job, since I had to either write a script automating the download of over 2000 files, find an existing tool that worked for the task, or do it by hand. I couldn't find one that worked for me, so I learned more about LWP and WWW::Mechanize, asked a couple of questions of you great Monks, and got it working! It's currently downloading those 2000+ files onto one of our computers, so I thought I'd ask for criticism on how to improve the meager tool before releasing it into the wilderness of the internet.
It's available for viewing/download on my Dropbox here; feel free to download it and tell me what you think. Are there glaring issues with it? Are there features you'd want added if you were going to use this tool? Please be nice, but feel free to critique.
Edit: Here's the code, on the advice that I should post it directly here for people to view.
#!/usr/bin/perl -w
use strict;
use warnings;
use LWP::UserAgent;
use LWP::Simple;
use WWW::Mechanize;
use Digest::MD5 qw( md5_hex );

# Coded by Brendan Galvin from June 3rd 2013, to June 5th 2013.
# This product is open-source freeware, and credit to the original source
# must be given to Brendan Galvin upon re-distribution of the original
# source or any program, script or application made using the original
# source.
# http://www.linkedin.com/pub/brendan-galvin/26/267/94b

my $flag = 0;
print "\n\nURL for mass-downloading (only download links using the <a href> tag, not images or other embedded elements): ";
chomp(my $url = <STDIN>);
print "\nExtensions to download (seperated by comma's): ";
chomp(my $extensions = <STDIN>);
$extensions =~ s/[.]//g;
print "\nLocation to store downloaded files: ";
chomp(my $location = <STDIN>);
print "\nHow many downloads would you like to skip starting from the first (in case you started this download earlier and have already downloaded some of the files)? ";
chomp(my $skips = <STDIN>);
print "\nAre you going to want to skip any files while the program is running (y/n)? ";
chomp(my $skiporno = <STDIN>);

my $error     = "";
my @extension = split(',', $extensions);
my %extens    = map { $_ => 1 } @extension;

sub GetFileSize {
    my $url = shift;
    my $ua  = new LWP::UserAgent;
    $ua->agent("Mozilla/5.0");
    my $req = new HTTP::Request 'HEAD' => $url;
    $req->header('Accept' => 'text/html');
    my $res = $ua->request($req);
    if ($res->is_success) {
        my $headers = $res->headers;
        return $headers;
    } else {
        $flag = 1;
        $error .= "Error retrieving file information at $url ";
    }
    return 0;
}

my $mech = WWW::Mechanize->new();
$mech->get($url);
my $base  = $mech->base;
my @links = $mech->links();
for my $link (@links) {
    my $skip = 'n';
    if ($link->url() =~ m/([^.]+)$/) {
        my $ext = ($link->url() =~ m/([^.]+)$/)[0];
        if (exists($extens{$ext})) {
            my $newurl = $link->url();
            if ($newurl !~ /http::\/\/$/ig) {
                my $baseurl = URI->new_abs($newurl, $base);
                $newurl = $baseurl;
            }
            my $filename = $newurl;
            $filename =~ m/.*\/(.*)$/;
            $filename = $1;
            if ($skips > 0) {
                $skips -= 1;
                print "\n\nSkipped $filename at " . $link->url();
                next;
            } else {
                my $header  = GetFileSize($newurl);
                my $urlmech = WWW::Mechanize->new();
                $urlmech->show_progress("true");
                print "\n\n\n$filename at $newurl\n";
                print "File size: " . $header->content_length . " bytes\n" unless $flag == 1;
                print "Last modified: " . localtime($header->last_modified) . "\n" unless $flag == 1;
                if ($skiporno eq 'y') {
                    print "Skip file (y/n)? ";
                    chomp($skip = <STDIN>);
                }
                if ($skip ne 'y') {
                    print " downloading...\n";
                    my $response = $urlmech->get($newurl, ':content_file' => "$filename")->decoded_content;
                } else {
                    print "\nSkipping...\n\n";
                    next or print "Error skipping.\n";
                }
            }
        }
    }
}
print "\n\n\nTasks completed.\n";
if ($error ne "") {
    print "\nErrors: $error";
} else {
    print "No errors.\n";
}
Re: RFC: Script testers needed (ITYM code review)
by MidLifeXis (Monsignor) on Jun 06, 2013 at 13:13 UTC
It sounds like you are looking for a code review rather than a tester. If that is the case, you may get more responses with a different title.
Also, post the code here, within <readmore><code>...</code></readmore> tags. PerlMonks prefers posts to be fairly self-contained, so that the content stays useful long after external links die.
That being said, I would probably start your option gathering with Getopt::Long or something similar. Prompting the user is not necessarily a bad thing, but not allowing for automation is (imo). That is as far as I got at this point.
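For what it's worth, here is a minimal sketch of what that might look like with Getopt::Long. The option names are made up to mirror the script's interactive prompts, not part of the original script:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

# Hypothetical option names mirroring the script's prompts.
my %opt = ( skips => 0, interactive => 0 );
GetOptions(
    'url=s'        => \$opt{url},         # page to scan for <a href> links
    'extensions=s' => \$opt{extensions},  # comma-separated, e.g. "pdf,zip"
    'location=s'   => \$opt{location},    # directory to store downloads
    'skips=i'      => \$opt{skips},       # downloads to skip from the first
    'interactive!' => \$opt{interactive}, # prompt before each file
) or die "Usage: $0 --url URL --extensions LIST [--location DIR] [--skips N] [--interactive]\n";

# Same extension lookup the script already builds from STDIN input.
my %extens = map { $_ => 1 } split /,/, ($opt{extensions} // '');
```

With that in place the script can run unattended from cron or a batch file, while still allowing the prompts as a fallback when no options are given.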
I was actually planning on making this able to take arguments from the command line in a later version - but thanks for pointing out Getopt::Long! That looks like it's a really useful module.
I also took your advice and posted the code here on the post, for everyone to view within the Monastery. Hope that helps!
Re: RFC: Script testers needed
by Athanasius (Archbishop) on Jun 06, 2013 at 16:10 UTC
Re: RFC: Code testers/reviewers needed
by hbm (Hermit) on Jun 07, 2013 at 12:48 UTC
Some minor suggestions:
- In GetFileSize, change this:
    if ($res->is_success) {
        my $headers = $res->headers;
        return $headers;
    } else {
        $flag = 1;
        $error .= "Error retrieving file information at $url ";
    }
    return 0;
To this:
return $res->headers if $res->is_success;
$flag = 1;
$error .= "Error ...";
return 0;
- Within your for loop, change this:
if(exists($extens{$ext})){
To this, and remove one level of indentation.
next unless exists $extens{$ext};
- Change delimiters rather than escaping. And in this example the global modifier isn't necessary.
#if($newurl !~ /http::\/\/$/ig){
if($newurl !~ m#http:://$#i){
- $skips -= 1; vs. $skips--;?
- Change this:
print "\n\n\n$filename at $newurl\n";
print "File size: " . $header->content_length . " bytes\n" unless $flag == 1;
print "Last modified: " . localtime($header->last_modified) . "\n" unless $flag == 1;
To this:
print "\n\n\n$filename at $newurl\n";
printf("File size: %d bytes\nLast modified: %s\n",
    $header->content_length, scalar localtime($header->last_modified)) unless $flag == 1;
- Spelling and grammar...
#print"\nExtensions to download (seperated by comma's): ";
print"\nExtensions to download (separated by commas): ";
All great suggestions - they will definitely make the code more elegant. I haven't coded in Perl for a while; I guess I've forgotten how to write beautiful code :) Thanks for the help!
Re: RFC: Code testers/reviewers needed
by RichardK (Parson) on Jun 07, 2013 at 14:28 UTC
If you need this for your job, then why not just use curl?
curl has more features than you can shake a stick at, and just works -- no coding required :)
Because at the time curl wasn't working, courtesy of Windows 8 being awful. So I made this, eventually found out Windows 8 was what was breaking everything, and moved the program to a different computer running a different OS. By that point we had our own tool, so why not use it? :)
Re: RFC: Code testers/reviewers needed
by vsespb (Chaplain) on Jun 07, 2013 at 08:49 UTC
1. $header->content_length is not available for chunked transfers (according to the HTTP spec)
2. Why are you mixing decoded_content and content_file?
3. You are not checking whether the file downloaded correctly (see the mirror method in LWP::UserAgent)
4. is_success can give you false positives (though HEAD requests are probably OK in practice): https://github.com/libwww-perl/libwww-perl/issues/46 https://rt.cpan.org/Public/Bug/Display.html?id=85759
5. You can avoid HEAD requests entirely: with LWP callbacks you can get the length and last-modified time before the whole file is transferred
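To illustrate point 5, here is a rough sketch (the URL and output filename are placeholders) of reading the headers from inside a :content_cb callback. The headers are already parsed by the time the first chunk arrives, so size and last-modified are available without a separate HEAD request:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );
open my $out, '>:raw', 'downloaded.bin' or die "open: $!";

my $seen_headers = 0;
my $res = $ua->get(
    'http://example.com/file.bin',    # placeholder URL
    ':content_cb' => sub {
        my ( $chunk, $response ) = @_;
        # Report size/mtime once, on the first chunk, long before
        # the body has finished transferring.
        unless ( $seen_headers++ ) {
            print "Size: ", $response->content_length // '?', " bytes\n";
            print "Last modified: ",
                scalar localtime( $response->last_modified // 0 ), "\n";
        }
        print {$out} $chunk;          # stream the body to disk
    },
);
close $out or die "close: $!";
die $res->status_line unless $res->is_success;
```

Note that content_length can still be undef here for chunked transfers, per point 1, hence the defined-or fallback.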
All your points are good; I'll look into LWP callbacks so I can avoid HEAD requests. The mixing of decoded_content and content_file was due to me misunderstanding what decoded_content does. Can you elaborate a bit on what it does?
Documentation for decoded_content: http://search.cpan.org/perldoc?HTTP%3A%3AMessage
Note that in your case there is no content at all on the response object, since the file was downloaded via the content_file option.
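A small contrast of the two modes may make this clearer (URL and filenames are placeholders):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $url = 'http://example.com/file.pdf';    # placeholder

# Mode 1: stream the body straight to disk. The response object then
# carries no body, so calling decoded_content() on it returns nothing.
my $res = $ua->get( $url, ':content_file' => 'file.pdf' );

# Mode 2: keep the body in memory. decoded_content() undoes any
# Content-Encoding (e.g. gzip) and, for text, the charset for you.
# my $res  = $ua->get($url);
# my $body = $res->decoded_content;
```

Use one or the other: save to a file, or hold the (decoded) body in memory, but combining them as the script does has no effect.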