This code takes an HTML file with the results of a Google search and extracts the links to files of a specified type into a text file. I have found having such links in a text file useful for automating the download of files of a given type, say, PDF files.
First of all, let me tell you that, however typical, this is a nice example for a CUfP, given its instructive value. Also, while I'm the prototypical guy yelling at newbies not to parse HTML with regexen, I admit with some shame that in the past, when I needed this sort of thing, I used to do just that myself, in one-liners. Of course I was not 100% concerned about reliability in those cases. In any case, well done!
Then I have some remarks. First of all, without going into specific locations in the code proper: in some places you seem to factor out the "pdf" extension, as if to pave the way for an improvement letting the user specify one, while in others you hardcode it again. Also, the choice of variable names like @pdflist may be slightly misleading in the long run.
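Completing that factoring would be a one-liner, by the way, something like this (the pdf default being my own assumption):

# let the user name the extension, defaulting to pdf as before
my $filetype = shift || 'pdf';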
#usage googlestrip file:///C:/googlesearchresult.htm > urllist.txt
Don't you think you would want either to pass the program an actual file, to slurp in with Perl's own tools, or a generic URL to get off the web? Also... I see nothing that's Google-specific here, so you may have made the whole thingy more agnostic, name-wise.
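Were it mine, I'd accept both, along these lines (just a sketch; the URL-detection heuristic and the variable names are my own):

use LWP::Simple;

my $source = shift or die "Usage: $0 <file-or-URL>\n";
my $html;
if ($source =~ m{^\w+://}) {
    # looks like a URL: get it off the web
    defined($html = get $source) or die "Couldn't get <$source>\n";
}
else {
    # otherwise treat it as a local file and slurp it in
    open my $fh, '<', $source or die "Can't open $source: $!\n";
    $html = do { local $/; <$fh> };
}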
my $fileget = getstore($url,"tempfile.html");
Why store into an actual file on disc? Why not a simple get()? The file won't be huge anyway. But if you really want it on disc, why hardcode the name? (Without ever unlinking it, that I can see.) Why not use a File::Temp one instead?
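Just to sketch both alternatives I have in mind (variable names as in your code, otherwise my own):

use LWP::Simple;

# in memory, no file at all:
my $html = get $url;
defined $html or die "Couldn't get <$url>\n";

# or, if a file on disc is really wanted, let File::Temp pick
# the name and unlink it at program exit:
use File::Temp qw(tempfile);
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
getstore($url, $tmp_name);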
my $suffix = substr($element,$offset,$filetypelen);
if ($suffix =~ m/$filetype/) {
    push @pdflist, $element;
You know, sometimes I feel people tend to abuse regexen where other tools (like substr or index) would do, but in this particular case you've got it just the other way round. It smells slightly moronzillonic. Curiously enough, given your approach, the last test does use a match (with no \Q), whereas a simple eq would have been better suited.
Also, the whole thing boils down to a single grep, as sketched below.
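That is, something like this (where @elements stands in for whatever array you're looping over):

# keep only the links whose suffix is exactly the wanted extension
my @pdflist = grep { substr($_, -length($filetype)) eq $filetype } @elements;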
my @list = sort @pdflist;
And you have created yet another array just to hold some values, when one would have sufficed.
for my $url (@list) {
    next if ($url =~ m/\/s.*pdf/);
    print $url;
    print "\n";
}
Awkward regex (what are you trying to do, anyway? I suppose this is an ad hoc solution to a requirement of yours), and awkward flow control. Why do the check on the sorted list anyway, rather than right next to the previous one?
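Folded into a single pass it would read, say (again with @elements as a stand-in):

# filter, exclude, sort and print in one go
print map "$_\n",
      sort
      grep { !m{/s.*pdf} }                               # your ad hoc exclusion
      grep { substr($_, -length($filetype)) eq $filetype }
      @elements;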
All in all I'd rewrite your app in the following manner, which also behaves in a slightly different way:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::SimpleLinkExtor;

die "Usage: $0 URL <extension> [<extensions>]\n" unless @ARGV >= 2;

my $url = shift;

# build one anchored alternation out of all the given extensions,
# quotemeta'd so that they're matched literally
my $wanted = join '|', map quotemeta, @ARGV;
$wanted = qr/\.(?:$wanted)$/;

defined(my $html = get $url) or die "Couldn't get <$url>\n";

{
    local $, = "\n";    # separate the printed links with newlines
    print sort grep /$wanted/,
        HTML::SimpleLinkExtor->new->parse($html)->a;
}
__END__
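So that one would call it along these lines, for instance (the script name being made up, of course):

perl linkextract.pl http://www.example.com/results.html pdf ps > urllist.txt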