in reply to Extract file links from Google search results

This code takes an HTML file containing the results of a Google search and extracts the links to files of a specified type, writing them to a text file. I have found having such links in a text file useful for automating the download of files of a given type, say PDF files for example.

First of all, let me tell you that however typical the task, this is a nice example for a CUfP, given its instructive value. Also, while I'm the prototypical guy yelling at newbies not to parse HTML with regexen, with some shame I admit that in the past, when I needed "this sorta thing," I used to do exactly that myself, in one-liners. Of course I was not 100% concerned about reliability in those cases. In any case, well done!

Then I have some remarks. First of all, without pointing at specific locations in the code yet: in some places you factor out the "pdf" extension, which seems to pave the way for letting a user specify one, yet in others you hardcode it again. Also, the choice of a variable name like @pdflist may become slightly misleading in the long run if other extensions are ever supported.

#usage googlestrip file:///C:/googlesearchresult.htm > urllist.txt

Don't you think you would rather want to pass the program either an actual file to slurp in with Perl's own tools, or a generic URL to fetch off the web? Also... I see nothing here that's Google-specific, so you could have made the whole thingy more agnostic, name-wise.
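To illustrate the first alternative: a minimal sketch, assuming a locally saved results page, of slurping the file with plain Perl instead of round-tripping a file:// URL through LWP. The filename "results.htm" is a stand-in; a tiny sample file is created first so the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Create a small sample page so the sketch runs on its own.
my $file = 'results.htm';
open my $out, '>', $file or die "Can't write $file: $!";
print {$out} qq{<a href="http://example.com/a.pdf">a</a>\n};
close $out;

# The slurp itself: localizing $/ reads the whole file in one go.
my $html = do {
    open my $in, '<', $file or die "Can't open $file: $!";
    local $/;
    <$in>;
};
printf "slurped %d bytes\n", length $html;
unlink $file;
```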

my $fileget = getstore($url,"tempfile.html");

Why store it into an actual file on disc? Why not a simple get()? The file won't be huge anyway. But if you really want it on disc, why hardcode the name? (Without unlinking it afterwards, that I can see.) Why not use a File::Temp one instead?
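For the record, a sketch of the File::Temp route: the module hands back a unique file that is removed automatically at exit, instead of a hardcoded "tempfile.html" that lingers after the run. (With a plain get() none of this is needed; the content goes straight into a scalar.) The file contents here are a placeholder, not a real fetch.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# UNLINK => 1 makes File::Temp delete the file when the program exits,
# so no stale tempfile.html is left behind.
my ($fh, $tmpname) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<html>fetched content would go here</html>\n";
close $fh;
print "wrote temporary copy to $tmpname\n";
```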

my $suffix = substr($element,$offset,$filetypelen);
if ($suffix =~ m/$filetype/) {
    push @pdflist, $element;

You know, sometimes I feel people tend to abuse regexen where other tools (like substr or index) would do. But in this particular case you're doing just the opposite. It smells slightly moronzillonic. Curiously enough, given your approach, this last test does use a match (with no \Q on the interpolated $filetype), whereas a simple eq would have been better suited.
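To spell out that last point: once you've carved the tail of the name out with substr(), a plain eq is both clearer and safer than an unanchored, unquoted match. A minimal sketch with made-up filenames; note how "notapdf" (which merely ends in the letters "pdf") is correctly rejected when the dot is part of the comparison:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $filetype = 'pdf';
my $suffix   = ".$filetype";   # include the dot in the comparison
my @matches;
for my $element (qw(report.pdf notapdf page.html)) {
    # compare the last length($suffix) characters with eq, no regex needed
    push @matches, $element
        if substr($element, -length $suffix) eq $suffix;
}
print "@matches\n";   # report.pdf
```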

Also, the whole thing is much like a single grep.
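That is, the whole extract-the-matching-links loop collapses into one grep over the link list. Here @links simply stands in for whatever the link extractor returned:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One grep replaces the substr/if/push loop entirely.
my @links   = qw(a.pdf index.html b.pdf style.css);
my @pdflist = grep { /\.pdf$/ } @links;
print "$_\n" for @pdflist;
```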

my @list = sort @pdflist;

And you have created yet another array just to hold some values, when one would have sufficed.

for my $url (@list) {
    next if ($url =~ m/\/s.*pdf/);
    print $url;
    print "\n";
}

Awkward regex (what are you trying to do, anyway? I suppose this is an ad hoc solution to a requirement of yours) and awkward flow control. Why do the check on the sorted list anyway, rather than together with the previous one?
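In other words, both the extension filter and the exclusion can live in a single grep, applied before the sort, instead of a second loop over the sorted copy. A sketch with made-up link paths, keeping your /\/s.*pdf/ exclusion as-is:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Filter and exclude in one pass, then sort; no intermediate array.
my @links = qw(/docs/a.pdf /s/skip.pdf /docs/b.pdf);
my @list  = sort grep { /\.pdf$/ and not m{/s.*pdf} } @links;
print "$_\n" for @list;
```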

All in all I'd rewrite your app in the following manner, which also behaves in a slightly different way:

#!/usr/bin/perl

use strict;
use warnings;

use LWP::Simple;
use HTML::SimpleLinkExtor;

die "Usage: $0 URL <extension> [<extensions>]\n" unless @ARGV >= 2;

my $url    = shift;
my $wanted = join '|', map quotemeta, @ARGV;
$wanted = qr/\.(?:$wanted)$/;

defined(my $html = get $url) or die "Couldn't get <$url>\n";

{
    local $, = "\n";
    print sort grep /$wanted/,
        HTML::SimpleLinkExtor->new->parse($html)->a;
}

__END__
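The pattern-building part of the rewrite can be exercised on its own: each extension is quotemeta'd, joined with |, and anchored to a literal dot at the end of the string, so "notapdf" does not sneak through. Sample data is made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @exts   = ('pdf', 'ps');
my $wanted = join '|', map quotemeta, @exts;
$wanted = qr/\.(?:$wanted)$/;   # e.g. matches ".pdf" or ".ps" at end

my @links = qw(a.pdf b.ps c.html notapdf);
my @hits  = grep /$wanted/, @links;
print "@hits\n";   # a.pdf b.ps
```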

Replies are listed 'Best First'.
Re^2: Extract file links from Google search results
by Scott7477 (Chaplain) on Jun 25, 2007 at 13:56 UTC
    Thanks for the comments, blazar. I appreciate your taking the time to look at my code. I must admit that when it comes to regexes I am very much a newbie. With regard to getting generic URLs, I wanted specifically the link URLs that a Google search generates, which tend to be long hairballs that I didn't know how to handle well directly.

    In any case, your comments have given me some ideas about tools to use in the future.

    Ravenor
    See my Standard Code Disclaimer
      I must admit that when it comes to regexes I am very much a newbie.

      /me too; that's why I generally try not to be too smart with them. Of course, sometimes sharpening one's own skills is not a bad thing, and a look at the documentation is well worth it.

      With regard to getting generic URLs, I wanted specifically the link URLs that a Google search generates, which tend to be long hairballs that I didn't know how to handle well directly.

      Well, Google URLs can be as simple as http://www.google.com/search?q=cool+perl+stuff. In fact, notwithstanding FF's cool search box available at a cheap keybinding, I often find myself composing them manually. In that case, two parameters that happen to be useful for me are num and filter, as in num=100&filter=0. Of course, this has nothing to do with Perl...