This code takes an HTML file with the results of a Google search and extracts the links to files of a specified type into a text file. I have found having such links in a text file useful for automating the download of files of a given type, say, PDF files.
First of all, let me tell you that, however typical, this is a nice example for a CUfP, given its instructive value. Also, while I'm the prototypical guy yelling at newbies not to parse HTML with regexen, I admit with some shame that in the past, when I needed this sort of thing, I used to do just that myself, in one-liners. Of course I was not 100% concerned about reliability in those cases. In any case, well done!
Then I have some remarks. First of all, without going into specific locations in the code proper: in some places you seem to factor out the "pdf" extension, as if to pave the way for an improvement letting the user specify one, while in others you hardcode it again. Also, the choice of variable names like @pdflist may be slightly misleading in the long run.
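Completing that factoring would be a one-liner, by the way, something like this (the pdf default being my own assumption):

# let the user name the extension, defaulting to pdf as before
my $filetype = shift || 'pdf';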
#usage googlestrip file:///C:/googlesearchresult.htm > urllist.txt
Don't you think you would want either to pass the program an actual file, to slurp in with Perl's own tools, or a generic URL to get off the web? Also... I see nothing that's Google-specific here, so you may have made the whole thingy more agnostic, name-wise.
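Were it mine, I'd accept both, along these lines (just a sketch; the URL-detection heuristic and the variable names are my own):

use LWP::Simple;

my $source = shift or die "Usage: $0 <file-or-URL>\n";
my $html;
if ($source =~ m{^\w+://}) {
    # looks like a URL: get it off the web
    defined($html = get $source) or die "Couldn't get <$source>\n";
}
else {
    # otherwise treat it as a local file and slurp it in
    open my $fh, '<', $source or die "Can't open $source: $!\n";
    $html = do { local $/; <$fh> };
}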
my $fileget = getstore($url,"tempfile.html");
Why store into an actual file on disc? Why not a simple get()? The file won't be huge anyway. But if you really want it on disc, why hardcode the name? (Without ever unlinking it, that I can see.) Why not use a File::Temp one instead?
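Just to sketch both alternatives I have in mind (variable names as in your code, otherwise my own):

use LWP::Simple;

# in memory, no file at all:
my $html = get $url;
defined $html or die "Couldn't get <$url>\n";

# or, if a file on disc is really wanted, let File::Temp pick
# the name and unlink it at program exit:
use File::Temp qw(tempfile);
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
getstore($url, $tmp_name);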
my $suffix = substr($element,$offset,$filetypelen);
if ($suffix =~ m/$filetype/) {
    push @pdflist, $element;
You know, sometimes I feel people tend to abuse regexen where other tools (like substr or index) would do, but in this particular case you've got it just the other way round. It smells slightly moronzillonic. Curiously enough, given your approach, the last test does use a match (with no \Q), whereas a simple eq would have been better suited.
Also, the whole thing boils down to a single grep, as sketched below.
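That is, something like this (where @elements stands in for whatever array you're looping over):

# keep only the links whose suffix is exactly the wanted extension
my @pdflist = grep { substr($_, -length($filetype)) eq $filetype } @elements;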
my @list = sort @pdflist;
And you have created yet another array just to hold some values, when one would have sufficed.
for my $url (@list) {
    next if ($url =~ m/\/s.*pdf/);
    print $url;
    print "\n";
}
Awkward regex (what are you trying to do, anyway? I suppose this is an ad hoc solution to a requirement of yours), and awkward flow control. Why do the check on the sorted list anyway, rather than right next to the previous one?
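Folded into a single pass it would read, say (again with @elements as a stand-in):

# filter, exclude, sort and print in one go
print map "$_\n",
      sort
      grep { !m{/s.*pdf} }                               # your ad hoc exclusion
      grep { substr($_, -length($filetype)) eq $filetype }
      @elements;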
All in all I'd rewrite your app in the following manner, which also behaves in a slightly different way:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::SimpleLinkExtor;

die "Usage: $0 URL <extension> [<extensions>]\n" unless @ARGV >= 2;

my $url = shift;

# build one anchored alternation out of all the given extensions,
# quotemeta'd so that they're matched literally
my $wanted = join '|', map quotemeta, @ARGV;
$wanted = qr/\.(?:$wanted)$/;

defined(my $html = get $url) or die "Couldn't get <$url>\n";

{
    local $, = "\n";    # separate the printed links with newlines
    print sort grep /$wanted/,
        HTML::SimpleLinkExtor->new->parse($html)->a;
}
__END__
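So that one would call it along these lines, for instance (the script name being made up, of course):

perl linkextract.pl http://www.example.com/results.html pdf ps > urllist.txt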