Re: Download references list in pdf format with script

Working on the assumption that each reference search will turn up only one PDF (which I'm not entirely convinced of), the following code should give you a starting point.

#!/usr/bin/env perl

use strict;
use warnings;

use HTTP::Request;
use LWP::UserAgent;
use URI::Escape;
use File::Basename;

our $VERSION = '0.001';

my $agent_name = join '/' => basename($0), $VERSION;
my $query_base = 'https://duckduckgo.com/html/?q=';
my $pdf_re = qr{href="([^"]+\.pdf)"};

my $ua = LWP::UserAgent->new(agent => $agent_name);

while (<DATA>) {
    chomp;
    my $req = HTTP::Request->new(GET => $query_base . uri_escape($_));
    # A GET has no body, so ask for HTML via Accept rather than setting Content-Type.
    $req->header(Accept => 'text/html');

    my $res = $ua->request($req);

    if ($res->is_success) {
        print "Search successful.\n";

        if ($res->content =~ $pdf_re) {
            my $pdf_url = $1;
            print "PDF found: $pdf_url\n";
            process_pdf_url($pdf_url);
        }
        else {
            print "PDF not found!\n";
        }
    }
    else {
        print $res->status_line, "\n";
    }
}

sub process_pdf_url {
    my $pdf_url = shift;

    print "Stub - download $pdf_url,\n\trename, upload to database, etc.\n";

    return;
}

__DATA__
1.    Abilez O, Benharash P, Mehrotra M, Miyamoto E, Gale A, Picquet J, Xu C, Zarins C (2006) A novel culture system shows that stem cells can be grown in 3D and under physiologic pulsatile conditions for tissue engineering of vascular grafts. J Surg Res 132:170-178.

Output:

$ pm_web_search_pdf.pl
Search successful.
PDF found: http://med.stanford.edu/arts/arts_students/CVs/CV_abilez_092007.pdf
Stub - download http://med.stanford.edu/arts/arts_students/CVs/CV_abilez_092007.pdf,
    rename, upload to database, etc.
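The stub could be filled in along the following lines. This is only a minimal sketch, not a finished implementation: local_name_for and the extra $ua and $dir arguments are hypothetical additions, $ua is assumed to be the LWP::UserAgent object created in the script, and $dir is assumed to be an existing, writable directory. The rename and database upload steps are left out.

```perl
use strict;
use warnings;

use File::Basename qw(basename);
use File::Spec;

# Derive a local file name from a PDF URL (query string and fragment dropped).
sub local_name_for {
    my ($pdf_url) = @_;
    (my $path = $pdf_url) =~ s{[?#].*\z}{}s;
    my $name = basename($path);
    $name =~ s{[^\w.\-]+}{_}g;    # keep the name filesystem-friendly
    return length $name ? $name : 'download.pdf';
}

# Hypothetical replacement for the stub: returns the saved file name,
# or nothing if the download failed or the response was not a PDF.
sub process_pdf_url {
    my ($ua, $pdf_url, $dir) = @_;
    my $file = File::Spec->catfile($dir, local_name_for($pdf_url));

    # ':content_file' makes LWP write the response body straight to disk.
    my $res = $ua->get($pdf_url, ':content_file' => $file);

    if (!$res->is_success) {
        warn "Download failed for $pdf_url: ", $res->status_line, "\n";
        return;
    }

    # A URL ending in .pdf can still return an HTML login or paywall page.
    my $type = $res->content_type // '';
    if ($type ne 'application/pdf') {
        warn "Not a PDF ($pdf_url): $type\n";
        unlink $file;
        return;
    }

    return $file;
}
```

In the main loop the call would then become process_pdf_url($ua, $pdf_url, '.'). Checking the Content-Type is a cheap way to notice when a ".pdf" link actually serves something else.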

-- Ken

Re^2: Download references list in pdf format with script by bitingduck (Chaplain) on Oct 26, 2012 at 03:31 UTC
I suspect your code already runs into one of the big problems that the OP will have: if the OP is looking for the paper that's referenced, rather than things that contain the reference, it's likely to be behind a paywall. The simple "grab the first PDF" approach is likely to get some combination of papers that cite the one the OP is looking for (and which may themselves be behind paywalls) and CVs of the authors (which is what you snagged).
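One way to soften the "first PDF" problem is to collect every PDF link on the results page and look at all of them before choosing. The sketch below is only an illustration under stated assumptions: pdf_links and rank_pdf_links are hypothetical helpers, the match is made case-insensitive (the original regex is not), and the "looks like a CV" test is a guess based on the file name alone. It does nothing about paywalls.

```perl
use strict;
use warnings;

# Return every distinct href ending in .pdf, in the order found.
sub pdf_links {
    my ($html) = @_;
    my %seen;
    return grep { !$seen{$_}++ } $html =~ m{href="([^"]+\.pdf)"}gi;
}

# Hypothetical heuristic: a "cv" or "resume" path component suggests an
# author CV rather than the paper itself.
sub looks_like_cv {
    my ($url) = @_;
    return $url =~ m{(?:^|[/_\-.])(?:cv|resume)(?:[/_\-.]|$)}i ? 1 : 0;
}

# Keep the original order, but move probable CVs to the end of the list.
sub rank_pdf_links {
    my @links = @_;
    return ( ( grep { !looks_like_cv($_) } @links ),
             ( grep {  looks_like_cv($_) } @links ) );
}
```

In the script, my @candidates = rank_pdf_links(pdf_links($res->content)); would replace the single match against $pdf_re, leaving the choice among candidates to a later step or to a human.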
Re^3: Download references list in pdf format with script by kcott (Archbishop) on Oct 26, 2012 at 04:06 UTC
The OP seemed to think that his references would find a direct match; I said I wasn't convinced of this assumption. It would probably be more useful to convey your knowledge of paywalls, etc. to the OP rather than to me. I just wrote some code based on the information provided. :-) -- Ken
Re^4: Download references list in pdf format with script by bitingduck (Chaplain) on Oct 26, 2012 at 04:23 UTC
And a pretty decent start for him indeed. Unfortunately, all of my experience dealing with papers behind paywalls is from searching by hand and having to go in to work to download them. Some of them don't work even when you're VPN'd into the network that has a license. There's enough information in all the refs to find them, and with any luck the OP is running inside a university or someplace that has a license and can constrain the search to PubMed or a similar archive.
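Constraining the search could be done in the query string itself. The following is a rough sketch that assumes the search engine honors the filetype: and site: operators on the endpoint used by the script; build_query is a hypothetical helper, and the host name in the usage note is only an illustration, not a recommendation of where the papers actually live.

```perl
use strict;
use warnings;

# Build a search query from one line of the reference list.
# $site is optional: a host name to restrict the search to.
sub build_query {
    my ($reference, $site) = @_;
    (my $query = $reference) =~ s{\A\s*\d+\.\s*}{};    # drop the "1." list number
    $query =~ s{\s+}{ }g;                               # collapse runs of whitespace
    $query =~ s{\s+\z}{};                               # trim the end (newline included)
    $query .= ' filetype:pdf';
    $query .= " site:$site" if defined $site && length $site;
    return $query;
}
```

In the script, the request URL would then be built as $query_base . uri_escape(build_query($_, 'ncbi.nlm.nih.gov')), or with no second argument to search everywhere.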


	PerlMonks