Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: Download references list in pdf format with script

by kcott (Archbishop)
on Oct 26, 2012 at 03:04 UTC ( [id://1000982]=note: print w/replies, xml ) Need Help??


in reply to Download references list in pdf format with script

Working on the assumption that the references will only find one PDF (which I'm not entirely convinced of), the following code should give you a starting point.

#!/usr/bin/env perl use strict; use warnings; use LWP::UserAgent; use URI::Escape; use File::Basename; our $VERSION = '0.001'; my $agent_name = join '/' => basename($0), $VERSION; my $query_base = 'https://duckduckgo.com/html/?q='; my $pdf_re = qr{href="([^"]+\.pdf)"}; my $ua = LWP::UserAgent->new(agent => $agent_name); while (<DATA>) { chomp; my $req = HTTP::Request->new(GET => $query_base . uri_escape($_)); $req->content_type('text/html'); my $res = $ua->request($req); if ($res->is_success) { print "Search successful.\n"; if ($res->content =~ $pdf_re) { my $pdf_url = $1; print "PDF found: $pdf_url\n"; process_pdf_url($pdf_url); } else { print "PDF not found!\n"; } } else { print $res->status_line, "\n"; } } sub process_pdf_url { my $pdf_url = shift; print "Stub - download $pdf_url,\n\trename, upload to database, et +c.\n"; return; } __DATA__ 1. Abilez O, Benharash P, Mehrotra M, Miyamoto E, Gale A, Picquet J +, Xu C, Zarins C (2006) A novel culture system shows that stem cells +can be grown in 3D and under physiologic pulsatile conditions for tis +sue engineering of vascular grafts. J Surg Res 132:170-178.

Output:

$ pm_web_search_pdf.pl Search successful. PDF found: http://med.stanford.edu/arts/arts_students/CVs/CV_abilez_09 +2007.pdf Stub - download http://med.stanford.edu/arts/arts_students/CVs/CV_abil +ez_092007.pdf, rename, upload to database, etc.

-- Ken

Replies are listed 'Best First'.
Re^2: Download references list in pdf format with script
by bitingduck (Chaplain) on Oct 26, 2012 at 03:31 UTC

    I suspect your code already runs into one of the big problems that the OP will have-- if OP is looking for the paper that's referenced, rather than things that contain the reference, it's likely to be behind a paywall. The simple "grab the first pdf" is likely to get some combination of papers that reference the paper the OP is looking for, and which may be behind paywalls, or CV's of the authors (which you snagged).

      The OP seemed to think that his references would find a direct match; I said I wasn't convinced of this assumption. It would probably be more useful to convey your knowledge of paywalls, etc. to the OP rather than to me.

      I just wrote some code based on the information provided. :-)

      -- Ken

        And a pretty decent start for him indeed. Unfortunately all of my experience dealing with papers behind paywalls is from hand searching, and having to go in to work to dl them. Some of them don't work when you're VPN'd into the network that has a license. There's enough information in all the refs to find them, and with any luck the OP is running inside a university or someplace that has a license and can constrain the search to PubMed or a similar archive.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1000982]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (6)
As of 2024-04-19 06:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found