PerlMonks  

Meta Data Harvesting Using perl

by jarviscrypter (Initiate)
on Jun 11, 2020 at 18:06 UTC (#11117952)

jarviscrypter has asked for the wisdom of the Perl Monks concerning the following question:

I have been trying to harvest metadata using the OAI-PMH protocol and a Perl-based harvester called 'Harvey', but I am stuck on the following issue. My last resumptionToken is "resumptionToken=2019-01-22T00:51:30Z!2037-01-01T00:00:00Z!!oai_dc!6045389!7351939!oai:union.ndltd.org:IBICT/aoi:localhost:jspui/2251". Where in the Perl code should I add it so that the harvest continues from that point? In the code below, the variable $resumptionToken appears in four places (lines 12, 21, 52 and 57 of my script).

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

$| = 1;    # unbuffered output so progress messages appear immediately

my $baseURL  = 'http://union.ndltd.org/OAI-PMH/';
my $filename = 'aaaaafna';
# Token from the last saved response; set to '' to start a fresh harvest.
my $resumptionToken = '2019-01-22T00:51:30Z!2037-01-01T00:00:00Z!!oai_dc!6045389!7351939!oai:union.ndltd.org:IBICT/aoi:localhost:jspui/2251';

my $ua = LWP::UserAgent->new;
# before running this script, execute:
# export http_proxy=http://localhost:<port>/ where <port> is your cntlm port
$ua->env_proxy();

do {
    my $reqURL = $baseURL . '?verb=ListRecords&'
        . (($resumptionToken eq '')
            ? 'metadataPrefix=oai_dc'
            : 'resumptionToken=' . $resumptionToken);
    # my $reqURL = $baseURL.'?verb=Identify';
    my $req = HTTP::Request->new(GET => $reqURL);
    print "Harvesting $reqURL\n";

    my $state = 0;
    my $res;
    while ($state == 0) {
        $res = $ua->request($req);
        if ($res->code == 503) {
            my $sleep = $res->header('Retry-After');
            # '!' rather than 'not': 'not' binds more loosely than '||'
            # and would negate the whole condition.
            if (!defined($sleep) || ($sleep < 0) || ($sleep > 86400)) {
                $state = 1;
            }
            else {
                print "Sleeping for $sleep seconds\n";
                sleep($sleep);
            }
        }
        else {
            $state = 1;
        }
    }

    my $content = $res->content;
    my $records = (split(/<metadata>/, $content)) - 1;
    print "Saving response with $records records to $filename.xml\n";
    open(my $fh, '>', "$filename.xml") or die "Cannot open $filename.xml: $!";
    print $fh $content;
    close($fh);
    $filename++;    # string increment: 'aaaaafna' becomes 'aaaaafnb'

    # Pick up the token for the next batch; it stays empty when the list is complete.
    $resumptionToken = '';
    if ($content =~ /<resumptionToken[^>]*>([^<]+)<\/resumptionToken>/) {
        $resumptionToken = $1;
    }
} while ($resumptionToken ne '');
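For reference, here is a minimal sketch of how the script turns a saved token into the next request URL and picks up the following token from a response. The helper names (encode_component, list_records_url, next_token) are hypothetical and not part of Harvey; the percent-encoding step is an addition that assumes the server decodes escaped query values as usual, and it only handles ASCII tokens. In OAI-PMH, resumptionToken is an exclusive argument, so metadataPrefix is left out when resuming.

```perl
use strict;
use warnings;

# Percent-encode every character outside the RFC 3986 unreserved set.
# Assumption: the token is plain ASCII, as in the post above.
sub encode_component {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Build a ListRecords URL. A non-empty token resumes an earlier harvest;
# an empty token starts a fresh one with metadataPrefix=oai_dc.
sub list_records_url {
    my ($base_url, $token) = @_;
    my $args = $token eq ''
        ? 'metadataPrefix=oai_dc'
        : 'resumptionToken=' . encode_component($token);
    return "$base_url?verb=ListRecords&$args";
}

# Return the next token found in a response body, or '' when there is none.
sub next_token {
    my ($xml) = @_;
    return $xml =~ m{<resumptionToken[^>]*>([^<]+)</resumptionToken>} ? $1 : '';
}

my $base  = 'http://union.ndltd.org/OAI-PMH/';
my $saved = '2019-01-22T00:51:30Z!2037-01-01T00:00:00Z!!oai_dc!6045389!7351939!'
          . 'oai:union.ndltd.org:IBICT/aoi:localhost:jspui/2251';

print list_records_url($base, ''), "\n";        # fresh harvest
print list_records_url($base, $saved), "\n";    # resume from the saved token
```

In the script above, the only place a saved token needs to go is the initial value of $resumptionToken; the other occurrences just read it to build the URL or overwrite it with the token from the latest response.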

Replies are listed 'Best First'.
Re: Meta Data Harvesting Using perl
by marto (Cardinal) on Jun 11, 2020 at 19:42 UTC

Node Type: perlquestion [id://11117952]
Approved by mhearse