Extracting web data

smandape1 has asked for the wisdom of the Perl Monks concerning the following question:

Hello experts, I am trying to get data from the web using a web API. I am getting the data, but I only want a specific part of it: the identifiers that are PMIDs. The XML I get back looks something like this:

 <Post rdf:about="http://www.connotea.org/user/lrlucena/uri/111b2eeb65471b9866c833929901564b"><title>The structure of scientific collaboration networks.</title><updated>2007-11-10T14:38:49Z</updated>
<uri><dcterms:URI rdf:about="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=11149952&query_hl=22&itool=pubmed_docsum">
<dc:title>Entrez PubMed</dc:title><link>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=11149952&query_hl=22&itool=pubmed_docsum</link><hash>111b2eeb65471b9866c833929901564b</hash>
<citation><rdf:Description><citationID>888849</citationID><prism:title>From the Cover: The structure of scientific collaboration networks</prism:title>
<dc:date>2001-01-16T00:00:00Z</dc:date><journalID>212176</journalID>
<prism:publicationName>Proc Natl Acad Sci U S A</prism:publicationName><prism:endingPage>409</prism:endingPage>
<doiResolver rdf:resource="http://dx.doi.org/10.1073/pnas.021544898"/><dc:identifier>doi:10.1073/pnas.021544898</dc:identifier><pmidResolver rdf:resource="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=11149952"/><dc:identifier>PMID: 11149952</dc:identifier></rdf:Description></citation></dcterms:URI></uri></Post>

I am trying to extract the identifiers with code that looks something like this:

use lib 'C:/Perl64/www-connotea-perl-0.1/lib/';
my $fn0="extracted-connotea-pubmedID-2.1.txt";
open (IN0, $fn0) or
    die "Can't open $fn0: $!\n";
open (FH, ">:utf8",'bookmarks_only_for_PubmedID.txt');
use lib '../lib';
use WWW::Connotea;
my $currentURI;
my $PMID;
my $c = WWW::Connotea->new(  user => 'myusername', password => '........' );
    $c->authenticate;   ###  dies if log-in credentials are incorrect
while (<IN0>)
{  
    my $currentURI = $_;
    chomp($currentURI);
   my @tags = $c->posts_for(uri =>"$currentURI");
   die "No candidate related articles\n" unless @tags;
    print FH "$currentURI\n";
    foreach my $tag (@tags) {
     print FH "PMID: ";
     my $boo = $tag->bookmark();
     my $foo = $boo->citation();
     print FH $foo->identifiers(), "\n";
     my $bar = grep(/PMID:^/, $foo->identifiers());
     print FH $bar, "\n";
     # if ($foo->identifiers() =~ m/(PMID:^)^.(\d+^)^/)
         # {
            # print FH "$2\n";
         # }

}
}
close IN0;
close FH;

My issue is not about getting the data. A source record sometimes has only a PMID and sometimes has two identifiers, a DOI and a PMID. When I run this code and print identifiers(), I get both of them, and the output looks something like this:

http://www.ncbi.nlm.nih.gov/pubmed/15754555
PMID: PMID: 15754555
http://www.ncbi.nlm.nih.gov/pubmed/4012367
PMID: PMID: 4012367
http://www.ncbi.nlm.nih.gov/pubmed/20215333
PMID: doi:10.1093/fampra/cmq003PMID: 20215333
http://www.ncbi.nlm.nih.gov/pubmed/20429974
PMID: PMID: 20429974
http://www.ncbi.nlm.nih.gov/pubmed/20338007
PMID: doi:10.1111/j.1600-0838.2009.01081.xPMID: 20338007
http://www.ncbi.nlm.nih.gov/pubmed/17438827
PMID: PMID: 17438827
http://www.ncbi.nlm.nih.gov/pubmed/17447555
PMID: PMID: 17447555
http://www.ncbi.nlm.nih.gov/pubmed/17450784
PMID: PMID: 17450784

I want the output to contain only the PMIDs, something like this:

http://www.ncbi.nlm.nih.gov/pubmed/15754555
PMID: 15754555
http://www.ncbi.nlm.nih.gov/pubmed/4012367
PMID: 4012367
http://www.ncbi.nlm.nih.gov/pubmed/20215333
PMID: 20215333
http://www.ncbi.nlm.nih.gov/pubmed/20429974
PMID: 20429974
http://www.ncbi.nlm.nih.gov/pubmed/20338007
PMID: 20338007
http://www.ncbi.nlm.nih.gov/pubmed/17438827
PMID: 17438827
http://www.ncbi.nlm.nih.gov/pubmed/17447555
PMID: 17447555
http://www.ncbi.nlm.nih.gov/pubmed/17450784
PMID: 17450784

I am trying to use a regex, but I am not quite sure where I am going wrong. Experts, please help me. Thank you, Sammed


Replies are listed 'Best First'.
Re: Extracting web data by zek152 (Pilgrim) on Jun 13, 2011 at 14:19 UTC
Your use of '^' is causing problems. '^' marks the beginning of a line and '$' marks the end of a line ("the buck stops here"). If the PMID is always numeric, simply use: `if ($foo->identifiers() =~ /PMID: (\d+)/) { print FH "$1\n"; }` Update: reworded post.
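A minimal sketch of the same idea applied to each identifier separately, assuming `identifiers()` returns a list of strings such as `doi:...` and `PMID: 20215333` (the thread does not confirm what it returns). The helper name `extract_pmids` and the `@sample` data are hypothetical, made up for illustration; the sample strings mirror the output shown in the question. Here '^' is used in its proper place, at the start of the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the digits out of every "PMID: <digits>"
# entry in a list of identifier strings, ignoring DOIs and anything else.
sub extract_pmids {
    my @identifiers = @_;
    # For each element, return the captured digits on a match
    # and an empty list when the element is not a PMID.
    return map { /^PMID:\s*(\d+)/ ? $1 : () } @identifiers;
}

# Stand-in for what $foo->identifiers() is assumed to return.
my @sample = ('doi:10.1093/fampra/cmq003', 'PMID: 20215333');

print "PMID: $_\n" for extract_pmids(@sample);
```

Matching element by element avoids depending on how the list behaves in scalar context. It also sidesteps a pitfall in the original attempt: `grep` assigned to a scalar returns the number of matching elements, not the matching strings.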
Re^2: Extracting web data by smandape1 (Acolyte) on Jun 13, 2011 at 14:37 UTC
Thank you for your reply. I tried the code above, but it doesn't seem to work. I am getting the same output, shown below.

http://www.ncbi.nlm.nih.gov/pubmed/4012367
PMID: PMID: 4012367
http://www.ncbi.nlm.nih.gov/pubmed/20215333
PMID: doi:10.1093/fampra/cmq003PMID: 20215333
http://www.ncbi.nlm.nih.gov/pubmed/20429974
PMID: PMID: 20429974
http://www.ncbi.nlm.nih.gov/pubmed/20338007
PMID: doi:10.1111/j.1600-0838.2009.01081.xPMID: 20338007
http://www.ncbi.nlm.nih.gov/pubmed/17438827
PMID: PMID: 17438827
http://www.ncbi.nlm.nih.gov/pubmed/17447555
PMID: PMID: 17447555

Also, the PMID is always a number.
Re^3: Extracting web data by zek152 (Pilgrim) on Jun 13, 2011 at 14:42 UTC
Did you uncomment the if statement and comment out the other print statements? The code I posted cannot print non-numeric data. If the regular expression I provided does not match anything, then please provide exactly what `$foo->identifiers()` prints.
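One way to show exactly what the call returns is to dump the value instead of printing it. `print` with a list writes the elements back to back (the output field separator `$,` is empty by default), which would explain output such as `doi:10.1093/fampra/cmq003PMID: 20215333`. A sketch, assuming `identifiers()` returns a list; the `@identifiers` array below is a hypothetical stand-in for the real call.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical stand-in for: my @identifiers = $foo->identifiers();
my @identifiers = ('doi:10.1093/fampra/cmq003', 'PMID: 20215333');

# Use the more compact indentation style for the dump.
local $Data::Dumper::Indent = 1;

# Dumper quotes each element separately, so element boundaries are visible.
my $dump = Dumper(\@identifiers);
print $dump;

# For comparison: this is what a plain print of the list writes,
# the elements concatenated with nothing between them.
my $joined = join '', @identifiers;
print "$joined\n";
```

If the dump shows two separate elements, the regex should be applied to each element rather than to the call as a whole.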
Re^4: Extracting web data by Anonymous Monk on Jun 13, 2011 at 17:02 UTC
Re^5: Extracting web data by zek152 (Pilgrim) on Jun 13, 2011 at 17:12 UTC