Hello experts, I am trying to get data from the web using a web API. I can retrieve the data, but I only want a specific part of it: the identifiers that are PMIDs. The XML I get back looks something like this:

<Post rdf:about="http://www.connotea.org/user/lrlucena/uri/111b2eeb65471b9866c833929901564b"><title>The structure of scientific collaboration networks.</title><updated>2007-11-10T14:38:49Z</updated>
<uri><dcterms:URI rdf:about="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=11149952&query_hl=22&itool=pubmed_docsum">
<dc:title>Entrez PubMed</dc:title><link>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=11149952&query_hl=22&itool=pubmed_docsum</link><hash>111b2eeb65471b9866c833929901564b</hash>
<citation><rdf:Description><citationID>888849</citationID><prism:title>From the Cover: The structure of scientific collaboration networks</prism:title>
<dc:date>2001-01-16T00:00:00Z</dc:date><journalID>212176</journalID>
<prism:publicationName>Proc Natl Acad Sci U S A</prism:publicationName><prism:endingPage>409</prism:endingPage>
<doiResolver rdf:resource="http://dx.doi.org/10.1073/pnas.021544898"/><dc:identifier>doi:10.1073/pnas.021544898</dc:identifier><pmidResolver rdf:resource="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=11149952"/><dc:identifier>PMID: 11149952</dc:identifier></rdf:Description></citation></dcterms:URI></uri></Post>

I am trying to extract the identifiers with code that looks something like this:

use lib 'C:/Perl64/www-connotea-perl-0.1/lib/';
my $fn0="extracted-connotea-pubmedID-2.1.txt";
open (IN0, $fn0) or
    die "Can't open $fn0: $!\n";
open (FH, ">:utf8",'bookmarks_only_for_PubmedID.txt');
use lib '../lib';
use WWW::Connotea;
my $currentURI;
my $PMID;
my $c = WWW::Connotea->new(  user => 'myusername', password => '........' );
$c->authenticate;   ###  dies if log-in credentials are incorrect
while (<IN0>)
{
    my $currentURI = $_;
    chomp($currentURI);
    my @tags = $c->posts_for(uri =>"$currentURI");
    die "No candidate related articles\n" unless @tags;
    print FH "$currentURI\n";
    foreach my $tag (@tags) {
        print FH "PMID: ";
        my $boo = $tag->bookmark();
        my $foo = $boo->citation();
        print FH $foo->identifiers(), "\n";
        my $bar = grep(/PMID:^/, $foo->identifiers());
        print FH $bar, "\n";
        # if ($foo->identifiers() =~ m/(PMID:^)^.(\d+^)^/)
        # {
        #     print FH "$2\n";
        # }
    }
}
close IN0;
close FH;

My issue is not with getting the data. A record in the source sometimes has only the PMID and sometimes has two identifiers, a doi and a PMID. When I run this code, identifiers() gives me both of them, and the output looks something like this:

http://www.ncbi.nlm.nih.gov/pubmed/15754555
PMID: PMID: 15754555
http://www.ncbi.nlm.nih.gov/pubmed/4012367
PMID: PMID: 4012367
http://www.ncbi.nlm.nih.gov/pubmed/20215333
PMID: doi:10.1093/fampra/cmq003PMID: 20215333
http://www.ncbi.nlm.nih.gov/pubmed/20429974
PMID: PMID: 20429974
http://www.ncbi.nlm.nih.gov/pubmed/20338007
PMID: doi:10.1111/j.1600-0838.2009.01081.xPMID: 20338007
http://www.ncbi.nlm.nih.gov/pubmed/17438827
PMID: PMID: 17438827
http://www.ncbi.nlm.nih.gov/pubmed/17447555
PMID: PMID: 17447555
http://www.ncbi.nlm.nih.gov/pubmed/17450784
PMID: PMID: 17450784

I want the output to contain only the PMIDs, something like this:

http://www.ncbi.nlm.nih.gov/pubmed/15754555
PMID: 15754555
http://www.ncbi.nlm.nih.gov/pubmed/4012367
PMID: 4012367
http://www.ncbi.nlm.nih.gov/pubmed/20215333
PMID: 20215333
http://www.ncbi.nlm.nih.gov/pubmed/20429974
PMID: 20429974
http://www.ncbi.nlm.nih.gov/pubmed/20338007
PMID: 20338007
http://www.ncbi.nlm.nih.gov/pubmed/17438827
PMID: 17438827
http://www.ncbi.nlm.nih.gov/pubmed/17447555
PMID: 17447555
http://www.ncbi.nlm.nih.gov/pubmed/17450784
PMID: 17450784

I am trying to use a regex, but I am not sure where I am going wrong. Experts, please help me. Thank you, Sammed
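
For reference, here is a minimal sketch of one way the filtering could be done. It assumes that identifiers() returns a list of plain strings such as 'doi:...' and 'PMID: ...', which the run-together output above suggests but which has not been checked against the WWW::Connotea documentation. The helper name extract_pmids is hypothetical, and the sample data is copied from the output shown in this post. Two things in the posted code are worth noting: grep in scalar context returns the number of matching elements rather than the elements themselves, and a pattern like /PMID:^/ cannot match, because ^ (without the /m modifier) only matches at the start of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: return the numeric PMIDs found in a list of
# identifier strings such as 'doi:10.1093/fampra/cmq003' or 'PMID: 20215333'.
sub extract_pmids {
    my @identifiers = @_;
    # In list context, a match with one capture group returns the captured
    # text on success and an empty list on failure, so entries that are not
    # PMIDs simply drop out of the result.
    return map { /^\s*PMID:\s*(\d+)/ } @identifiers;
}

# Sample data copied from the output shown in this post.
my @sample = ('doi:10.1093/fampra/cmq003', 'PMID: 20215333');

print "PMID: $_\n" for extract_pmids(@sample);
```

Inside the foreach loop this would probably take the place of the three print/grep lines, for example `print FH "PMID: $_\n" for extract_pmids($foo->identifiers());`, with the separate `print FH "PMID: ";` removed so the label is not printed twice.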

In reply to Extracting web data by smandape1
