Hello experts, I am trying to get data from the web using a web API. I can retrieve the data, but I only want a specific part of it: the identifiers that are PMIDs. The XML I am working with looks something like this:

<Post rdf:about="http://www.connotea.org/user/lrlucena/uri/111b2eeb65471b9866c833929901564b">
  <title>The structure of scientific collaboration networks.</title>
  <updated>2007-11-10T14:38:49Z</updated>
  <uri>
    <dcterms:URI rdf:about="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=11149952&query_hl=22&itool=pubmed_docsum">
      <dc:title>Entrez PubMed</dc:title>
      <link>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=11149952&query_hl=22&itool=pubmed_docsum</link>
      <hash>111b2eeb65471b9866c833929901564b</hash>
      <citation>
        <rdf:Description>
          <citationID>888849</citationID>
          <prism:title>From the Cover: The structure of scientific collaboration networks</prism:title>
          <dc:date>2001-01-16T00:00:00Z</dc:date>
          <journalID>212176</journalID>
          <prism:publicationName>Proc Natl Acad Sci U S A</prism:publicationName>
          <prism:endingPage>409</prism:endingPage>
          <doiResolver rdf:resource="http://dx.doi.org/10.1073/pnas.021544898"/>
          <dc:identifier>doi:10.1073/pnas.021544898</dc:identifier>
          <pmidResolver rdf:resource="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=11149952"/>
          <dc:identifier>PMID: 11149952</dc:identifier>
        </rdf:Description>
      </citation>
    </dcterms:URI>
  </uri>
</Post>

This is the code I am using to try to extract the identifiers:

use lib 'C:/Perl64/www-connotea-perl-0.1/lib/';
my $fn0="extracted-connotea-pubmedID-2.1.txt";
open (IN0, $fn0) or die "Can't open $fn0: $!\n";
open (FH, ">:utf8",'bookmarks_only_for_PubmedID.txt');
use lib '../lib';
use WWW::Connotea;
my $currentURI;
my $PMID;
my $c = WWW::Connotea->new( user => 'myusername', password => '........' );
$c->authenticate; ### dies if log-in credentials are incorrect
while (<IN0>) {
    my $currentURI = $_;
    chomp($currentURI);
    my @tags = $c->posts_for(uri =>"$currentURI");
    die "No candidate related articles\n" unless @tags;
    print FH "$currentURI\n";
    foreach my $tag (@tags) {
        print FH "PMID: ";
        my $boo = $tag->bookmark();
        my $foo = $boo->citation();
        print FH $foo->identifiers(), "\n";
        my $bar = grep(/PMID:^/, $foo->identifiers());
        print FH $bar, "\n";
        # if ($foo->identifiers() =~ m/(PMID:^)^.(\d+^)^/)
        # {
        #     print FH "$2\n";
        # }
    }
}
close IN0;
close FH;

My issue is not about getting the data. A source record sometimes has only a PMID and sometimes has two identifiers, a DOI and a PMID. When I run this code and call identifiers(), I get both of them, and the output looks something like this:

http://www.ncbi.nlm.nih.gov/pubmed/15754555
PMID: PMID: 15754555
http://www.ncbi.nlm.nih.gov/pubmed/4012367
PMID: PMID: 4012367
http://www.ncbi.nlm.nih.gov/pubmed/20215333
PMID: doi:10.1093/fampra/cmq003PMID: 20215333
http://www.ncbi.nlm.nih.gov/pubmed/20429974
PMID: PMID: 20429974
http://www.ncbi.nlm.nih.gov/pubmed/20338007
PMID: doi:10.1111/j.1600-0838.2009.01081.xPMID: 20338007
http://www.ncbi.nlm.nih.gov/pubmed/17438827
PMID: PMID: 17438827
http://www.ncbi.nlm.nih.gov/pubmed/17447555
PMID: PMID: 17447555
http://www.ncbi.nlm.nih.gov/pubmed/17450784
PMID: PMID: 17450784

I want the output to contain only the PMIDs, something like this:

http://www.ncbi.nlm.nih.gov/pubmed/15754555
PMID: 15754555
http://www.ncbi.nlm.nih.gov/pubmed/4012367
PMID: 4012367
http://www.ncbi.nlm.nih.gov/pubmed/20215333
PMID: 20215333
http://www.ncbi.nlm.nih.gov/pubmed/20429974
PMID: 20429974
http://www.ncbi.nlm.nih.gov/pubmed/20338007
PMID: 20338007
http://www.ncbi.nlm.nih.gov/pubmed/17438827
PMID: 17438827
http://www.ncbi.nlm.nih.gov/pubmed/17447555
PMID: 17447555
http://www.ncbi.nlm.nih.gov/pubmed/17450784
PMID: 17450784

I am trying to use a regex but am not quite sure where I am going wrong. Experts, please help me. Thank you, Sammed
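For reference, here is a minimal sketch of one way the PMID could be pulled out of the identifier strings. Two things in the code above are worth noting: grep assigned to a scalar returns the number of matching elements rather than the elements themselves, and a ^ in the middle of a pattern still anchors to the start of the string, so /PMID:^/ cannot match. The helper name extract_pmid is hypothetical and is not part of WWW::Connotea. The sketch assumes that identifiers() returns a list of strings such as 'doi:...' and 'PMID: 12345', which is what the output above suggests but which I have not confirmed against the module. The sample strings are copied from the output shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Connotea): return the digits of the
# first PMID found in a list of identifier strings, or nothing if none match.
sub extract_pmid {
    my @identifiers = @_;
    for my $id (@identifiers) {
        next unless defined $id;
        # Capture the digits after "PMID:", allowing optional whitespace.
        # No anchor is used, so this also works if the DOI and PMID arrive
        # concatenated in a single string.
        return $1 if $id =~ /PMID:\s*(\d+)/;
    }
    return;
}

# Sample identifier lists, taken from the output shown in the post.
my @samples = (
    [ 'PMID: 15754555' ],
    [ 'doi:10.1093/fampra/cmq003', 'PMID: 20215333' ],
    [ 'doi:10.1093/fampra/cmq003PMID: 20215333' ],
);

for my $ids (@samples) {
    my $pmid = extract_pmid(@$ids);
    print 'PMID: ', (defined $pmid ? $pmid : 'not found'), "\n";
}
```

Under the same assumption, the inner loop could call extract_pmid($foo->identifiers()) and print "PMID: $pmid\n" only when the result is defined, instead of printing the whole identifier list.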


In reply to Extracting web data by smandape1
