
Hello Experts! I am trying to extract data from the web using a Web API. The response I am working with is in RDF/XML format, which cannot be uploaded here, so I put it on Google Docs. Please access the file through this link: https://docs.google.com/document/d/12sQnToF4Vzr3lKl5oxyEVwggCEEMKmeWIcnvBxUJR5g/edit?hl=en_US&authkey=CLfQkZUB

I am trying to extract the title (either dc:title or prism:title), the PMID, the users (creator) with their respective tags (subject), and the authors (foaf:name) from this file using Perl. The rdf.xml file is just an example. If there is more than one user (creator), as in this file, the title, authors, and PMID are repeated for every user. I want each title, author list, and PMID only once, with the duplicates removed. I am new to Perl programming. The Perl code is as follows:

use lib 'C:/Perl64/www-connotea-perl-0.1/lib/';

my $fn0="extracted-connotea-pubmedID-1.txt";
open (IN0, $fn0) or
    die "Can't open $fn0: $!\n";
    

open (FH, ">:utf8", 'title_pmid_users_tags.txt') or
    die "Can't open title_pmid_users_tags.txt: $!\n";

###
# Modules Used
###

use lib '../lib';
use WWW::Connotea;


###
# Stage 0: Supply log-in credentials and authenticate
###

my $currentURI;


###
# Collect posts for the unique URIs that are read in through the file handle.
###
   my $c = WWW::Connotea->new(  user => 'myusername', password => '........' );
    $c->authenticate;   ###  dies if log-in credentials are incorrect
    
    
    while (<IN0>)
{  
        my $currentURI = $_;                                    # for each unique URI
    chomp($currentURI);
    
  my @tags = $c->posts_for(uri =>"$currentURI");                # To get the posts for the unique URIs
   die "No candidate related articles\n" unless @tags;

    print FH "$currentURI\n";
    
     # To get the title directly from posts_for. It extracts the title from the posts part in the XML file, the element <title>.
     # foreach my $tag (@tags) {
     # print FH "Title: ";
     # my $zoo = $tag->title;
     # print FH $zoo;
     # print FH "\n";
     # }
     
     # To get the title indirectly from posts_for through bookmarks_for, just the element <dc:title>.
     # for my $tag (@tags) {
     # print FH "title: ";
     # my $boo = $tag->bookmark();
     # print FH $boo->title();
     # print FH "\n";
     # }
     
     # To get the title indirectly from posts_for through bookmarks_for and citations, the element <prism:title>.
     foreach my $tag (@tags) {
     print FH "title: ";
     my $boo = $tag->bookmark();
     my $zoo = $boo->citation();
     print FH $zoo->title();
     print FH "\n";
     }
     
     foreach my $tag (@tags) {
     print FH "PMID: ";
     my $boo = $tag->bookmark();
     my $foo = $boo->citation();
     for my $bar ($foo->identifiers()) {
     if ($bar =~ /PMID: (\d+)/)
         {
            print FH "$1\n";
         }
        }
        }
     
 foreach my $tag (@tags) {
     print FH "User: ";
    my $bar = $tag->user;
              
    if (ref($bar) eq "ARRAY") {
      
    foreach my $q (@$bar){
      print FH $q ,",";
    }
    } else {
      print FH $bar,",";
    }
    print FH "Tags: ";
    my $foo = $tag->tags;
              
    if (ref($foo) eq "ARRAY") {
    foreach my $p (@$foo){
      print FH $p ,",";
    }
    } else {
      print FH $foo, ",";                                       # a single tag that is not an array reference
    }
     print FH "\n";
     
}
    

}

close IN0;
close FH;

When I run this code I get output like the following for the rdf.xml file. The title and PMID are repeated, and the authors would be repeated too once I extract them (I haven't extracted authors yet). I just want each title, PMID, and author list to appear once.

http://www.ncbi.nlm.nih.gov/pubmed/17580848
title: Synthesis and evaluation of tripodal peptide analogues for cellular delivery of phosphopeptides.
title: Synthesis and evaluation of tripodal peptide analogues for cellular delivery of phosphopeptides.
PMID: 17580848
PMID: 17580848
User: guofengye,Tags: guofeg,
User: mblau3,Tags: pubmed,
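To make the goal concrete, here is a minimal sketch of the kind of de-duplication being asked for, using the common `%seen` hash idiom. It does not call WWW::Connotea at all: the `@posts` array is a hypothetical stand-in for what `posts_for` returns, and the hash keys (`title`, `pmid`, `user`, `tags`) and the helper `uniq_in_order` are names made up for illustration. In the real script the values would presumably come from `$tag->bookmark->citation->title`, the `identifiers` loop, `$tag->user`, and `$tag->tags`.

```perl
use strict;
use warnings;

# Hypothetical sample data: one record per user, with the title and PMID
# repeated in every record, as in the output above.
my $title = 'Synthesis and evaluation of tripodal peptide analogues '
          . 'for cellular delivery of phosphopeptides.';
my @posts = (
    { title => $title, pmid => '17580848', user => 'guofengye', tags => ['guofeg'] },
    { title => $title, pmid => '17580848', user => 'mblau3',    tags => ['pubmed'] },
);

# Return the items of a list in first-seen order, dropping later duplicates.
sub uniq_in_order {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Collect the repeated fields across all posts, then keep each value once.
my @titles = uniq_in_order(map { $_->{title} } @posts);
my @pmids  = uniq_in_order(map { $_->{pmid} }  @posts);

# Build the report: shared fields once, then one line per user.
my @lines;
push @lines, "title: $_" for @titles;
push @lines, "PMID: $_"  for @pmids;
for my $post (@posts) {
    push @lines, "User: $post->{user},Tags: " . join(',', @{ $post->{tags} });
}

print "$_\n" for @lines;
```

The idea is to gather the repeated values first and print them only after filtering, instead of printing inside each `foreach my $tag (@tags)` loop. The user and tag lines stay per post, since those are meant to differ.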

In reply to Extracting Unique elements by smandape1
