Extracting Unique elements

smandape1 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Experts! I am trying to extract data from the web using a Web API. The result of the web query I am extracting from is in RDF/XML format, which cannot be uploaded here, so I uploaded it to Google Docs. Kindly access the file through this link: https://docs.google.com/document/d/12sQnToF4Vzr3lKl5oxyEVwggCEEMKmeWIcnvBxUJR5g/edit?hl=en_US&authkey=CLfQkZUB

I am trying to extract the title (either dc:title or prism:title), the PMID, the users (creator) with their respective tags (subject), and the authors (foaf:name) from this file using Perl code. The rdf.xml file is just an example. If there is more than one user (creator), as shown in the file, the rest of the information (title, authors, PMID) is repeated for every user. I want to get a unique title, authors, and PMID, and remove the duplicates. I am all new to Perl programming. The Perl code is as follows:

use lib 'C:/Perl64/www-connotea-perl-0.1/lib/';

my $fn0="extracted-connotea-pubmedID-1.txt";
open (IN0, $fn0) or
    die "Can't open $fn0: $!\n";
    

open (FH, ">:utf8", 'title_pmid_users_tags.txt') or
    die "Can't open title_pmid_users_tags.txt: $!\n";

###
# Modules Used
###

use lib '../lib';
use WWW::Connotea;


###
# Stage 0: Supply log-in credentials and authenticate
###

my $currentURI;


###
# Collect posts for the unique URIs that are imported using the file handle.
###
    my $c = WWW::Connotea->new(  user => 'myusername', password => '........' );
    $c->authenticate;   ###  dies if log-in credentials are incorrect
    
    
while (<IN0>)
{
    my $currentURI = $_;    # for each unique URI
    chomp($currentURI);

    my @tags = $c->posts_for(uri => $currentURI);    # To get the posts for the unique URIs
    die "No candidate related articles\n" unless @tags;

    print FH "$currentURI\n";
    
     # foreach my $tag (@tags) {    # To get the title directly from posts_for. It extracts the title from the posts part of the XML file, the element <title>.
     # print FH "Title: ";
     # my $zoo = $tag->title;
     # print FH $zoo;
     # print FH "\n";
     # }
     
     # for my $tag (@tags) {    # To get the title indirectly from posts_for through bookmarks_for, just the element <dc:title>
     # print FH "title: ";
     # my $boo = $tag->bookmark();
     # print FH $boo->title();
     # print FH "\n";
     # }
     
     foreach my $tag (@tags) {    # To get the title indirectly from posts_for through bookmarks_for and citations, the element <prism:title>
     print FH "title: ";
     my $boo = $tag->bookmark();
     my $zoo = $boo->citation();
     print FH $zoo->title();
     print FH "\n";
     }
     
     foreach my $tag (@tags) {
         print FH "PMID: ";
         my $boo = $tag->bookmark();
         my $foo = $boo->citation();
         for my $bar ($foo->identifiers()) {
             if ($bar =~ /PMID: (\d+)/) {
                 print FH "$1\n";
             }
         }
     }
     
     foreach my $tag (@tags) {
         print FH "User: ";
         my $bar = $tag->user;

         if (ref($bar) eq "ARRAY") {
             foreach my $q (@$bar) {
                 print FH $q, ",";
             }
         } else {
             print FH $bar, ",";
         }

         print FH "Tags: ";
         my $foo = $tag->tags;

         if (ref($foo) eq "ARRAY") {
             foreach my $p (@$foo) {
                 print FH $p, ",";
             }
         } else {
             # A single tag is a plain scalar, so print it directly
             print FH $foo, ",";
         }
         print FH "\n";
     }
    

}

close IN0;
close FH;

When I use this code I get output like the following. This is the output for the rdf.xml file: the title and PMID are repeated, and if I extract the authors they will be repeated too (I haven't extracted the authors yet). I just want a unique title, PMID, and author information when I extract them.

http://www.ncbi.nlm.nih.gov/pubmed/17580848
title: Synthesis and evaluation of tripodal peptide analogues for cellular delivery of phosphopeptides.
title: Synthesis and evaluation of tripodal peptide analogues for cellular delivery of phosphopeptides.
PMID: 17580848
PMID: 17580848
User: guofengye,Tags: guofeg,
User: mblau3,Tags: pubmed,


Replies are listed 'Best First'.
Re: Extracting Unique elements by toolic (Bishop) on Jun 14, 2011 at 23:35 UTC
Use a hash: perldoc -q uniq, perldoc -q dup
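A minimal sketch of the hash idiom those perldoc entries describe, assuming the titles have already been collected into a list. The @titles data is made up for illustration (the second title is hypothetical), and this is not code from the original post:

```perl
use strict;
use warnings;

# Hypothetical input: the same title collected once per user.
my @titles = (
    'Synthesis and evaluation of tripodal peptide analogues',
    'Synthesis and evaluation of tripodal peptide analogues',
    'Some other article',
);

# $seen{$_}++ is false (0) the first time a value appears and true afterwards,
# so grep keeps only the first occurrence and preserves the original order.
my %seen;
my @unique = grep { !$seen{$_}++ } @titles;

print "$_\n" for @unique;
```

The same %seen hash can be consulted inside a loop instead of in a grep, which is what filtering "as you go" amounts to.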
Re: Extracting Unique elements by wind (Priest) on Jun 14, 2011 at 23:49 UTC
perlfaq4 - How can I remove duplicate elements from a list or array?
Re^2: Extracting Unique elements by smandape1 (Acolyte) on Jun 16, 2011 at 16:50 UTC
I tried, but I am unable to use them directly. The problem is that the title gets extracted twice when there is more than one user, because of the loop. I want to restrict the loop so that elements like the title, PMID, and list of authors are extracted only once, and I want to do it while I am extracting them. It seems that I could remove the duplicates later, but that messes everything up, because some users and tags are duplicated too and I want to keep those. Can you help, please?
Re^3: Extracting Unique elements by wind (Priest) on Jun 16, 2011 at 22:26 UTC
Just use a %seen hash, as demonstrated in the resource I linked you to. It lets you filter out duplicates as you go just as easily as removing them after the fact.
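A rough sketch of what filtering as you go could look like for this case, where the title and PMID should appear once but every user and tag line is kept. It assumes the posts can be modelled as plain hashes; the @posts structure and its field names are hypothetical stand-ins, not the WWW::Connotea objects from the question:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the posts of one URI:
# one plain hash per user who bookmarked the article.
my @posts = (
    { title => 'Synthesis and evaluation of tripodal peptide analogues',
      pmid  => '17580848', user => 'guofengye', tags => ['guofeg'] },
    { title => 'Synthesis and evaluation of tripodal peptide analogues',
      pmid  => '17580848', user => 'mblau3',    tags => ['pubmed'] },
);

my %seen;     # keys are "field:value" strings that were already emitted
my @lines;    # output is collected here, then printed

for my $post (@posts) {
    # Title and PMID: emit only the first time each value is seen.
    for my $field (qw(title pmid)) {
        my $value = $post->{$field};
        next if $seen{"$field:$value"}++;
        push @lines, "$field: $value";
    }
    # User and tags: never filtered, so repeats are kept on purpose.
    push @lines, sprintf('User: %s, Tags: %s',
        $post->{user}, join(',', @{ $post->{tags} }));
}

print "$_\n" for @lines;
```

In a script that loops over several URIs, declaring %seen inside that outer loop would reset the filter for each URI, so an article's title is printed once per URI rather than once per run.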
Re^4: Extracting Unique elements by smandape1 (Acolyte) on Jun 17, 2011 at 17:06 UTC
Re^5: Extracting Unique elements by wind (Priest) on Jun 17, 2011 at 20:00 UTC
Re^4: Extracting Unique elements by smandape1 (Acolyte) on Jun 21, 2011 at 03:09 UTC
Re^5: Extracting Unique elements by Anonymous Monk on Jun 21, 2011 at 03:15 UTC