smandape1 has asked for the wisdom of the Perl Monks concerning the following question:
Hello Experts! I am trying to extract data from the web using a Web API. The response I am working with is in RDF/XML format, which cannot be uploaded here, so I put it on Google Docs. Please access the file through this link: https://docs.google.com/document/d/12sQnToF4Vzr3lKl5oxyEVwggCEEMKmeWIcnvBxUJR5g/edit?hl=en_US&authkey=CLfQkZUB I am trying to extract the title (either dc:title or prism:title), the PMID, the users (creator) with their respective tags (subject), and the authors (foaf:name) from this file using Perl. The rdf.xml file is just an example. If there is more than one user (creator), as in the file, the title, authors, and PMID are repeated for every user. I want to get the unique title, authors, and PMID and remove the duplicates. I am new to Perl programming. The Perl code is as follows:
    use strict;
    use warnings;
    use lib 'C:/Perl64/www-connotea-perl-0.1/lib/';

    my $fn0 = "extracted-connotea-pubmedID-1.txt";
    open (IN0, $fn0) or die "Can't open $fn0: $!\n";
    open (FH, ">:utf8", 'title_pmid_users_tags.txt')
        or die "Can't open title_pmid_users_tags.txt: $!\n";

    ###
    # Modules Used
    ###
    use lib '../lib';
    use WWW::Connotea;

    ###
    # Stage 0: Supply log-in credentials and authenticate
    ###
    my $c = WWW::Connotea->new( user => 'myusername', password => '........' );
    $c->authenticate;    ### dies if log-in credentials are incorrect

    ###
    # Collect posts for the unique URIs that are read from the input file handle.
    ###
    while (<IN0>) {
        my $currentURI = $_;                              # for each unique URI
        chomp($currentURI);
        my @tags = $c->posts_for(uri => "$currentURI");   # To get the posts for the unique URIs
        die "No candidate related articles\n" unless @tags;
        print FH "$currentURI\n";

        # To get the title directly from posts_for. It extracts the title
        # from the posts part of the XML file, the element <title>.
        # foreach my $tag (@tags) {
        #     print FH "Title: ";
        #     my $zoo = $tag->title;
        #     print FH $zoo;
        #     print FH "\n";
        # }

        # To get the title indirectly from posts_for through bookmarks_for,
        # just the element <dc:title>.
        # for my $tag (@tags) {
        #     print FH "title: ";
        #     my $boo = $tag->bookmark();
        #     print FH $boo->title();
        #     print FH "\n";
        # }

        # To get the title indirectly from posts_for through bookmarks_for
        # and citations, the element <prism:title>.
        foreach my $tag (@tags) {
            print FH "title: ";
            my $boo = $tag->bookmark();
            my $zoo = $boo->citation();
            print FH $zoo->title();
            print FH "\n";
        }

        foreach my $tag (@tags) {
            print FH "PMID: ";
            my $boo = $tag->bookmark();
            my $foo = $boo->citation();
            for my $bar ($foo->identifiers()) {
                if ($bar =~ /PMID: (\d+)/) {
                    print FH "$1\n";
                }
            }
        }

        foreach my $tag (@tags) {
            print FH "User: ";
            my $bar = $tag->user;
            if (ref($bar) eq "ARRAY") {
                foreach my $q (@$bar) {
                    print FH $q, ",";
                }
            }
            else {
                print FH $bar, ",";
            }
            print FH "Tags: ";
            my $foo = $tag->tags;
            if (ref($foo) eq "ARRAY") {
                foreach my $p (@$foo) {
                    print FH $p, ",";
                }
            }
            else {
                print FH $foo, ",";    # a single tag, not an array reference
            }
            print FH "\n";
        }
    }
    close IN0;
    close FH;
When I run this code on the rdf.xml file, I get output like the following. The title and PMID are repeated, and the authors will be repeated too once I extract them (I haven't extracted authors yet). I want only the unique title, PMID, and author information.
    http://www.ncbi.nlm.nih.gov/pubmed/17580848
    title: Synthesis and evaluation of tripodal peptide analogues for cellular delivery of phosphopeptides.
    title: Synthesis and evaluation of tripodal peptide analogues for cellular delivery of phosphopeptides.
    PMID: 17580848
    PMID: 17580848
    User: guofengye,Tags: guofeg,
    User: mblau3,Tags: pubmed,
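The result being asked for can be sketched with a `%seen` hash, a common Perl idiom for dropping duplicates while keeping first-seen order. This is only a minimal sketch under assumptions: the `@posts` records, their keys, and the helpers `unique_in_order` and `summarize_posts` are hypothetical stand-ins rather than part of WWW::Connotea. In the real script the values would come from the method calls already used in the code above (`$tag->bookmark()`, `$boo->citation()`, `$tag->user`, `$tag->tags`).

```perl
use strict;
use warnings;

# Hypothetical stand-in for what posts_for() yields: one record per user,
# with the same title and PMID repeated in every record.
my $title = 'Synthesis and evaluation of tripodal peptide analogues '
          . 'for cellular delivery of phosphopeptides.';
my @posts = (
    { title => $title, pmid => '17580848', user => 'guofengye', tags => ['guofeg'] },
    { title => $title, pmid => '17580848', user => 'mblau3',    tags => ['pubmed'] },
);

# Return the arguments with duplicates removed, keeping first-seen order.
sub unique_in_order {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Build the report lines for one URI: each distinct title and PMID once,
# followed by one line per user with that user's tags.
sub summarize_posts {
    my @records = @_;
    my @lines;
    push @lines, "title: $_" for unique_in_order(map { $_->{title} } @records);
    push @lines, "PMID: $_"  for unique_in_order(map { $_->{pmid} }  @records);
    for my $record (@records) {
        push @lines, sprintf('User: %s, Tags: %s',
            $record->{user}, join(',', @{ $record->{tags} }));
    }
    return @lines;
}

print "$_\n" for summarize_posts(@posts);
```

The same `unique_in_order` call could be applied to the author names once they are extracted, since they repeat per user in the same way as the title and PMID.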
Replies are listed 'Best First'.
Re: Extracting Unique elements
by toolic (Bishop) on Jun 14, 2011 at 23:35 UTC

Re: Extracting Unique elements
by wind (Priest) on Jun 14, 2011 at 23:49 UTC
by smandape1 (Acolyte) on Jun 16, 2011 at 16:50 UTC
by wind (Priest) on Jun 16, 2011 at 22:26 UTC
by smandape1 (Acolyte) on Jun 17, 2011 at 17:06 UTC
by wind (Priest) on Jun 17, 2011 at 20:00 UTC
by smandape1 (Acolyte) on Jun 21, 2011 at 03:09 UTC
by Anonymous Monk on Jun 21, 2011 at 03:15 UTC