parse xml from pubmed without attribute

HHCHANG has asked for the wisdom of the Perl Monks concerning the following question:

I want to parse XML from PubMed.

The XML file comes from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=1766380&report=xml&format=text

I can read it into a hash that counts how many times each value occurs.

This is my Perl script:

#!/usr/bin/perl

use strict;
use warnings;

# Modules
use XML::Simple;
use Data::Dumper;

# Maps each leaf value found in the document to the number of times it occurs.
our %pubmed_data;

my $xml = XML::Simple->new( KeyAttr => [] );

my $data = $xml->XMLin("data1.txt");

traverse($data);

# Recursively walk the structure returned by XMLin and count every leaf value.
# XML::Simple stores attributes and child elements in the same hash, so
# attribute values are counted here as well.
sub traverse {
    my ($element) = @_;
    if ( ref($element) eq 'HASH' ) {
        traverse( $element->{$_} ) foreach keys %$element;
    }
    elsif ( ref($element) eq 'ARRAY' ) {
        traverse($_) foreach @$element;
    }
    else {
        $pubmed_data{$element}++;
    }
}

However, the XML contains many attributes that I don't want. For example:

<AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Miller</LastName>
                    <ForeName>S I</ForeName>
                    <Initials>SI</Initials>
                </Author>
</AuthorList>

I just want the element values: Miller, S I, SI. I don't need the attribute values CompleteYN="Y" and ValidYN="Y".

Any help would be great. Thanks in advance!


Replies are listed 'Best First'.
Re: parse xml from pubmed without attribute by Marshall (Canon) on Sep 21, 2013 at 06:26 UTC
"I want to parse xml from pubmed." This is not the right idea. Pubmed provides a huge amount of software to access their site in C, C++ and Perl. toolkits: http://www.ncbi.nlm.nih.gov/guide/howto/dwn-software/ Bio-Perl Toolkit: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC187536/ There is a whole bunch of this stuff available for Perl and C. There is so much stuff available that it might take a few days to discover and understand the options. I recommend that you look at the free-ware that is available and see what you can do with it. I just gave a few links to get started. There are many more.	[reply]
Re: parse xml from pubmed without attribute by Anonymous Monk on Sep 21, 2013 at 05:07 UTC
"Any help would be great, Thanks in advance!"

Forget about XML::Simple; use XML::Smart, XML::Twig, XML::LibXML or Mojo::DOM.

And now my link dump of examples, docs and tutorials ... because xml::parser is low level, you should parse html/xml with xpath/twig/dom:

Re: How to grab a portion of file with regex (don't)
HTML Parser suggestions

See also the real discouragement, Oh Yes You Can Use Regexes to Parse HTML!, and the real encouragement, Re^2: parsing XML fragments (xml log files) with... a regex

How do I match XML, HTML, or other nasty, ugly things with a regex?
How do I remove HTML from a string?
Re: Parsing webpages

See also htmltreexpather.pl and xpather.pl.

htmltreexpather.pl:
Parsing HTML / Re^4: Parsing HTML
A regex question
NASA's Astronomy Picture of the Day / Re: NASA's Astronomy Picture of the Day
Re: Extracting HTML content between the h tags
Re^2: Help With Online Table Scraper
Re^4: web::scraper using an xpath
....

xpather.pl:
Re: Get Node Value from irregular XML (xpather.pl)
Re: Having trouble with siblings
Re^2: XML parsing and Lists
Re: Counting number of child nodes based on element value (typos)
Re^3: Extracting specific childnodes (xpath whitespace)
Re^3: Extracting specific childnodes (play xmllint --shell )
Re: How do i get value of an element if the next elememnt has specific value in XML::LibXML using Xpath?
Re: How to parse xml with namespase vale in XMl:LibXML? ( XPath error : Undefined namespace prefix )
Re^2: How to parse xml with namespase vale in XMl:LibXML? (xmllint --shell setns / xpathtester)

There is a better way :)
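
To make the XML::LibXML suggestion concrete, here is a minimal sketch of one way to count only the element text and never the attributes. It assumes XML::LibXML is installed and uses the AuthorList fragment from the question as inline sample data. The helper name count_text_values is made up for this illustration and is not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Sample fragment copied from the question. For the saved PubMed file,
# load_xml( location => 'data1.txt' ) could be used instead.
my $xml = <<'XML';
<AuthorList CompleteYN="Y">
    <Author ValidYN="Y">
        <LastName>Miller</LastName>
        <ForeName>S I</ForeName>
        <Initials>SI</Initials>
    </Author>
</AuthorList>
XML

# Hypothetical helper: returns a hash ref mapping each text value
# to the number of times it occurs in the document.
sub count_text_values {
    my ($doc) = @_;
    my %count;

    # The XPath //text() selects text nodes only, so attribute values
    # such as CompleteYN="Y" are never visited.
    for my $node ( $doc->findnodes('//text()') ) {
        my $value = $node->nodeValue;
        next unless $value =~ /\S/;    # skip whitespace-only indentation
        $value =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
        $count{$value}++;
    }
    return \%count;
}

my $doc   = XML::LibXML->load_xml( string => $xml );
my $count = count_text_values($doc);

print "$_ => $count->{$_}\n" for sort keys %$count;
```

For the fragment above this should report Miller, S I and SI once each, with no entry for Y. The same idea works with XML::Twig; XPath is used here only because it keeps the "text nodes, not attributes" rule down to a single expression.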