
Ever since I got acquainted with XPath syntax (finally! Why did I wait so long??) and the really excellent libxml2 library (which has a thorough and well-documented Perl wrapper, XML::LibXML), I've been having a lot more fun pulling stuff out of XML streams.

Below is a little Perl script that uses XML::LibXML and its XPath abilities to provide a generic command-line method for extracting any specific content from an XML file, so long as you can provide the XPath expression for the content you want. Given that script, the particular task stated in the OP can be accomplished with this command line (assuming the XML data has the required closing tag, as mentioned in a previous reply, and is stored in a file called "test.xml"):

   exp  -p "//info_name | //it_size"  test.xml

# output:
FZGA34177.b1
35000
FZGA34178.b1
12000
FZGA34179.b1
7000
FZGA34180.b1
3000
FZGA34181.b1
7000
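
The names and sizes alternate because a union expression (the "|") gives back a single node list in document order, not all of one kind followed by all of the other. Here is a minimal sketch of that behaviour. Note that the sample data is made up: only the element names info_name and it_size come from the command above, and the records/record wrapper is my own assumption about the layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data: only the element names info_name and it_size
# come from the command line above; the surrounding structure is invented.
my $sample = <<'XML';
<records>
  <record><info_name>FZGA34177.b1</info_name><it_size>35000</it_size></record>
  <record><info_name>FZGA34178.b1</info_name><it_size>12000</it_size></record>
</records>
XML

my $doc = XML::LibXML->load_xml( string => $sample );

# The union of the two paths comes back as one list, in document order,
# so each name is followed by the size from the same record.
my @values = map { $_->textContent } $doc->findnodes('//info_name | //it_size');

print "$_\n" for @values;
```

If you would rather have all the names first and then all the sizes, running two separate queries is probably the simplest way to get that.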

There's a pretty good reference for XPath usage here: http://www.w3schools.com/XPath/default.asp. The code for my "exp" utility is pretty simple:

#!/usr/bin/perl

use strict;
use XML::LibXML;
use Getopt::Long;
binmode STDOUT,":utf8";

my $Usage = "Usage:  $0 [-x] [-r] -p xpath_spec file.xml\n";
my %opt;
die $Usage unless ( GetOptions( \%opt, 'x', 'r', 'p=s' ) and
                    @ARGV == 1 and -f $ARGV[0] and $opt{p} =~ /\w/ );
my $xmlfile = shift;

my $xml = XML::LibXML->new;
my $doc;
if ( ! $opt{r} ) {
    $doc = $xml->parse_file( $xmlfile );
}
else {
    my $xmlstr = "<EXP_ROOT_$$>";
    $opt{p} = "/EXP_ROOT_$$" . $opt{p};
    {
        local $/;
        open( X, '<:utf8', $xmlfile ) or die "Unable to read $xmlfile: $!\n";
        $xmlstr .= <X>;
        close X;
    }
    $xmlstr .= "</EXP_ROOT_$$>";
    $doc = $xml->parse_string( $xmlstr );
}
my $pth = XML::LibXML::XPathContext->new( $doc );
for my $n ( $pth->findnodes( $opt{p} )) {
    if ( $opt{x} ) {
        print $n->toString, "\n";
    } else {
        print $n->textContent, "\n";
    }
}

=head1 NAME

exp -- extract XPath matches from XML data

=head1 SYNOPSIS

 exp [-r] [-x] -p xpath_spec file.xml

  -r : supply a root node for the xml stream
  -x : output the matching content as xml elements

=head1 DESCRIPTION

This program will print portions (if any) from an XML file that match
a given XPath specifier.  

=head1 AUTHOR

David Graff <graff@ldc.upenn.edu>

=cut
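
To make the two option flags a bit more concrete, here is a small stand-alone sketch of what they amount to. It is not a piece of the script itself: the input string and the "wrapper" element name are hypothetical, chosen only for illustration. The first half shows why -r is needed for data with no single root element, and the second half shows the difference between the default output (textContent) and the -x output (toString).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;

# Hypothetical rootless input: two top-level elements, so not well-formed XML.
my $fragment = '<item><name>widget</name><size>7</size></item>'
             . '<item><name>gadget</name><size>9</size></item>';

# Parsing it as-is dies, because a document needs exactly one root element.
my $parsed_bare = eval { $parser->parse_string($fragment); 1 } ? 1 : 0;

# Wrapping it in a throwaway root (the idea behind -r) makes it parseable.
my $doc = $parser->parse_string("<wrapper>$fragment</wrapper>");
my ($first) = $doc->findnodes('/wrapper/item[1]');

my $as_text = $first->textContent;  # default: text of all descendants, run together
my $as_xml  = $first->toString;     # -x: the element serialized with its markup

print "bare fragment parsed: $parsed_bare\n";
print "text: $as_text\n";
print "xml:  $as_xml\n";
```

Since textContent runs the text of all child elements together, the default output is most useful when the XPath expression points at leaf elements; for anything with nested structure, -x is likely the better choice.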

In reply to Re: Parse XML and compare with Fasta in Perl by graff
in thread Parse XML and compare with Fasta in Perl by ad23
