semio has asked for the wisdom of the Perl Monks concerning the following question:

fellow monks,

I've recently had a need to work with XML. This is my first trek into it, so please bear with me. After reviewing some of the previous posts on this site, I decided to use the XML::Parser module. I started by simply trying to obtain and print the value of the "say" element.

The XML:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <monk value="PM">
      <say>JAPH</say>
      <vals>
        <val val1="F" val2="value f"> </val>
        <val val1="FO" val2="value fo"> </val>
        <val val1="FOO" val2="value foo"> </val>
        <val val1="FOOB" val2="value foob"> </val>
        <val val1="FOOBA" val2="value fooba"> </val>
        <val val1="FOOBAR" val2="value foobar"> </val>
      </vals>
    </monk>

The code:

    #!perlenv -w
    use strict;
    use XML::Parser;

    my $xp;
    my $japh;

    $xp = new XML::Parser(
        Handlers => {
            Start => \&start_handler,
            End   => \&end_handler,
            Char  => \&char_handler
        }
    );

    if ( $#ARGV < 0 ) {
        print "usage: blah <xml file>";
        exit;
    }

    $xp->parsefile( $ARGV[0] );

    sub start_handler {
        my ( $xp, $elem ) = @_;
        if ( $elem eq 'say' ) {
            $japh = 1;
        }
    }

    sub end_handler {
        my ( $xp, $elem ) = @_;
        if ( $elem eq 'say' ) {
            $japh = 0;
        }
    }

    sub char_handler {
        my ( $xp, $str ) = @_;
        if ($japh) {
            $japh = $str;
            print $japh . "\n";
        }
    }

The overall goal, however, is to print the value of "say" only if a val1 attribute equal to "FOO" can be found. After a few days of unsuccessful attempts at this, I went to regex just to have a working solution. The following code snippet gives me exactly what I need:

    if ( $string[0] =~ /xml/ ) {
        foreach $string (@string) {
            if ( $string =~ m/<say>/ ) {
                $say = $string;
            }
            if ( $string =~ m/(<val val1="$val1")/ ) {
                $say =~ s/\s<say>//;
                $say =~ s/<\/say>//;
                print $say . "\n";
                $found = 1;
            }
            if ( $string =~ m/<\/monk>/ ) {
                $say = "";
            }
        }
    }

My questions are as follows: Given that the regex gives me exactly what I need (in this particular case), should I be concerned about not using an XML parser? Is XML::Parser the right module to use in a case such as this? Just to be proper, I would like to use an XML parser when working with data such as this. Any suggestions that point me in the right direction will be greatly appreciated.

cheers, -semio

Edit by tye to add READMORE tag

Re: XML::parser question
by Shendal (Hermit) on Oct 22, 2002 at 22:18 UTC
    I find that XML::Simple is very well suited to this sort of task. The following may fit your needs:

        #!/usr/bin/perl -w
        use strict;
        use XML::Simple;

        my $xml = << "EOF";
        <?xml version="1.0" encoding="ISO-8859-1"?>
        <monk value="PM">
          <say>JAPH</say>
          <vals>
            <val val1="F" val2="value f"> </val>
            <val val1="FO" val2="value fo"> </val>
            <val val1="FOO" val2="value foo"> </val>
            <val val1="FOOB" val2="value foob"> </val>
            <val val1="FOOBA" val2="value fooba"> </val>
            <val val1="FOOBAR" val2="value foobar"> </val>
          </vals>
        </monk>
        EOF

        my $xs = new XML::Simple;
        my $xmlhref = $xs->XMLin($xml);

        if ($xmlhref->{say}) {
            print $xmlhref->{say} . "\n"
                if (grep { $_->{val1} eq 'FOO' } @{$xmlhref->{vals}->{val}});
        }

    Cheers,
    Shendal

      I agree, XML::Simple is a nice module for quickly parsing simple XML data.

      Of course, a question that may need to be answered is, "Why use a parser over a regexp?" Although the regular expression may do the job for you now, it may break in the future should the format of the incoming document change. No doubt you could extend your own routine to the point of being a full-featured parser, but that has already been done for you in the form of several very good parsing modules. (Of course, if you can improve the world by writing a new parser, don't hesitate to do so!)

      Though the overhead of running a full-fledged parser may be a little greater than an ad hoc solution, it might save you some maintenance headaches later.
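
      To make the "format change" risk concrete, here is a minimal sketch using only core Perl. The helper name regex_finds_val1 and the three sample strings are made up for illustration; the pattern is the one from the original post with the attribute value interpolated. All three strings describe the same element as far as XML is concerned, so a parser would treat them alike, while the regex depends on the exact spelling.

```perl
use strict;
use warnings;

# Hypothetical helper: the regex test from the original post, reduced to
# "does this text contain <val val1="..."> spelled exactly this way?"
sub regex_finds_val1 {
    my ($xml, $wanted) = @_;
    return $xml =~ /<val val1="\Q$wanted\E"/ ? 1 : 0;
}

# Three spellings of the same element (sample data, not from a real feed).
my %variants = (
    'as posted'     => q{<val val1="FOO" val2="value foo"> </val>},
    'single quotes' => q{<val val1='FOO' val2='value foo'> </val>},
    'reordered'     => q{<val val2="value foo" val1="FOO"> </val>},
);

for my $name (sort keys %variants) {
    printf "%-13s %s\n", $name,
        regex_finds_val1($variants{$name}, 'FOO') ? 'matched' : 'missed';
}
```

      Only the spelling used in the original file matches; the other two are valid XML with the same meaning but slip past the pattern.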

Re: XML::parser question
by grantm (Parson) on Oct 22, 2002 at 23:54 UTC

    This snippet uses XML::Simple to solve your problem as I understand it.

        use XML::Simple;

        my $monk = XMLin('./monk.xml',
            keyattr    => {val => 'val1'},
            forcearray => ['val']
        );

        if($monk->{vals}->{val}->{FOO}) {
            print $monk->{say}, "\n";
        }

    That's not to say XML::Simple will handle all your XML parsing needs but it can do the simple stuff. For more complex stuff the options boil down to using a SAX parser or an XPath-capable DOM module (or XML::Twig for a more Perlish API).

    As you found, maintaining state when you use XML::Parser's handler API involves storing stuff in global variables (or using a closure if you're that way inclined). This is one of the reasons that the XML::Parser API is effectively deprecated in favour of XML::SAX which uses a cleaner object oriented style. However, with SAX you'll have the same problem you found with XML::Parser - having to write pages of code to perform a seemingly simple task.
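
    For what the closure approach might look like, here is a rough sketch. The wrapper say_if_val1 and the trimmed sample document are assumptions for illustration; the XML::Parser calls (new with Handlers, parse on a string) are the same API used in the original post. The state variables are lexicals captured by anonymous handler subs, so nothing needs to be global.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns the text of <say> if any element carries
# val1 equal to $wanted, otherwise undef. State lives in lexicals that
# the handler closures capture.
sub say_if_val1 {
    my ($xml, $wanted) = @_;
    my ($in_say, $say, $found) = (0, '', 0);

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my (undef, $elem, %attrs) = @_;
                $in_say = 1 if $elem eq 'say';
                $found  = 1
                    if defined $attrs{val1} && $attrs{val1} eq $wanted;
            },
            End => sub {
                my (undef, $elem) = @_;
                $in_say = 0 if $elem eq 'say';
            },
            # Char can fire more than once per text node, so append.
            Char => sub {
                my (undef, $str) = @_;
                $say .= $str if $in_say;
            },
        },
    );
    $parser->parse($xml);
    return $found ? $say : undef;
}

# Trimmed copy of the sample document from the question.
my $xml = <<'EOF';
<monk value="PM">
  <say>JAPH</say>
  <vals>
    <val val1="FO" val2="value fo"> </val>
    <val val1="FOO" val2="value foo"> </val>
  </vals>
</monk>
EOF

my $say = say_if_val1($xml, 'FOO');
print "$say\n" if defined $say;
```

    Deferring the decision until the parse has finished also means the result does not depend on whether say appears before or after vals.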

    The other option is to use XML::XPath or XML::LibXML to slurp the XML file into a DOM tree which you can query with XPath statements. eg:

        use XML::XPath;
        use XML::XPath::XMLParser;

        my $xp = XML::XPath->new(filename => './monk.xml');

        my $nodeset = $xp->find('/monk[./vals/val[@val1 = "FOO"]]');

        foreach my $node ($nodeset->get_nodelist) {
            my $say = $xp->findvalue('./say', $node);
            print "$say\n";
        }

    XPath is kind of a regex for XML syntax. The $xp->find statement returns a list of nodes which match the XPath expression (in this case all 'monk' elements which contain a 'val' element in a 'vals' element, where the 'val' element has a 'val1' attribute equal to the string 'FOO'). The $xp->findvalue is then used to extract the contents of the returned node's 'say' child element.
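
    The same query can probably be written against XML::LibXML, the other DOM module mentioned above. This is a sketch rather than a tested drop-in: the helper say_values and the inline sample document are assumptions, and load_xml requires a reasonably recent XML::LibXML.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed copy of the sample document from the question.
my $xml = <<'EOF';
<monk value="PM">
  <say>JAPH</say>
  <vals>
    <val val1="FO" val2="value fo"> </val>
    <val val1="FOO" val2="value foo"> </val>
  </vals>
</monk>
EOF

# Hypothetical helper: returns the <say> text of every <monk> that has
# a <val> whose val1 attribute equals "FOO".
sub say_values {
    my ($source) = @_;
    my $doc = XML::LibXML->load_xml(string => $source);
    return map { $_->findvalue('./say') }
        $doc->findnodes('/monk[vals/val[@val1 = "FOO"]]');
}

print "$_\n" for say_values($xml);
```

    The XPath expression is the same as in the XML::XPath version apart from dropping the redundant leading "./" inside the predicate.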

    For more info, see the Perl-XML FAQ.

    Using regexes is not a great idea because although you can create something that works in simple cases it's really hard to cover all your bases (eg: what if the XML contains a numeric character entity or is UTF-16 encoded). To see just how hard it is to do it right, take a look at the source of XML::SAX::PurePerl.
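
    As a small illustration of the character entity case, the sketch below feeds one made-up element to both the regex and XML::Parser. The sample string is an assumption; "&#70;" is the numeric character reference for "F", so the attribute value is "FOO" once decoded.

```perl
use strict;
use warnings;
use XML::Parser;

# Made-up sample: val1 is "FOO", with the F written as a numeric
# character reference.
my $xml = q{<vals><val val1="&#70;OO" val2="value foo"> </val></vals>};

# The regex from the question looks at the raw text only.
my $regex_hit = $xml =~ /<val val1="FOO"/ ? 1 : 0;

# The parser hands the Start handler already-decoded attribute values.
my $parser_hit = 0;
XML::Parser->new(
    Handlers => {
        Start => sub {
            my (undef, $elem, %attrs) = @_;
            $parser_hit = 1
                if defined $attrs{val1} && $attrs{val1} eq 'FOO';
        },
    },
)->parse($xml);

print "regex:  ", ($regex_hit  ? 'matched' : 'missed'), "\n";
print "parser: ", ($parser_hit ? 'matched' : 'missed'), "\n";
```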

      It's important to note that XML::Simple has been ported to the XML::SAX API (as XML::SAX::Simple). This means you can get the ease of use of XML::Simple with the benefits of XML::SAX. One of the chief benefits is that you can use other parsers besides expat (the parser XML::Parser uses). expat is notoriously inefficient as an XML parser. libxml2 (the C library at the core of XML::LibXML) is much faster, and has already been ported to the XML::SAX API. So if you used XML::LibXML as your parsing library and XML::SAX::Simple as your parser, you could get a very fast, very easy solution to this problem.

        Actually, you don't need XML::SAX::Simple - matts put that quick hack together while I was integrating SAX support directly into XML::Simple. Version 1.08_01 of XML::Simple supports SAX natively. It can act as a handler in the way you suggest, it can also drive a SAX pipeline from a Perl data structure and it can do both at the same time for filtering.

Re: XML::parser question
by mirod (Canon) on Oct 23, 2002 at 07:14 UTC

    Here is a solution using XML::Twig. Note that it relies on say being defined before vals, so that when the handler on val is called the say element is already available:

        #!/usr/bin/perl -w -l
        use strict;
        use XML::Twig;

        my $t = XML::Twig->new(
            twig_handlers => {
                'val[@val1="FOO"]' => sub {
                    print $_->parent('monk')->field('say');
                }
            }
        )->parse( \*DATA );

        __DATA__
        <?xml version="1.0" encoding="ISO-8859-1"?>
        <monk value="PM">
          <say>JAPH</say>
          <vals>
            <val val1="F" val2="value f"> </val>
            <val val1="FO" val2="value fo"> </val>
            <val val1="FOO" val2="value foo"> </val>
            <val val1="FOOB" val2="value foob"> </val>
            <val val1="FOOBA" val2="value fooba"> </val>
            <val val1="FOOBAR" val2="value foobar"> </val>
          </vals>
        </monk>

Re: XML::parser question
by ktingle (Sexton) on Oct 23, 2002 at 14:12 UTC
    I have been using XML for almost 2 years now. I have found that many of my day-to-day tasks are best handled with a simple XPath statement. I use this tool for coming up with the correct statements:

    http://sourceforge.net/projects/xmltree

    I have been using MSXML3 mostly, but I think XML::XPath could be used like so:

        use XML::XPath;
        use XML::XPath::XMLParser;

        my $xp = XML::XPath->new(filename => 'monks.xml');

        my $nodeset = $xp->find('/monk/say/text()'); # what the monks say

        foreach my $node ($nodeset->get_nodelist) {
            $_ = XML::XPath::XMLParser::as_string($node);
            print "$_\n" if(/FOO/);
        }

      This tutorial is a quick way to get up to speed in XPath:

      XPath Tutorial
Re: XML::parser question
by princepawn (Parson) on Oct 23, 2002 at 14:10 UTC
Re: XML::parser question
by user2048 (Novice) on Oct 24, 2002 at 13:51 UTC
    Your code, changed to print all the say text so far every time a val1 attribute with the value FOO is encountered:

        #!perlenv -w
        use strict;
        use XML::Parser;

        my $in_say = 0;
        my $say = '';

        my $xp = new XML::Parser(
            Handlers => {
                Start => \&start_handler,
                End   => \&end_handler,
                Char  => \&char_handler
            }
        );

        if ( $#ARGV < 0 ) {
            print "usage: blah <xml file>";
            exit;
        }

        $xp->parsefile( $ARGV[0] );

        sub start_handler {
            my ( undef, $elem, %attrs ) = @_;
            if ( $elem eq 'say' ) {
                $in_say = 1;
            }
            my $val1 = $attrs{val1};
            if (defined $val1 && $val1 eq 'FOO') {
                print $say;
            }
        }

        sub end_handler {
            my ( undef, $elem ) = @_;
            if ( $elem eq 'say' ) {
                $in_say = 0;
            }
        }

        sub char_handler {
            my ( undef, $str ) = @_;
            if ($in_say) {
                $say .= $str;
            }
        }

    Also changed: the global $xp is no longer hidden by a local (my) $xp in every handler.