OK, it's time for the XML exercise of the day.

So that you can compare the programming styles of some of the most common XML modules, here are solutions using XML::Parser, XML::Twig, XML::PYX and XML::XPath. Choose the style you prefer, TMTOWTDI!

So first a solution using XML::Parser:

#!/bin/perl -w
use strict;
use XML::Parser;

# global, but could be attached to the parser or passed to the handlers
my @results;

# we need those to hold info about the parsing
my( $elt_id, $child_seen, $in_child, $child_text);

my $p= XML::Parser->new( Handlers => { Start => \&start,   # called when a start tag is found
                                       End   => \&end,     # called when an end tag is found
                                       Char  => \&char,    # called when characters are found
                                     },
                       );

$p->parse( \*DATA);              # use parsefile to parse... a file

print join "\n", @results;
print "\n";

sub start
  { my( $p, $elt, %atts)= @_;    # set by XML::Parser::Expat
    if( $elt eq 'elt')           # we found an elt start tag
      { $child_seen= 0;          # reset the flag, no child found yet
        $elt_id= $atts{id};      # store it in case we need it
      }
    elsif( $elt eq 'child')      # found a child start tag
      { $child_seen= 1;          # we've seen a child
        $in_child= 1;            # we are in the child
        $child_text= '';         # reset the child text
      }
  }

sub end
  { my( $p, $elt)= @_;
    if( $elt eq 'elt')           # found an elt end tag
      { if( $child_seen)
          { push @results, $child_text; }
        else
          { push @results, "missing child for elt $elt_id"; }
      }
    elsif( $elt eq 'child')      # found a child end tag
      { $in_child= 0; }          # Toto, I guess we are not in the child any more
  }

sub char                         # called for all non mark-up text
  { my( $p, $string)= @_;
    $child_text .= $string if( $in_child);   # see the docs for why you can't
  }                                          # just write $child_text = $string

__DATA__
<doc>
  <dummy>I am a </dummy>
  <elt id="elt1"><child_1/><child>child 1</child></elt>
  <elt id="elt2"><child_1/></elt>
  <elt id="elt3"><child_1/><child>child 2</child></elt>
  <elt id="elt4"><child_1/></elt>
</doc>

Then the obligatory XML::Twig plug:

#!/bin/perl -w
use strict;
use XML::Twig;

my @results;                     # does not have to be global, it's just easier

# create the twig; see the docs for why to use TwigRoots
my $t= XML::Twig->new( TwigRoots => { elt => \&check_elt });   # call check_elt every time an element elt is parsed

$t->parse( \*DATA);              # parse the XML (use parsefile to parse... a file)

print join "\n", @results;
print "\n";

sub check_elt
  { my( $t, $elt)= @_;           # $t is the XML::Twig object
                                 # $elt is an XML::Twig::Elt object
    if( my $child= $elt->first_child( 'child'))   # that's how you navigate the element
      { push @results, $child->text; }            # text includes sub elements of child
    else
      { push @results, "missing child for elt " . $elt->att( 'id'); }
    $t->purge;                   # call only if your document is huge
  }                              # to free the memory

__DATA__
<doc>
  <dummy>I am a </dummy>
  <elt id="elt1"><child_1/><child>child 1</child></elt>
  <elt id="elt2"><child_1/></elt>
  <elt id="elt3"><child_1/><child>child 2</child></elt>
  <elt id="elt4"><child_1/></elt>
</doc>

Now the XML::PYX way. To run this one you need the document in a file, say doc.xml, and then run pyx doc.xml | perl test_pyx.

#!/bin/perl -n -w
use strict;

# global, but could be attached to the parser or passed to the handlers
use vars qw( @results);

# we need those to hold info about the parsing
# @in_element is a stack of open elements,
# the current element is $in_element[-1]
use vars qw( @in_element $elt_id $child_seen $child_text);

if( m/^\((.*)$/)                 # element start tag   (tag
  { push @in_element, $1;
    if( $1 eq 'elt')             # elt start tag
      { $child_seen= 0; }        # reset the flag
    elsif( $1 eq 'child')        # child start tag
      { $child_seen= 1;          # set the flag
        $child_text= '';         # reset the text
      }
  }
elsif( m/^A([^\s]*) (.*)$/)      # attribute           Aatt value
  { # store the id for elt elements
    $elt_id= $2 if( ($in_element[-1] eq 'elt') && ($1 eq 'id'));
  }
elsif( m/^-(.*)\n/)              # text                -text
  { $child_text.= $1 if( $in_element[-1] eq 'child'); }
elsif( m/^\)(.*)$/)              # end tag             )tag
  { pop @in_element;             # the element is closed, take it off the stack
    if( $1 eq 'elt')
      { if( $child_seen)
          { push @results, $child_text; }
        else
          { push @results, "missing child for elt $elt_id"; }
      }
  }

END
  { print join "\n", @results;
    print "\n";
  }
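
If you have never looked at PYX output, here is a minimal sketch of the line format the filter above relies on: one event per line, with ( for a start tag, A for an attribute, - for text and ) for an end tag. The lines in @pyx_lines are hand-written for the first two elt elements of the sample document (an assumption about what pyx prints, not captured output), and pyx_results is a hypothetical helper that applies the same rules to a list of lines instead of STDIN. It pops the stack on every end tag so that the last entry is always the current element.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-written PYX lines (assumed, not captured from the pyx tool):
# one event per line, whitespace-only text lines left out for brevity.
my @pyx_lines = (
    '(doc',
    '(elt', 'Aid elt1', '(child_1', ')child_1',
            '(child', '-child 1', ')child', ')elt',
    '(elt', 'Aid elt2', '(child_1', ')child_1', ')elt',
    ')doc',
);

# pyx_results is a hypothetical helper name, not part of XML::PYX.
sub pyx_results {
    my @lines = @_;
    my( @results, @open, $elt_id, $child_seen, $child_text);

    foreach my $line (@lines) {
        if( $line =~ m/^\((.*)$/) {               # start tag: (tag
            push @open, $1;
            if(    $1 eq 'elt')   { $child_seen = 0; }
            elsif( $1 eq 'child') { $child_seen = 1; $child_text = ''; }
        }
        elsif( $line =~ m/^A(\S+) (.*)$/) {       # attribute: Aname value
            $elt_id = $2 if $open[-1] eq 'elt' && $1 eq 'id';
        }
        elsif( $line =~ m/^-(.*)$/) {             # text: -text
            $child_text .= $1 if $open[-1] eq 'child';
        }
        elsif( $line =~ m/^\)(.*)$/) {            # end tag: )tag
            pop @open;                            # keep the stack accurate
            if( $1 eq 'elt') {
                push @results, $child_seen ? $child_text
                                           : "missing child for elt $elt_id";
            }
        }
    }
    return @results;
}

print "$_\n" for pyx_results( @pyx_lines);
```

Working on a list of lines rather than on STDIN makes the logic easy to try without installing the pyx tool; with the real tool you would feed it the lines read from the pipe.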

And finally the XML::XPath version. I am not very familiar with this module so it is probably not the most elegant way to do it, but hey, it works!

#!/bin/perl -w
use strict;
use XML::XPath;

my @results;

# create the xpath object from the DATA filehandle
my $xp = XML::XPath->new( ioref => \*DATA);

# find all elt elements
my $elts = $xp->find('/doc/elt');

foreach my $elt ($elts->get_nodelist)
  { # there is probably a more elegant way to get the child children
    # but I don't know XML::XPath enough
    my $children= $elt->getChildNodes;                          # get all children
    my @children= grep { $_->getName eq 'child' } @$children;   # grep only the relevant ones
    if( @children)
      { push @results, $children[0]->string_value; }            # that's how you get the text
    else
      { push @results, "missing child for elt " . $elt->getAttribute( 'id'); }
  }

print join "\n", @results;
print "\n";

__DATA__
<doc>
  <dummy>I am a </dummy>
  <elt id="elt1"><child_1/><child>child 1</child></elt>
  <elt id="elt2"><child_1/></elt>
  <elt id="elt3"><child_1/><child>child 2</child></elt>
  <elt id="elt4"><child_1/></elt>
</doc>

In reply to Re: Expat by mirod
in thread Using Expat: how to extranct and manipulate elements? by Laila
