thandi has asked for the wisdom of the Perl Monks concerning the following question:

I need to parse the following XML file. I am an absolute beginner and don't know how. These are some of the questions that I need to answer:

List all <AI> tags within the <AC n="CCC">
List only <AI> tags and their <Desc> within the <AC n="CCC">
List all <ID> tags and their <Desc> within <AI n="AAA">
etc...

This means that I need to extract different information depending on the arguments given on the command line.

Here's the 'modified' XML file:

<World n="earth" >
  <Space n="XXX" >
    <CL n="XXX">
      <Desc>CL desc</Des>
      <Other/>
      <AC n="AAA" set="n">
        <Desc>AC AAA desc</Desc>
        <AI n="AAA" set="n">
          <Desc>AI AAA desc</Desc>
          <ID n="AAA" set="y">
            <Desc>AAA ID desc</Desc>
            <What>What AAA ID </What>
            <AR>ID_aaa</AR>
          </ID>
          <ID n="BBB" set="y">
            <Desc>BBB ID desc</Desc>
            <What>What BBB ID </What>
            <AR>ID_bbb</AR>
          </ID>
        </AI>
        <AI n="BBB" set="y">
          <Desc>AI BBB desc</Desc>
          <ID n="AAA" set="y">
            <Desc>AAA ID desc</Desc>
            <What>What AAA ID </What>
            <AR>ID_aaa</AR>
          </ID>
        </AI>
      </AC>
      <AC n="CCC" set="y">
        <AI n="AAA" set="n">
          <Desc>AI AAA desc</Desc>
          <ID n="AAA" set="y">
            <Desc>AAA ID desc</Desc>
            <What>What AAA ID </What>
            <AR>ID_aaa</AR>
          </ID>
        </AI>
        <AI n="AAA" set="n">
          <Desc>AI AAA desc</Desc>
          <ID n="AAA" set="y">
            <Desc>AAA ID desc</Desc>
            <What>What AAA ID </What>
            <AR>ID_aaa</AR>
          </ID>
        </AI>
      </AC>
    <CL n="XXX">
  <Space n="XXX" >
<World n="earth" >
tks thandi

Re: XML::Twig question
by Tanktalus (Canon) on Dec 12, 2006 at 14:48 UTC
Re: XML::Twig question
by zentara (Cardinal) on Dec 12, 2006 at 15:55 UTC
    Check out PerlSax

    Just give your XML file as input to the following script and watch the output. Then you can set up your handlers to filter out whatever you want. The advantage of this approach is that you can feed it huge files and it will process nodes as it finds them.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use XML::Parser::PerlSAX;

    my $parser = new XML::Parser::PerlSAX( Handler => new SampleHandler );
    $parser->parse( Source => { SystemId => shift } );

    package SampleHandler;

    sub new {
        my $self = {};
        return bless( $self );
    }

    sub start_document { print "start_document\n"; }

    sub end_document { print "end_document\n"; }

    sub start_element {
        my ( $self, $element ) = @_;
        my $name = $element->{ Name };
        print "start_element: '$name'\n";
        while ( my ( $k, $v ) = each( %{ $element->{ Attributes } } ) ) {
            print " attribute: $k = $v\n";
        }
    }

    #### a sample sub for parsing a specific node ###########
    #sub start_element {
    #    my ( $self, $element ) = @_;
    #    my $name = $element->{ Name };
    #    ## print "start_element: '$name'\n";
    #    if ( $name eq 'node' ) {
    #        my %node = %{ $element->{ Attributes } };
    #        print $node{ 'id' }, ' ', $node{ 'lat' }, ' ', $node{ 'lon' };
    #    }
    #}
    ########################################

    sub end_element {
        my ( $self, $element ) = @_;
        my $name = $element->{ Name };
        print "end_element: '$name'\n";
    }

    sub characters {
        my ( $self, $text ) = @_;
        my $data = $text->{ Data };
        print "characters: '$data'\n";
    }

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: XML::Twig question
by mirod (Canon) on Dec 12, 2006 at 16:28 UTC

    First, your "XML" file is not really XML: it is not well-formed. That makes it hard to show you examples of what you can do with XML::Twig. Second, it is not clear from your question what you mean by "tag". Is it just the start tag, the entire element, or the text of the element? The difference between tag and element is an important one in XML.
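
    To make the tag/element/text distinction concrete, here is a minimal sketch using XML::Twig. The sample string and the helper name describe_element are made up for illustration, and the sketch assumes XML::Twig is installed.

```perl
use strict;
use warnings;
use XML::Twig;

# describe_element is a hypothetical helper: it parses a small XML string
# and returns three different views of the root element.
sub describe_element {
    my ($xml) = @_;
    my $twig = XML::Twig->new;
    $twig->parse($xml);
    my $elt = $twig->root;
    return (
        $elt->start_tag,    # only the opening tag, attributes included
        $elt->sprint,       # the entire element: tags plus content
        $elt->text,         # character data only, markup stripped
    );
}

# Made-up sample, modelled on the AI elements in the question.
my ( $start_tag, $element, $text ) =
    describe_element('<AI n="AAA" set="n"><Desc>AI AAA desc</Desc></AI>');

print "start tag: $start_tag\n";
print "element  : $element\n";
print "text     : $text\n";
```

    Which of the three you print decides what "list all AI tags" actually produces, so it is worth settling this before writing any handlers.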

    Then, if all you want is to extract information from the file, you might want to have a look at xml_grep, which comes with XML::Twig. Have a look at the docs. If your files are not too big (i.e. XML::LibXML can load them in memory), you can also use xml_grep2, by the same author, which has more complete XPath support (once again, at the cost of loading the entire document in memory).

    Otherwise, the code below will print the start tags of the AI elements in AC elements with an n attribute of CCC:

    XML::Twig->new( twig_handlers => { 'AC[@n="CCC"]//AI' => sub { print $_->start_tag, "\n"; } } )
             ->parsefile( "my.xml");

    If you know that AI elements will always be direct children of AC elements, you can replace the '//' with a single '/'. If your XML file might be big, you could add a $_->purge at the end of the anonymous sub in order to release some memory, or you could use the twig_roots option instead of twig_handlers.
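
    A rough sketch of the twig_roots variant, under a few assumptions: it triggers on plain AC elements and checks the n attribute in Perl rather than in the path, the helper name collect_ai_tags and the inline sample are hypothetical, and XML::Twig is assumed to be installed.

```perl
use strict;
use warnings;
use XML::Twig;

# collect_ai_tags is a hypothetical helper: it returns the start tags of
# the AI children of every AC element whose n attribute equals $wanted.
sub collect_ai_tags {
    my ( $xml, $wanted ) = @_;
    my @tags;
    my $twig = XML::Twig->new(
        # With twig_roots, only AC subtrees are built in memory.
        twig_roots => {
            AC => sub {
                my ( $t, $ac ) = @_;
                if ( ( $ac->att('n') // '' ) eq $wanted ) {
                    push @tags, map { $_->start_tag } $ac->children('AI');
                }
                $t->purge;    # free what has been parsed so far
            },
        },
    );
    $twig->parse($xml);
    return @tags;
}

# Made-up sample document, much smaller than the one in the question.
my $xml = '<World>'
        . '<AC n="AAA"><AI n="A1"><Desc>a1</Desc></AI></AC>'
        . '<AC n="CCC"><AI n="C1"><Desc>c1</Desc></AI>'
        . '<AI n="C2"><Desc>c2</Desc></AI></AC>'
        . '</World>';

print "$_\n" for collect_ai_tags( $xml, 'CCC' );
```

    For a file on disk, parsefile can be used in place of parse. Doing the attribute test inside the handler keeps the path expression trivial, at the cost of building every AC subtree before it is discarded.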

    I hope that gets you started.

      What I need is to match a condition on the AC element (i.e. search for n="CCC", or any other value being searched for). From there I need to test the AI element in a similar fashion, using 'n=???' and/or 'set=???' conditions. This should then allow me to extract the text from the other tags between the <ID> and </ID> tags.

      That's what I'm really after, plus the text between <Desc>???</Desc>, starting from just below the <AC> tag. The text between <Desc> and </Desc> is quite important because it can be changed or amended.

      Also, is it possible to substitute "CCC" below with a variable, e.g. my $CCC = "CCC"?
      XML::Twig->new( twig_handlers => { 'AC@n=$CCC//AI' => sub { print $_->start_tag, "\n"; }->parsefile( "my.xml");

      ID will always be a direct child of AI.
      AI will always be a direct child of AC.
      All of the above will always have a <Desc>???</Desc> below them describing what they are about.

      <World n="earth" >
        <Space n="XXX">
          <CL n="XXX">
            <Desc>CL desc</Desc>
            <Other/>
            <AC n="AAA" set="n">
              <Desc>AC AAA desc</Desc>
              <AI n="AAA" set="n">
                <Desc>AI AAA desc</Desc>
                <ID n="AAA" set="y">
                  <Desc>AAA ID desc</Desc>
                  <What>What AAA ID </What>
                  <AR>ID_aaa</AR>
                </ID>
                <ID n="BBB" set="y">
                  <Desc>BBB ID desc</Desc>
                  <What>What BBB ID </What>
                  <AR>ID_bbb</AR>
                </ID>
              </AI>
              <AI n="BBB" set="y">
                <Desc>AI BBB desc</Desc>
                <ID n="AAA" set="y">
                  <Desc>AAA ID desc</Desc>
                  <What>What AAA ID </What>
                  <AR>ID_aaa</AR>
                </ID>
              </AI>
            </AC>
            <AC n="CCC" set="y">
              <AI n="AAA" set="n">
                <Desc>AI AAA desc</Desc>
                <ID n="AAA" set="y">
                  <Desc>AAA ID desc</Desc>
                  <What>What AAA ID </What>
                  <AR>ID_aaa</AR>
                </ID>
              </AI>
              <AI n="AAA" set="n">
                <Desc>AI AAA desc</Desc>
                <ID n="AAA" set="y">
                  <Desc>AAA ID desc</Desc>
                  <What>What AAA ID </What>
                  <AR>ID_aaa</AR>
                </ID>
              </AI>
            </AC>
          </CL>
        </Space>
      </World>

        Here is what it would look like:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use XML::Twig;

        my( $ac_n_value, $ai_att_cond)= @ARGV;

        XML::Twig->new( twig_handlers => { qq{AC[\@n="$ac_n_value"]//AI[$ai_att_cond]} => \&print_ai_data })
                 ->parsefile( "test_thandi.xml");

        sub print_ai_data
          { my( $t, $ai)= @_;
            print "DESC: ", $ai->first_child( 'Desc')->sprint, "\n",
                  "ID : ", $ai->first_child( 'ID')->sprint, "\n";
          }

        You call this with the value you want for the n attribute of the AC element, and the condition for the AI element:

        perl test_thandi CCC '@n="AAA"'
        perl test_thandi CCC '@n="AAA" or @set="y"'

        A couple of comments on the code: Perl and XPath strings don't really mix very well: you can use alternate quotes (qq{}) to avoid the collision of perl interpolating quotes and of XPath attribute quotes (or you can use ' instead of " in the XPath expression), but you need to backslash the @ used for attribute conditions in XPath, so it is not interpolated as an array by Perl. An alternate method is to use sprintf to build the XPath expression.
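
        As a small illustration of the sprintf alternative mentioned above (build_ai_path is a hypothetical helper name, not part of XML::Twig): because the template is a single-quoted Perl string, neither the @ nor the double quotes need escaping.

```perl
use strict;
use warnings;

# build_ai_path is a hypothetical helper that assembles the handler path.
# The single-quoted template is not interpolated by Perl, so @n stays
# literal and the XPath double quotes need no backslashes.
sub build_ai_path {
    my ( $ac_n_value, $ai_att_cond ) = @_;
    return sprintf 'AC[@n="%s"]//AI[%s]', $ac_n_value, $ai_att_cond;
}

print build_ai_path( 'CCC', '@set="y"' ), "\n";
# prints: AC[@n="CCC"]//AI[@set="y"]
```

        The returned string can then be used as the key in the twig_handlers hash in place of the qq{} expression.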

        This code loads the entire document in memory, which you may or may not want. There are techniques to avoid this described in the XML::Twig Tutorial.

        Also, you need the development version of XML::Twig to be able to run this; you can get it from xmltwig.com.

Re: XML::Twig question
by Jenda (Abbot) on Dec 29, 2006 at 15:46 UTC

    Apart from XML::Twig you could also use XML::Rules like this:

    use XML::Rules;

    my $AC_n = 'CCC';

    my $parser = XML::Rules->new(
        rules => [
            _default => 'as is',
                # by default keep both attributes and _content
            'Desc,What,AR' => 'content',
                # for those keep just the content
            '^AC' => sub {return ($_[1]->{n} eq $AC_n)},
                # only process the <AC> tags whose n attribute equals $AC_n
            AC => '',
                # once processed, forget the contents of the <AC> tag
            AI => sub {print "Found AI: n=$_[1]->{n}, desc=$_[1]->{Desc}\n"; return},
                # for each processed AI tag print the n attribute and Desc subtag.
                # thanks to the 2nd rule you don't have to write
                #   $_[1]->{Desc}{_content}
                # As this rule returns nothing, the contents of the tag are not remembered.
        ],
    );

    $parser->parse($XML);

    Unlike some of the suggested XML::Twig solutions, this doesn't keep the whole document in memory. At any moment, at most the data from a single <AI> tag and the attributes of the unclosed parent tags are kept.