paragkalra has asked for the wisdom of the Perl Monks concerning the following question:

Hello All,

This is the first time I am working with an XML file through Perl. In fact, it is the first time I am using any XML file at all, so I may get some of the XML terminology wrong. I apologize for that.

Following is the structure of my XML file

<root>
  <start_element>
    <element_num>1</element_num>
    <child>MyChild</child>
  </start_element>
  <start_element>
    <element_num>2</element_num>
    <child>MyChild</child>
    <user_id>MyUser</user_id>
  </start_element>
  <start_element>
    <element_num>3</element_num>
    <child>MyChild</child>
  </start_element>
</root>

As you can see the xml file starts with the root element followed by many elements having the tag – ‘start_element’. Some of the elements may have child elements with the tag – user_id, as shown in the case of 2nd element. The value of the child element <user_id> can be anything.

Now I want to split the original XML file into 2 XML files. The first XML file will contain all the elements that have a <user_id> child element. The second file will contain all the remaining elements. So for the above example, I will need to generate 2 files.

The first XML file will contain the following:

<root>
  <start_element>
    <element_num>2</element_num>
    <child>MyChild</child>
    <user_id>MyUser</user_id>
  </start_element>
</root>

The second XML file will contain the remaining 2 elements:

<root>
  <start_element>
    <element_num>1</element_num>
    <child>MyChild</child>
  </start_element>
  <start_element>
    <element_num>3</element_num>
    <child>MyChild</child>
  </start_element>
</root>

Currently I am planning to handle the above requirement using simple Perl regexes. But I feel it could be made simpler using one of the available modules.

So I have following questions:

1. Which are the best available XML modules for Perl?

2. Out of the best available modules, which one would suit my requirement best?

3. Any pointers to specific methods of the XML modules that would meet my needs would be helpful.

TIA

Cheers,

Parag

Replies are listed 'Best First'.
Re: Need to process a XML file through Perl
by Jenda (Abbot) on Dec 02, 2009 at 12:09 UTC
    use strict;
    use warnings;
    no warnings 'uninitialized';
    use XML::Rules;

    my $parser = XML::Rules->new(
        rules => {
            # Keep every tag as parsed and also record it in the parent's
            # attribute hash under ":tagname" for easy lookup.
            _default => 'raw extended',
            start_element => sub {
                my ($tag, $attr, $context, $parents, $parser) = @_;
                # Filehandle 0 gets elements with a <user_id> child, filehandle 1 the rest.
                print { $parser->{parameters}[ exists $attr->{':user_id'} ? 0 : 1 ] }
                    $parser->ToXML($tag, $attr);
            },
        },
    );

    open my $FH1, '>', 'c:\temp\test1.xml' or die "Cannot open test1.xml: $!";
    open my $FH2, '>', 'c:\temp\test2.xml' or die "Cannot open test2.xml: $!";

    print $FH1 "<root>\n";
    print $FH2 "<root>\n";
    # The second argument becomes $parser->{parameters} inside the handlers.
    $parser->parse(\*DATA, [$FH1, $FH2]);
    print $FH1 "\n</root>\n";
    print $FH2 "\n</root>\n";

    __DATA__
    <root>
      <start_element>
        <element_num>1</element_num>
        <child>MyChild</child>
      </start_element>
      <start_element>
        <element_num>2</element_num>
        <child>MyChild</child>
        <user_id>MyUser</user_id>
      </start_element>
      <start_element>
        <element_num>3</element_num>
        <child>MyChild</child>
      </start_element>
    </root>

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      Hi Jenda,

      Your code worked like magic. Thanks a bunch for that.

      However my actual XML file is slightly different than the one I posted previously.

      Following is my actual XML file:
      <root>
        <!-- First Element -->
        <start_element>
          <header>
            <element_num>1</element_num>
          </header>
          <contents>
            <child>MyChild</child>
          </contents>
        </start_element>
        <!-- Second Element -->
        <start_element>
          <header>
            <element_num>2</element_num>
          </header>
          <contents>
            <child>MyChild</child>
            <userID>MyUser</userID>
          </contents>
        </start_element>
        <!-- Third Element -->
        <start_element>
          <header>
            <element_num>3</element_num>
          </header>
          <contents>
            <child>MyChild</child>
          </contents>
        </start_element>
      </root>

      I tried your old code on this XML file but it didn't work. I guess some modification is needed. I am too much of a beginner to modify your code myself. Could you please guide me in making the necessary change?

      Also, I am trying to provide the XML file as an input parameter. For that I have slightly modified your code, but somehow it's not working, as shown below. Any pointers?

      use strict;
      use warnings;
      no warnings 'uninitialized';
      use XML::Rules;

      my $parser = XML::Rules->new(
          rules => {
              _default => 'raw extended',
              start_element => sub {
                  my ($tag, $attr, $context, $parents, $parser) = @_;
                  print { $parser->{parameters}[ exists $attr->{':userID'} ? 0 : 1 ] }
                      $parser->ToXML($tag, $attr);
              },
          },
      );

      open my $FH1, '>', 'test1.xml';
      open my $FH2, '>', 'test2.xml';
      open my $FH3, "$ARGV[0]";

      print $FH1 "<root>\n";
      print $FH2 "<root>\n";
      $parser->parse( <$FH3>, [$FH1, $FH2]);
      print $FH1 "\n</root>\n";
      print $FH2 "\n</root>\n";

      And if you could please explain your code line by line, that would be great, as I am really curious to know how it works and the significance of each line and each method. :)

        If I understand things right, you want to look for the <userID> tag within the <contents> tag, right? In that case change

        print {$parser->{parameters}[( exists $attr->{':userID'} ? 0 : 1 )]} $parser->ToXML( $tag, $attr);

        to

        print {$parser->{parameters}[( (exists $attr->{':contents'} and exists $attr->{':contents'}{':userID'}) ? 0 : 1 )]} $parser->ToXML( $tag, $attr);

        If you wanted to check for the <userID> tag anywhere below the <start_element> tag, we'd have to write it differently. Something like:

        use strict;
        use warnings;
        no warnings 'uninitialized';
        use XML::Rules;

        my $parser = XML::Rules->new(
            rules => {
                _default => 'raw',
                # Start tag seen: reset the flag for this element.
                '^start_element' => sub {
                    my ($tag, $attr, $context, $parents, $parser) = @_;
                    $parser->{pad}{found_userID} = 0;
                    return 1;
                },
                # A <userID> anywhere below the current element sets the flag.
                userID => sub {
                    my ($tag, $attr, $context, $parents, $parser) = @_;
                    $parser->{pad}{found_userID} = 1;
                    return [$tag => $attr];
                },
                # End tag seen: the flag is used as the filehandle index, so index 0
                # (test1.xml) gets elements without <userID>, index 1 (test2.xml) those with it.
                start_element => sub {
                    my ($tag, $attr, $context, $parents, $parser) = @_;
                    print { $parser->{parameters}[ $parser->{pad}{found_userID} ] }
                        $parser->ToXML($tag, $attr), "\n";
                },
            },
        );

        open my $FH1, '>', 'c:\temp\test1.xml' or die "Cannot open test1.xml: $!";
        open my $FH2, '>', 'c:\temp\test2.xml' or die "Cannot open test2.xml: $!";

        print $FH1 "<root>\n";
        print $FH2 "<root>\n";
        $parser->parse(\*DATA, [$FH1, $FH2]);
        print $FH1 "</root>\n";
        print $FH2 "</root>\n";

        __DATA__
        <root>
          <!-- First Element -->
          <start_element>
            <header>
              <element_num>1</element_num>
            </header>
            <contents>
              <child>MyChild</child>
            </contents>
          </start_element>
          <!-- Second Element -->
          <start_element>
            <header>
              <element_num>2</element_num>
            </header>
            <contents>
              <child>MyChild</child>
              <userID>MyUser</userID>
            </contents>
          </start_element>
          <!-- Third Element -->
          <start_element>
            <header>
              <element_num>3</element_num>
            </header>
            <contents>
              <child>MyChild</child>
            </contents>
          </start_element>
        </root>

        Let me try to explain. XML::Rules lets you specify what to do with the data for a tag once the start tag is parsed (the "^tagname" rules; only the attributes are available at that point) or once the end tag is parsed (the "tagname" rules; the attributes, the textual content and whatever the handlers for the child tags returned are available). The handler may decide to ignore the data, process it somehow, or just pass it to the handler of the parent tag.

        The way the handler returns the data affects how it is made available to the handler of the parent tag. It may be added to the hash of attributes, joined with the textual content, push()ed onto the end of the parent's contents, combined with an already existing attribute, or any combination of those possibilities.

        There are quite a few built-in rules specifying what gets passed and how. The 'raw' rule used in the new script puts all the data for a tag into the parent's content in a way that ensures that the later ->ToXML() call writes exactly what was parsed, including whitespace. The 'raw extended' rule does the same thing, but also adds the tag's data to the parent tag's attribute hash under the ':'.$tagname key. This makes it easier to check whether that child tag was present.

        The handlers may also be subroutine references or anonymous subroutines. The one in the older script checks whether there was a child tag named 'user_id' (the _default handler puts it into the start_element's content for output and into its attribute hash for fast lookup) and, based on that, chooses which filehandle to print the tag and its data to, converted back to XML. The scary-looking line could have been written like this:

        my $FH;
        if (exists $attr->{':user_id'}) {
            $FH = $parser->{parameters}[0];
        } else {
            $FH = $parser->{parameters}[1];
        }
        print $FH $parser->ToXML( $tag, $attr);

        The other script works differently. In the '^start_element' handler it resets the flag (stored in $parser->{pad}, which is an attribute of the parser specifically "to put anything you want to and access it in any handler"). Then, if the <userID> tag is encountered, the flag is set. Finally, the 'start_element' handler selects one of the filehandles passed to $parser->parse() and prints the tag and its data there.
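The block form of print used in both scripts, print {EXPR} LIST, is plain Perl and not specific to XML::Rules. A minimal sketch of the idiom on its own follows; the in-memory filehandles, the @records data and all variable names are assumptions made up for illustration, standing in for the two output files and the parsed elements.

```perl
use strict;
use warnings;

# Two in-memory filehandles stand in for the two output files
# (hypothetical names, for illustration only).
my ($with_buf, $without_buf) = ('', '');
open my $with_fh,    '>', \$with_buf    or die "Cannot open in-memory handle: $!";
open my $without_fh, '>', \$without_buf or die "Cannot open in-memory handle: $!";
my @handles = ($with_fh, $without_fh);

# Made-up records: only the second one carries a user id.
my @records = (
    { num => 1 },
    { num => 2, user_id => 'MyUser' },
    { num => 3 },
);

for my $rec (@records) {
    # The block after print may hold any expression that yields a filehandle.
    print { $handles[ exists $rec->{user_id} ? 0 : 1 ] } "element $rec->{num}\n";
}

close $_ for @handles;

print "with user_id:\n$with_buf";
print "without user_id:\n$without_buf";
```

The braces are needed because print only accepts a bareword or a simple scalar variable as its filehandle; anything more complex, such as an array element or a ternary, has to be wrapped in a block.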

        The rest is simple: an object is created, files are opened, text is printed, the parse() method is called (which reads the XML and calls the handlers as it goes through the XML) and the closing tag is printed.
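On the attempt to read the XML from a file named on the command line: in the call $parser->parse( <$FH3>, [$FH1, $FH2]), the <$FH3> sits in an argument list, so Perl evaluates it in list context and passes every line of the file as a separate argument. The sketch below shows that effect with a hypothetical stand-in, fake_parse, that only counts what it receives; it does not use XML::Rules at all, and the sample XML string is made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parse($source, $parameters) style method;
# it only reports how many arguments arrived and what the second one is.
sub fake_parse {
    my ($source, $parameters) = @_;
    return {
        argc               => scalar(@_),
        second_is_arrayref => (ref($parameters) eq 'ARRAY' ? 1 : 0),
    };
}

my $xml = "<root>\n<start_element/>\n</root>\n";

# <$in> inside an argument list is list context: one argument per line.
open my $in, '<', \$xml or die "Cannot open in-memory handle: $!";
my $by_lines = fake_parse(<$in>, ['FH1', 'FH2']);
close $in;

# Passing the handle itself keeps the call at two arguments.
open $in, '<', \$xml or die "Cannot open in-memory handle: $!";
my $by_handle = fake_parse($in, ['FH1', 'FH2']);
close $in;

print "line by line: $by_lines->{argc} arguments\n";
print "by handle:    $by_handle->{argc} arguments\n";
```

With XML::Rules the corresponding change would presumably be $parser->parse($FH3, [$FH1, $FH2]), mirroring the \*DATA call in the scripts above. Checking the result of open with "or die" also helps when the path in $ARGV[0] is wrong.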

        Jenda

Re: Need to process a XML file through Perl
by Anonymous Monk on Dec 02, 2009 at 09:24 UTC
    Currently I am planning to process the above requirement using simple Perl regex.

    Parsing XML is hard, use XML::Twig, XML::LibXML,...
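As a rough illustration of the module route, here is a minimal sketch using XML::LibXML, assuming the module is installed and the document is small enough to load into memory. The helper name filter_elements and the inline sample data are assumptions made up for this example; the XPath expressions follow the structure from the original question.

```perl
use strict;
use warnings;
use XML::LibXML;

# Return the document as a string, keeping only the <start_element> children
# that do (or do not) contain a <user_id> descendant.
# filter_elements is a hypothetical helper name, not part of XML::LibXML.
sub filter_elements {
    my ($xml, $keep_with_user_id) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    for my $el ($doc->findnodes('/root/start_element')) {
        my @hits = $el->findnodes('.//user_id');
        my $has_user_id = @hits ? 1 : 0;
        # Drop the element when it belongs in the other file.
        $el->parentNode->removeChild($el) if $has_user_id != $keep_with_user_id;
    }
    return $doc->toString;
}

my $xml = <<'XML';
<root>
  <start_element>
    <element_num>1</element_num>
    <child>MyChild</child>
  </start_element>
  <start_element>
    <element_num>2</element_num>
    <child>MyChild</child>
    <user_id>MyUser</user_id>
  </start_element>
  <start_element>
    <element_num>3</element_num>
    <child>MyChild</child>
  </start_element>
</root>
XML

my $with    = filter_elements($xml, 1);   # would go to the first file
my $without = filter_elements($xml, 0);   # would go to the second file
print $with, $without;
```

Parsing the input twice keeps the sketch short; for large files a streaming approach such as the XML::Rules or XML::Twig ones mentioned in this thread is likely a better fit.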

Re: Need to process a XML file through Perl
by mirod (Canon) on Dec 02, 2009 at 18:46 UTC

    As said before, don't use regexp, use a module.

    But much to the surprise of everyone here, I am not going to pimp XML::Twig this time! xml_grep2, packaged as App::xml_grep2, will do this quite simply from the command line:

    xml_grep2 -v '//start_element[.//user_id]' user.xml > elements_with_no_user_id.xml
    xml_grep2 -v '//start_element[count(.//user_id)=0]' user.xml > elements_with_user_id.xml

    xml_grep2 works like grep, except that it uses XPath instead of regular expressions. So the -v flag will filter out any element that matches the XPath expression. The first expression matches elements with a user_id descendant, and the second one matches elements with none.

    Voilà!

Re: Need to process a XML file through Perl
by paragkalra (Scribe) on Dec 03, 2009 at 18:45 UTC

    One more question Jenda

    What I noticed in the new files generated by XML::Rules is that elements having no content, e.g. <attribute1></attribute1> in the original file, were simply written as <attribute1/> in the new file.

    I also found that original tags like

    <contents xsi:type="abc" xmlns:imp="xyz">

    were written as

    <contents xmlns:imp="xyz" xsi:type="abc">

    As you can see, the order of the attributes inside the tag has changed.

    I am not sure whether these are XML features or something specific to the module.

      As far as XML is concerned, <tag></tag> and <tag/> are equivalent and the order of attributes is irrelevant. If I really had to, I would be able to preserve the distinction and the order, but it's not worth the effort.
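One way to check this equivalence for yourself is to compare the canonical forms of two documents. The sketch below assumes XML::LibXML is installed and uses its toStringC14N method; the two sample strings are made up, and plain attribute names a and b replace the namespaced ones from the question so that the snippets parse on their own.

```perl
use strict;
use warnings;
use XML::LibXML;

# Same element, but with the attribute order swapped and the empty child
# written in the two different ways discussed above (made-up sample data).
my $original  = '<contents a="abc" b="xyz"><attribute1></attribute1></contents>';
my $rewritten = '<contents b="xyz" a="abc"><attribute1/></contents>';

# Canonical XML normalizes attribute order and empty-element syntax.
my $canon_original  = XML::LibXML->load_xml(string => $original)->toStringC14N;
my $canon_rewritten = XML::LibXML->load_xml(string => $rewritten)->toStringC14N;

print $canon_original eq $canon_rewritten ? "equivalent\n" : "different\n";
```

If the two canonical strings match, any conforming XML consumer should treat the files as carrying the same data, even though a textual diff of the originals shows differences.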

      Jenda

Re: Need to process a XML file through Perl
by paragkalra (Scribe) on Dec 04, 2009 at 06:13 UTC

    Thanks Jenda for the confirmation