in reply to Re: Need to process a XML file through Perl
in thread Need to process a XML file through Perl

Hi Jenda,

Your code worked like magic. Thanks a bunch for that.

However my actual XML file is slightly different than the one I posted previously.

Following is my actual XML file:
    <root>
      <!-- First Element -->
      <start_element>
        <header>
          <element_num>1</element_num>
        </header>
        <contents>
          <child>MyChild</child>
        </contents>
      </start_element>
      <!-- Second Element -->
      <start_element>
        <header>
          <element_num>2</element_num>
        </header>
        <contents>
          <child>MyChild</child>
          <userID>MyUser</userID>
        </contents>
      </start_element>
      <!-- Third Element -->
      <start_element>
        <header>
          <element_num>3</element_num>
        </header>
        <contents>
          <child>MyChild</child>
        </contents>
      </start_element>
    </root>

I tried your old code on this XML file but it didn't work. I guess some modification is needed, but I am too much of a beginner to modify your code myself. Could you please guide me in making the necessary change?

I am also trying to pass the XML file as an input parameter. For that I have slightly modified your code, as shown below, but somehow it's not working. Any pointers?

    use strict;
    use warnings;
    no warnings 'uninitialized';
    use XML::Rules;

    my $parser = XML::Rules->new(
        rules => {
            _default => 'raw extended',
            start_element => sub {
                my ($tag,$attr,$context,$parents,$parser) = @_;
                print {$parser->{parameters}[( exists $attr->{':userID'} ? 0 : 1 )]} $parser->ToXML( $tag, $attr);
            }
        }
    );

    open my $FH1, '>', 'test1.xml';
    open my $FH2, '>', 'test2.xml';
    open my $FH3, "$ARGV[0]";
    print $FH1 "<root>\n";
    print $FH2 "<root>\n";
    $parser->parse( <$FH3>, [$FH1, $FH2]);
    print $FH1 "\n</root>\n";
    print $FH2 "\n</root>\n";

And if you could explain your code line by line, that would be great, as I am really curious to know how it works and the significance of each and every line and method. :)

Re^3: Need to process a XML file through Perl
by Jenda (Abbot) on Dec 02, 2009 at 17:32 UTC

    If I understand things right you want to look for the <userID> tag within the <contents> tag, right? In that case change

        print {$parser->{parameters}[( exists $attr->{':userID'} ? 0 : 1 )]} $parser->ToXML( $tag, $attr);

    to

        print {$parser->{parameters}[( (exists $attr->{':contents'} and exists $attr->{':contents'}{':userID'}) ? 0 : 1 )]} $parser->ToXML( $tag, $attr);
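    The two-step exists test matters because a bare exists $attr->{':contents'}{':userID'} would autovivify an empty ':contents' entry whenever the tag was missing. Below is a minimal core-Perl sketch of that guard together with the print { EXPR } LIST form. The hashes are hand-built stand-ins whose shape is only an assumption about what 'raw extended' produces, and handle_index and the in-memory filehandles are hypothetical names used for illustration.

```perl
use strict;
use warnings;

# Hand-built stand-ins for the attribute hash the start_element handler
# receives; the exact shape is an assumption made for this sketch.
my $with_user    = { ':contents' => { ':child' => 'MyChild', ':userID' => 'MyUser' } };
my $without_user = { ':contents' => { ':child' => 'MyChild' } };
my $no_contents  = {};

# Returns 0 when <contents> holds a <userID>, 1 otherwise.
sub handle_index {
    my ($attr) = @_;
    my $found = exists $attr->{':contents'}
        && exists $attr->{':contents'}{':userID'};
    return $found ? 0 : 1;
}

# In-memory filehandles stand in for test1.xml and test2.xml.
my ($buf_with, $buf_without) = ('', '');
open my $FH1, '>', \$buf_with    or die "Cannot open in-memory handle: $!";
open my $FH2, '>', \$buf_without or die "Cannot open in-memory handle: $!";
my @handles = ($FH1, $FH2);

# print { EXPR } LIST prints to whatever filehandle the block evaluates to.
print { $handles[ handle_index($with_user) ] }    "element 2\n";
print { $handles[ handle_index($without_user) ] } "element 1\n";
print { $handles[ handle_index($no_contents) ] }  "element 0\n";
close $_ for @handles;

# The guarded lookup did not create a ':contents' key in the empty hash.
my $guarded_created = exists $no_contents->{':contents'} ? 1 : 0;

# An unguarded nested exists creates the intermediate key as a side effect.
my %probe;
my $unused = exists $probe{':contents'}{':userID'};
my $unguarded_created = exists $probe{':contents'} ? 1 : 0;
```

    After this runs, $buf_with holds the one record that had a userID and $buf_without holds the other two.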

    If you wanted to check for the <userID> tag anywhere below the <start_element>, we'd have to write it differently. Something like

        use strict;
        use warnings;
        no warnings 'uninitialized';
        use XML::Rules;

        my $parser = XML::Rules->new(
            rules => {
                _default => 'raw',
                '^start_element' => sub {
                    my ($tag,$attr,$context,$parents,$parser) = @_;
                    $parser->{pad}{found_userID} = 0;
                    return 1
                },
                userID => sub {
                    my ($tag,$attr,$context,$parents,$parser) = @_;
                    $parser->{pad}{found_userID} = 1;
                    return [$tag => $attr]
                },
                start_element => sub {
                    my ($tag,$attr,$context,$parents,$parser) = @_;
                    print { $parser->{parameters}[ $parser->{pad}{found_userID} ] } $parser->ToXML( $tag, $attr), "\n";
                }
            }
        );

        open my $FH1, '>', 'c:\temp\test1.xml';
        open my $FH2, '>', 'c:\temp\test2.xml';
        print $FH1 "<root>\n";
        print $FH2 "<root>\n";
        $parser->parse( \*DATA, [$FH1, $FH2]);
        print $FH1 "</root>\n";
        print $FH2 "</root>\n";

        __DATA__
        <root>
          <!-- First Element -->
          <start_element>
            <header>
              <element_num>1</element_num>
            </header>
            <contents>
              <child>MyChild</child>
            </contents>
          </start_element>
          <!-- Second Element -->
          <start_element>
            <header>
              <element_num>2</element_num>
            </header>
            <contents>
              <child>MyChild</child>
              <userID>MyUser</userID>
            </contents>
          </start_element>
          <!-- Third Element -->
          <start_element>
            <header>
              <element_num>3</element_num>
            </header>
            <contents>
              <child>MyChild</child>
            </contents>
          </start_element>
        </root>

    Let me try to explain. XML::Rules lets you specify what to do with the data for a tag once the start tag is parsed (the "^tagname" rules, where only the attributes are available) or once the end tag is parsed (the "tagname" rules, where the attributes, the textual content and whatever the "handlers" for the child tags returned are available). The handler may decide to ignore the data, process it somehow or just pass it to the handler of the parent tag.

    The way the handler returns the data affects how it is made available to the handler of the parent tag. It may be added to the hash of attributes, joined with the textual content, push()ed at the end of the parent's contents, combined with an already existing attribute, or any combination of those possibilities.

    There are quite a few built-in rules specifying what gets passed and how. The 'raw' rule used in the new script puts all the data for a tag into the parent's content in a way that ensures that the ->ToXML() call later will write exactly what was parsed, including whitespace. The 'raw extended' rule does the same thing, but also adds the tag's data to the parent tag's attribute hash under the ':'.$tagname name. This makes it easier to check whether that child tag was present.

    The handlers may also be subroutine references or unnamed subroutines. The one in the older script checks whether there was a child tag named 'user_id' (the _default handler would put it into the start_element's content for output and into its attribute hash for fast lookup) and based on that chooses which filehandle to print the tag and its data, converted back to XML, to. The scary-looking line could have been written like this:

        my $FH;
        if (exists $attr->{':user_id'}) {
            $FH = $parser->{parameters}[0];
        } else {
            $FH = $parser->{parameters}[1];
        }
        print $FH $parser->ToXML( $tag, $attr);

    The other script works differently. In the '^start_element' handler it resets the flag (stored in $parser->{pad}, which is an attribute of the parser specifically "to put anything you want to and access it in any handler"), then if the <userID> tag is encountered the flag is set, and then the 'start_element' handler selects one of the filehandles passed to $parser->parse() and prints the tag and its data there.
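    The reset, set and select sequence can be mimicked in core Perl without any parser. The sketch below is only an illustration of the idea, not XML::Rules itself: @events is a hand-written stand-in for what a push-style parser would deliver, and %pad, %handlers and @output are hypothetical names.

```perl
use strict;
use warnings;

# Hand-written parse events: one record without and one with a <userID>.
my @events = (
    [ start => 'start_element' ],
    [ end   => 'child' ],
    [ end   => 'start_element' ],
    [ start => 'start_element' ],
    [ end   => 'child' ],
    [ end   => 'userID' ],
    [ end   => 'start_element' ],
);

my %pad;                    # shared scratch space, playing the role of $parser->{pad}
my @output = ( [], [] );    # index 0: no userID seen, index 1: userID seen
my $count  = 0;

my %handlers = (
    # Start of a record: reset the flag.
    'start:start_element' => sub { $pad{found_userID} = 0; $count++ },
    # A userID closed somewhere inside the record: set the flag.
    'end:userID'          => sub { $pad{found_userID} = 1 },
    # End of the record: the flag picks the destination.
    'end:start_element'   => sub { push @{ $output[ $pad{found_userID} ] }, "element $count" },
);

for my $event (@events) {
    my ($kind, $tag) = @$event;
    my $handler = $handlers{"$kind:$tag"} or next;    # no rule for this event
    $handler->();
}
```

    Note that with this indexing a record containing <userID> goes to the second destination (index 1), which is the reverse of the older script, where a match selected index 0.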

    The rest is simple: an object is created, files are opened, text is printed, the parse() method is called (which reads the XML and calls the handlers as it goes through the XML) and the closing tag is printed.
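    On the question of reading the file named in $ARGV[0]: the script above hands parse() the \*DATA filehandle itself, whereas the modified version passes <$FH3>. In list context <$FH3> reads every line, so each line becomes a separate argument and the array of output handles is no longer the second one. The core-Perl sketch below shows only that difference. count_args is a hypothetical stand-in for a parse()-like method, and the likely fix (passing $FH3 itself, or using the module's parsefile() method with the filename) is an assumption that should be checked against the XML::Rules documentation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parse()-like method: it only reports how
# many arguments it was called with.
sub count_args { return scalar @_ }

my $xml    = "<root>\n<a/>\n<b/>\n</root>\n";    # four lines of input
my @params = ('FH1', 'FH2');                     # placeholder for the output handles

# An in-memory filehandle keeps the sketch self-contained.
open my $in, '<', \$xml or die "Cannot open in-memory handle: $!";
my $as_list = count_args( <$in>, \@params );     # one argument per line, plus the params
close $in;

open $in, '<', \$xml or die "Cannot open in-memory handle: $!";
my $as_handle = count_args( $in, \@params );     # the handle itself, plus the params
close $in;

print "list context: $as_list arguments, handle: $as_handle arguments\n";
```

    It is also worth adding "or die" checks to the open calls so that a missing input file is reported instead of being silently ignored.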

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      Thanks a lot Jenda for your excellent code

      It again worked like a charm.

      Also thanks a lot for the detailed notes.

      Although I am still trying to understand it. :)

      --Parag

        XML::Rules usually does take some brain rewiring, unless you are used to callbacks, push-style parsing or something like that. Try to come back once you learn a bit more Perl and hopefully it will all fall into place in time :-)

        Jenda