Mr.Churka has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks and thanks for bearing with me on this, my first project.

I have successfully written a program that parses a large XML document using XML::Twig. The XML file contains 500 products with variable attributes. One might be a table with certain dimensions; another might be a personal organizer with a weekly calendar and pockets for a calculator, cellphone, etc. To get an idea of what variety of tables I need in my database, I wrote the following to summarize the XML.

#!/bin/perl
use strict;
use warnings;
use XML::Twig;
use Tie::IxHash;

my %Items;
my $Output_Filehandle;
tie %Items, "Tie::IxHash";

my $twig = XML::Twig->new(
    twig_handlers => {
        _all_ => sub {
            my $Item_master_Ancestory = $_->ancestors;
            my $element_match = ($_->tag);
            my $text = ($_->trimmed_text);
            my $coupled = join( ' - ' => " "x$Item_master_Ancestory,
                $element_match, keys %{$_->atts}, values %{$_->atts}, $text);
            if (!defined $Items{$coupled}) {$Items{$coupled}=1}
            else {$Items{$coupled}++;}
        },
    }
);
$twig->parsefile( '500syncItemMaster.xml'); # build it
$twig->purge; # clear end of document from memory

open(SUMMARY, ">United perl parser summary.txt");
my @k = keys %Items;
foreach my $k (@k) {
    print SUMMARY ("$k => $Items{$k}\n");
};

I used join to combine all the relevant data into a unique hash key and then set the value to the number of times that key occurs in the xml file. This gave me a nice breakdown of what unique items are in the xml. My problem is twofold.

First, some of the elements are picking up the text from their children while others don't. This is very strange to me.
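One possible contributor, offered only as an assumption about the data: in XML::Twig, `trimmed_text` returns the text of an element together with the text of its descendants, while `text_only` returns just the element's own text. Elements whose children carry text would then appear to "pick up" that text, and elements with empty children would not. The sample document and the `%text_for` hash below are hypothetical and exist only to show the difference.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample: <item> has text of its own plus a child that has text.
my $xml = '<Catalog><item>Desk <size>large</size></item></Catalog>';

my %text_for;    # tag => both flavours of text, for comparison
my $twig = XML::Twig->new(
    twig_handlers => {
        _all_ => sub {
            my $elt = $_;
            $text_for{ $elt->tag } = {
                recursive => $elt->trimmed_text,    # own text plus descendants' text
                own_only  => $elt->text_only,       # text of this element only
            };
        },
    },
);
$twig->parse($xml);

for my $tag (sort keys %text_for) {
    printf "%-8s recursive=[%s] own_only=[%s]\n",
        $tag, $text_for{$tag}{recursive}, $text_for{$tag}{own_only};
}
```

If this is what is happening in the real file, building the hash key from `text_only` rather than `trimmed_text` would keep each key limited to the element's own text.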

Second, the Tie::IxHash module didn't keep the entire XML file ordered properly. For example:

<Catalog> <item> <quantities> <prices> <sellingpoints> <item> <quantities> <prices> <sellingpoints>
becomes:

Re: Maintaining parent child order when printing summarized XML Twig
by GrandFather (Saint) on Nov 16, 2007 at 21:41 UTC

    How about providing a data sample about the size of the one you have already given, but sufficiently complete to show the problems you are seeing, along with sample code that is representative of what you are using and that we can run? The following sample may help as a starting point:

    #!/bin/perl
    use strict;
    use warnings;
    use XML::Twig;
    use Tie::IxHash;

    my %Items;
    tie %Items, "Tie::IxHash";

    my $xml = <<XML;
    <Catalog>
      <item>
        <quantities>
          <prices/>
        </quantities>
        <sellingpoints/>
      </item>
      <item>
        <quantities>
          <prices/>
        </quantities>
        <sellingpoints/>
      </item>
    </Catalog>
    XML

    my $twig = XML::Twig->new(
        twig_handlers => { _all_ => \&handler, }
    );
    $twig->parse($xml);    # build it
    $twig->purge;          # clear end of document from memory

    print "\n";
    print "$_ => $Items{$_}\n" for keys %Items;

    sub handler {
        my $Item_master_Ancestory = $_->ancestors;
        my $element_match = $_->tag;
        my $text = $_->trimmed_text;
        my $coupled = join ' - ', " " x $Item_master_Ancestory,
            $element_match, keys %{ $_->atts }, values %{ $_->atts }, $text;

        ++$Items{$coupled};
        print "$coupled: $Items{$coupled}\n";
    }

    Prints:

        - prices - : 1
       - quantities - : 1
       - sellingpoints - : 1
      - item - : 1
        - prices - : 2
       - quantities - : 2
       - sellingpoints - : 2
      - item - : 2
     - Catalog - : 1

        - prices - => 2
       - quantities - => 2
       - sellingpoints - => 2
      - item - => 2
     - Catalog - => 1
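The printed order above lists children before their parents, which is consistent with `twig_handlers` being called as each element closes. An insertion-ordered hash can only replay the order in which keys first arrive, so it reproduces that closing order rather than the document order. The following is a minimal core-Perl sketch of the counting step, assuming a hypothetical `@seen_keys` list that stands in for the keys the handler builds; `summary_lines` is likewise an illustrative name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stream of keys, in the order a close-tag handler would
# deliver them for the two-item sample (children before parents).
my @seen_keys = qw(
    prices quantities sellingpoints item
    prices quantities sellingpoints item
    Catalog
);

my %count;    # key => number of occurrences
my @order;    # keys in first-seen order

for my $key (@seen_keys) {
    push @order, $key unless exists $count{$key};    # remember first arrival
    $count{$key}++;
}

# Build the summary in first-seen order, one "key => count" line per key.
sub summary_lines {
    return map { "$_ => $count{$_}" } @order;
}

print "$_\n" for summary_lines();
```

The array-plus-hash pair plays the same role as Tie::IxHash here. Either way, the summary comes out in arrival order, so getting parents ahead of children would need the keys to be recorded when an element opens rather than when it closes.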

    Perl is environmentally friendly - it saves trees
      I'm afraid a sample that would demonstrate what I'm talking about would be too big to post here. I've narrowed it down now and believe that what's happening relates to having to flush a twig after the document is parsed in order to obtain the last element. For some reason the last Item is just one huge concatenated block of text. How big can your code get before it's just too big to post?

        Too big is "more than required to demonstrate the issue". It is very unlikely that you need a large amount of data or code to demonstrate the issue, but you may need to do a fair amount of work to focus the code and data down to a sensible minimum. In the process you are also quite likely to find the issue yourself - but that need not be a problem. ;)

