Using XML Twig to summarize a large file

Mr.Churka has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I'm having difficulty with an XML file. The file is huge (140,000 lines or so) and I need to get a summary of it to build a database to fit the data. Basically, I'd like to get a Hash with keys of every element in the file mapped to values of how many times they occur. I've been trying to use Twig to extract the data and keep hitting a snag somewhere. I've read through the Camel book and the Twig tutorials, but this is the first Perl program I've ever written so I could really use some pointers. Here's what I've got so far.

use XML::Twig;

my $twig=XML::Twig->new(   
    twig_handlers => 
      { title   => sub { $_->set_gi( 'h2') }, #  tags to h2
        para    => sub { $_->set_gi( 'p')  }, #  para to p
        populate=> sub { while (<>)
{ if (%Items !~ m/"<us:"|"<oa:"(.*)/) { $Items{$1} =1}  
else {$Items{$2} =($Items{$1}+(/$1/))
}

#If element is not in the hash, adds it 
#If element is in the hash, adds the number of matches to the value 
                         };
                        },           
    hidden  => sub { $_->delete;       }, # remove hidden elements
    list    => \&my_list_process,         # process list elements
    div     => sub { $_[0]->purge;     }, # free memory
      },
      );
                    
$twig->parsefile( 'bigXMLfile.xml'); # build it
print %Items;                        # output the twig
$twig->purge;                  # clear end of document from memory

I don't think that the program is calling the handlers properly when the file is parsed. The strangest part is that it keeps returning a value of 1. Did I somehow put the populate sub into a Boolean context? What am I doing wrong? Is there an easier way to get what I'm after? Thanks!


Replies are listed 'Best First'.
Re: Using XML Twig to summarize a large file by GrandFather (Saint) on Nov 06, 2007 at 22:16 UTC
For a start you really, really, really need to use strictures (strict, warnings). `%Items !~ m/"<us:"|"<oa:"(.*)/` is bogus: how do you apply a regex match to a hash? What do you expect `while (<>)` to do? It would help a lot if you provided just enough sample data to demonstrate the issue, together with the output you get for that sample and the output you expect.
Perl is environmentally friendly - it saves trees
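To make the two points above concrete, here is a minimal sketch of counting names in a hash with strictures on. The sample lines are hypothetical stand-ins rather than data from the real file, and the regex is only there to show a match bound to a string instead of a hash. For real XML a parser is the safer route.

```perl
use strict;
use warnings;

# Hypothetical sample lines; not taken from the poster's file.
my @lines = (
    '<oi:ItemID agencyRole="Product_Number">1</oi:ItemID>',
    '<oi:ItemID agencyRole="Stock_Number">2</oi:ItemID>',
    '<is:Classification type="Group"></is:Classification>',
);

my %Items;    # strict requires the hash to be declared

for my $line (@lines) {
    # Bind the regex to the string, then use the capture as a hash key.
    # Closing tags start with "</", so they do not match here.
    while ( $line =~ /<([\w:]+)[\s>]/g ) {
        $Items{$1}++;    # a missing key starts at 0, so no if/else is needed
    }
}

print "$_ => $Items{$_}\n" for sort keys %Items;
```

The `++` on a hash element covers both "not seen yet" and "seen before", which is what the if/else in the original handler was trying to do.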
Re: Using XML Twig to summarize a large file by Skeeve (Parson) on Nov 06, 2007 at 21:54 UTC
Your code looks to me like you've copied a big portion from perldoc XML::Twig without understanding it. I don't believe those "title" and "para" handlers are needed for your specific XML. I'm also more than 99.99% sure that XML::Twig will do what you need. Just show us a snippet of the XML in question and we might be able to help you. This code is, at least for me, in no way helpful.
Re^2: Using XML Twig to summarize a large file by Anonymous Monk on Nov 07, 2007 at 14:58 UTC
You are correct. I cut and pasted and then entered the populate sub. It is my understanding that Twig sets up handlers that are called for each element in the XML when you go to parse the file.

The XML I'm dealing with is structured in a highly peculiar way. There's a brief header with information that is irrelevant to what I'm using the data for. All remaining data is under a parent titled "Data." Under that parent are roughly 500 children, each of which is a product with roughly 300 properties set up as children of its own. The problem for me is that those properties aren't uniform. 60 items may have a listing for "number of pages" while others will have "number of tracks." Each item is massive, so here's a brief snippet.

<is:ItemMaster>
  <is:ItemMasterHeader>
    <oi:ItemID agencyRole="Product_Number">some_number</oi:ItemID>
    <oi:ItemID agencyRole="Prefix_Number">some_number</oi:ItemID>
    <oi:ItemID agencyRole="Stock_Number">some_number</oi:ItemID>
    <oi:ManufacturerItemID>some_manufacturer_ID</oi:ManufacturerID>
    <is:Classification type="Group"></is:Classification>
    <is:Classification></is:Classification>

Each of these ItemMasters has around eight children and the children have anywhere from one to twenty-four children. Because the children are not uniform this is giving me headaches. Here's my first revision:

#!/bin/perl
use XML::Twig;
%Items=();
my $twig=XML::Twig->new(
    twig_handlers =>
      {populate=> sub { while (<>)
{ if (%Items !~ m/"<us:"|"<oa:"(.*)/) { $Items{$1} =1}
else {$Items{$2} =($Items{$1}+(/$1/))
}
};        #If element is not in the hash, adds it
},        #If element is in the hash, adds the number of matches
    div     => sub { $_[0]->purge;     }, # free memory
      },
      );
$twig->parsefile( '500syncItemMaster.xml'); # build it
$twig->purge;                  # clear end of document from memory
print %Items;                        # output the twig

Now when I print I get nothing. I tried a test run and it seems like the handlers are not getting called at all.
Re^3: Using XML Twig to summarize a large file by mirod (Canon) on Nov 07, 2007 at 16:10 UTC
A handler is called when the associated expression triggers it, so what you wrote triggers a handler on every `populate` element. I don't see any element by that name in the XML, so the handler will not be called. Is there anything wrong with the `pyx` code I posted below? Or any specific reason why you would want to use XML::Twig despite it not being the most suited for the task?
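For readers who do want to stay with XML::Twig, here is a minimal sketch of a handler that fires on every element instead of on one named `populate`. It relies on the special `_all_` handler key. The inline document and its namespace URIs are hypothetical, loosely modelled on the snippet earlier in the thread, and this is not necessarily the approach suggested in the replies that are missing here.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical miniature document; the namespace URIs are made up.
my $xml = <<'XML';
<Data xmlns:is="urn:example:is" xmlns:oi="urn:example:oi">
  <is:ItemMaster>
    <is:ItemMasterHeader>
      <oi:ItemID agencyRole="Product_Number">1</oi:ItemID>
      <oi:ItemID agencyRole="Stock_Number">2</oi:ItemID>
      <is:Classification type="Group"></is:Classification>
    </is:ItemMasterHeader>
  </is:ItemMaster>
</Data>
XML

my %count;

my $twig = XML::Twig->new(
    twig_handlers => {
        # _all_ is triggered for every element once it is fully parsed
        _all_ => sub {
            my ( $t, $elt ) = @_;
            $count{ $elt->tag }++;    # tag keeps the prefix, e.g. oi:ItemID
            $t->purge;                # discard what has been parsed so far
            return 1;
        },
    },
);

$twig->parse($xml);

print "$_ used $count{$_} time(s)\n" for sort keys %count;
```

For the real file, `$twig->parsefile('bigXMLfile.xml')` would replace the `parse` call. Counting before purging means the totals do not depend on anything that has already been discarded.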
Re^4: Using XML Twig to summarize a large file by Mr.Churka (Sexton) on Nov 07, 2007 at 17:03 UTC
Re^5: Using XML Twig to summarize a large file by mirod (Canon) on Nov 07, 2007 at 17:40 UTC
Re: Using XML Twig to summarize a large file by mirod (Canon) on Nov 07, 2007 at 08:16 UTC
I really can't make sense of your code, but if you want to output the list of used elements and how many times they were used, then you can install XML::PYX and use the following one-liner: `pyx bigXMLfile.xml | perl -n -e '$nb{$1}++ if( m/\A\((.*)\n/); END { map { print "$_ used $nb{$_} time(s)\n";} sort keys %nb;}'` There are of course ways to do this using XML::Twig, but as incredible as it might seem, every now and then it is not the easiest module to use. ;--)
Re: Using XML Twig to summarize a large file by weismat (Friar) on Nov 06, 2007 at 18:42 UTC
From my point of view a SAX parser might fit your needs better, as you do not really need to work with a tree. Unfortunately I have no experience with Perl's SAX parsing modules. I saw a nice tutorial in "XML for Perl developers, Part 2: Advanced XML parsing techniques using Perl".
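As a rough illustration of the event-driven idea, here is a sketch that counts start tags without building a tree. Note that it swaps in XML::Parser (the expat-based module that XML::Twig is built on) rather than a true SAX module, and that the inline document is a hypothetical stand-in for the real file.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical miniature document; not taken from the poster's file.
my $xml = <<'XML';
<Data>
  <Item><Pages>10</Pages></Item>
  <Item><Tracks>12</Tracks></Item>
</Data>
XML

my %count;

my $parser = XML::Parser->new(
    Handlers => {
        # Called once per opening tag; no tree is kept in memory.
        Start => sub {
            my ( $expat, $element, %attrs ) = @_;
            $count{$element}++;
        },
    },
);

$parser->parse($xml);

print "$_ used $count{$_} time(s)\n" for sort keys %count;
```

With a file on disk, `$parser->parsefile('bigXMLfile.xml')` would take the place of `parse`. Because only the running counts are stored, memory use does not grow with the size of the document.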