Hello monks,
I'm having difficulty with an XML file. The file is huge (140,000 lines or so), and I need a summary of it so I can design a database to fit the data. Basically, I'd like to build a hash whose keys are the names of every element in the file and whose values are the number of times each element occurs.
I've been trying to use XML::Twig to extract the data and keep hitting a snag somewhere. I've read through the Camel book and the Twig tutorials, but this is the first Perl program I've ever written, so I could really use some pointers. Here's what I've got so far:
use XML::Twig;
my $twig=XML::Twig->new(
    twig_handlers =>
      { title    => sub { $_->set_gi( 'h2') },   # tags to h2
        para     => sub { $_->set_gi( 'p') },    # para to p
        populate => sub { while (<>)
                            { if (%Items !~ m/"<us:"|"<oa:"(.*)/) { $Items{$1} =1}
                              else {$Items{$2} =($Items{$1}+(/$1/))
                              }
                              #If element is not in the hash, adds it
                              #If element is in the hash, adds the number of matches to the value
                            };
                        },
        hidden   => sub { $_->delete; },         # remove hidden elements
        list     => \&my_list_process,           # process list elements
        div      => sub { $_[0]->purge; },       # free memory
      },
);
$twig->parsefile( 'bigXMLfile.xml');   # build it
print %Items;                          # output the twig
$twig->purge;                          # clear end of document from memory
I don't think the program is calling the handlers properly when the file is parsed. The strangest part is that it keeps returning a value of 1. Did I somehow put the populate sub into a Boolean context? What am I doing wrong? Is there an easier way to get what I'm after?
Thanks!
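For reference, the goal described above (a hash of element names mapped to occurrence counts) could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a definitive solution: the inline sample document and its element names are made up so the example runs on its own, and it relies on XML::Twig's special `_all_` handler, which is triggered for every element, together with the `gi` and `purge` methods.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical stand-in for bigXMLfile.xml; the element names are made up.
my $xml = <<'XML';
<us:root xmlns:us="urn:example:us" xmlns:oa="urn:example:oa">
  <oa:item><oa:name>a</oa:name></oa:item>
  <oa:item><oa:name>b</oa:name></oa:item>
  <us:note>c</us:note>
</us:root>
XML

my %Items;    # element name => number of occurrences

my $twig = XML::Twig->new(
    twig_handlers => {
        # _all_ is called for every element once it has been fully parsed
        _all_ => sub {
            my ( $t, $elt ) = @_;
            $Items{ $elt->gi }++;    # gi returns the element's tag name
            $t->purge;               # free the parts of the tree parsed so far
        },
    },
);

# For the real file this would be: $twig->parsefile('bigXMLfile.xml');
$twig->parse($xml);

# Print one "name count" line per distinct element
printf "%-10s %d\n", $_, $Items{$_} for sort keys %Items;
```

The counting is done by the parser's handler rather than by matching tags with a regex, so the handler never needs to read the file itself. Because XML::Twig leaves namespace prefixes in place by default, names such as `oa:item` appear as keys with their prefix.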