processing massive XML files with XML::Twig

by Anonymous Monk
on Dec 05, 2008 at 04:14 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a massive (600 MB+) XML file I need to process to extract some data from. All the line breaks have been removed and the file is one massive line.

I'm not what you would call extremely experienced with XML, and I see my machine consume all available RAM (2.7 GB) before running out of memory on a pretty simple script.

#!/usr/bin/perl -w
use strict;
use XML::Twig;
use Data::Dumper;

$|++;

my $t = XML::Twig->new(
    #twig_roots   => { 'Person' => 1 },   # uncomment to dump entire XML in a hr form
    twig_handlers => { 'Person' => \&person },
    pretty_print  => 'indented',
    keep_encoding => 1,
);
$t->parsefile('./File.xml');
$t->flush;

sub person {
    my ($t, $section) = @_;
    # my $root = $section->root();   # uncomment to dump entire xml in a hr form
    my $id = $section->att('id');
    my (@firstname, @middlename, @lastname, $description);

    my @para = $section->getElementsByTagName('Name');
    foreach my $obj (@para) {
        if ($obj->att('NameType') eq 'Primary Name') {
            my $child   = $obj->first_child('NameValue');
            @firstname  = $child->fields('FirstName');
            @middlename = $child->fields('MiddleName');
            @lastname   = $child->fields('Surname');
        }
    }

    my @list = $section->getElementsByTagName('Descriptions');
    foreach my $obj (@list) {
        my $child = $obj->first_child('Description');
        $description = $child->{'att'}->{'Description2'}
            if ($child->{'att'}->{'Description2'});
    }

    print "$id,$firstname[0],$middlename[0],$lastname[0],$description\n"
        if ($description);
}

If someone could provide some insight or alternative(s), it would be appreciated!

Replies are listed 'Best First'.
Re: processing massive XML files with XML::Twig
by GrandFather (Saint) on Dec 05, 2008 at 04:31 UTC

    Call $t->purge(); or $t->flush(); at the end of sub person to free up the memory associated with the twigs you've already processed.
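
    For illustration, a minimal sketch of where that purge() call would go in a handler like the one above (the 'Person' element, 'id' attribute, and file name are carried over from the original post; this has not been run against the poster's data):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use XML::Twig;

        my $t = XML::Twig->new(
            twig_handlers => { 'Person' => \&person },
        );
        $t->parsefile('./File.xml');

        sub person {
            my ($t, $section) = @_;
            print $section->att('id'), "\n";
            $t->purge;    # release the parts of the tree already processed
        }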


    Perl's payment curve coincides with its learning curve.
      Thanks for your help! I guess when one is in a rush, important parts of the documentation get missed. ;-(
Re: processing massive XML files with XML::Twig
by cutlass2006 (Pilgrim) on Dec 05, 2008 at 06:52 UTC
    XML::Twig is a great way to go, though you may find better performance if you refactor your approach to use SAX via the CPAN XML::SAX module.

    Update: Hmmm, after running some benchmarks myself... I stand corrected... I seem to have been passing on this daft knowledge for too long; thank you to mirod for opening my eyes... XML::Twig is indeed faster in a lot of situations, and I think you are going down the right route, apart perhaps from considering another tool outside of Perl.

      Did you try? I mean, did you compare the performance of XML::Twig and XML::SAX? Because I did, for a simple benchmark. Look at the last table.

      SAX is convenient because, with modules like SAX::Machines, it allows you to create pipelines of SAX filters and plug in dumps... It is IMHO a pain to use. It is also demonstrably slow, at least in Perl.

      Sorry, you hit one of my pet peeves ;--)

      If you want better performance than XML::Twig, you can use XML::LibXML. The API is different (pure DOM + XPath, with fewer convenience methods than XML::Twig), and it is more difficult to process big files (but XML::LibXML uses less memory than XML::Twig, so you are more likely to be able to load the entire XML into memory).
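
      As a rough sketch of that DOM + XPath style (the file and element names are assumptions borrowed from the original post, not from any benchmark in this thread), you load the whole document and then query it with XPath; with a 600 MB file this still needs plenty of memory, just less than a full XML::Twig tree:

          use strict;
          use warnings;
          use XML::LibXML;

          # Parse the whole file into a DOM, then query it with XPath.
          my $doc = XML::LibXML->load_xml(location => './File.xml');
          for my $person ($doc->findnodes('//Person')) {
              print $person->getAttribute('id'), "\n";
          }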

        Actually, XML::LibXML now has a pull parser (XML::LibXML::Reader) that doesn't read the entire DOM into memory. It's much faster than XML::Twig; I've used it successfully.
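
        A minimal sketch of that pull-parsing style, untested against the poster's data ('Person', 'id', and the file name are assumptions carried over from the original post):

            use strict;
            use warnings;
            use XML::LibXML::Reader;

            my $reader = XML::LibXML::Reader->new(location => './File.xml')
                or die "cannot open File.xml\n";

            # Jump from one <Person> element to the next without building
            # a tree for the rest of the document.
            while ($reader->nextElement('Person')) {
                # Copy just this element and its subtree into a small DOM fragment.
                my $person = $reader->copyCurrentNode(1);
                my $id     = $person->getAttribute('id');
                print defined $id ? $id : '(no id)', "\n";
            }
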
Re: processing massive XML files with XML::Twig
by Jenda (Abbot) on Dec 05, 2008 at 23:48 UTC

    If you want alternatives, you should show us an example of the XML and the data you want to extract.

    BTW, @firstname = $child->fields('FirstName'); doesn't look right at all. The @firstname will only ever contain one value, the one from the last <Name> tag.
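
    For instance, if the intent is to collect a value from every matching <Name> element rather than keep only the last one, something along these lines (a sketch only, since we haven't seen the XML) would accumulate them:

        my @firstname;
        foreach my $obj ($section->getElementsByTagName('Name')) {
            next unless $obj->att('NameType') eq 'Primary Name';
            my $child = $obj->first_child('NameValue');
            # push adds to the list instead of overwriting it on each iteration
            push @firstname, $child->field('FirstName');
        }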
