I have a massive (600 MB+) XML file that I need to process to extract some data. All the line breaks have been removed, so the file is one massive line.

I'm not what you would call extremely experienced with XML, and I watch my machine consume all available RAM (2.7 GB) before running out of memory on a pretty simple script.

#!/usr/bin/perl -w
use strict;
use XML::Twig;
use Data::Dumper;

$|++;

my $t = XML::Twig->new(
    #twig_roots   => { 'Person' => 1 },  # uncomment to dump entire XML in a hr form
    twig_handlers => { 'Person' => \&person },
    pretty_print  => 'indented',
    keep_encoding => 1,
);

$t->parsefile('./File.xml');
$t->flush;

sub person {
    my ($t, $section) = @_;
    # my $root = $section->root();  # uncomment to dump entire xml in a hr form
    my $id = $section->att('id');
    my (@firstname, @middlename, @lastname, $description);

    my @para = $section->getElementsByTagName('Name');
    foreach my $obj (@para) {
        if ($obj->att('NameType') eq 'Primary Name') {
            my $child = $obj->first_child('NameValue');
            @firstname  = $child->fields('FirstName');
            @middlename = $child->fields('MiddleName');
            @lastname   = $child->fields('Surname');
        }
    }

    my @list = $section->getElementsByTagName('Descriptions');
    foreach my $obj (@list) {
        my $child = $obj->first_child('Description');
        $description = $child->{'att'}->{'Description2'} if ($child->{'att'}->{'Description2'});
    }

    print "$id,$firstname[0],$middlename[0],$lastname[0],$description\n" if ($description);
}

If someone could provide some insight or alternatives, it would be appreciated!
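For what it's worth, one likely cause is that the handler in the script above never releases the Person elements it has already processed, so XML::Twig ends up holding the whole document in memory. Below is a minimal sketch of the usual remedy: build only the Person subtrees with twig_roots and call purge at the end of each handler call. The helper names extract_people and person_row are hypothetical, the XML layout is assumed from the element and attribute names in the script, and the sample parses a string only so that it is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: returns one CSV-style line per Person that has a
# non-empty Description2 attribute. Element and attribute names are taken
# from the script above; the document layout itself is an assumption.
sub extract_people {
    my ($xml) = @_;
    my @rows;

    my $twig = XML::Twig->new(
        # Build subtrees only for Person elements; other content is skipped.
        twig_roots => {
            Person => sub {
                my ($t, $person) = @_;
                my $row = person_row($person);
                push @rows, $row if defined $row;
                # Discard everything parsed so far, so the twig does not
                # accumulate the whole document in memory.
                $t->purge;
            },
        },
    );

    # For a file on disk, use $twig->parsefile('./File.xml') instead.
    $twig->parse($xml);
    return @rows;
}

# Hypothetical helper: turns one Person element into a line, or returns
# nothing when the element has no Description2 attribute.
sub person_row {
    my ($person) = @_;
    my ($first, $middle, $last) = ('', '', '');

    for my $name ($person->descendants('Name')) {
        next unless ($name->att('NameType') // '') eq 'Primary Name';
        my $value = $name->first_child('NameValue') or next;
        ($first, $middle, $last) =
            $value->fields('FirstName', 'MiddleName', 'Surname');
    }

    my $description;
    for my $list ($person->descendants('Descriptions')) {
        my $child = $list->first_child('Description') or next;
        my $d = $child->att('Description2');
        $description = $d if $d;
    }

    return unless $description;
    return join ',', ($person->att('id') // ''), $first, $middle, $last,
        $description;
}
```

With a 600 MB input you would probably print each line inside the handler rather than collecting rows in an array. The sketch returns them only to keep the example easy to check.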


In reply to processing massive XML files with XML::Twig by Anonymous Monk
