It looks like you are using XML::Twig in "tree mode" as opposed to "stream mode", which is what I suspected. It means that your code tries to read the entire tree into memory, instead of processing it one chunk at a time, which is what stream mode is good for.

As I understand it, there are two basic approaches to parsing a tree like this. You can first build a tree object that your program can then traverse up and down as it pleases, and manipulate like any other data structure. Alternatively, you can define handlers (aka callbacks) that the parser will invoke whenever it encounters a particular condition (e.g. when it finds a particular tag) as it parses the tree. The latter ("event-driven") approach has the advantage that the parser does not need to read the whole tree into memory; the parsing and whatever you want to do with the parsed text go hand-in-hand. The downside is that your program cannot backtrack to examine parts of the tree that have already been parsed.

I'm not very familiar with XML::Twig, but it appears to be a hybrid of these two basic approaches: it lets you install handlers that are triggered by parsing events, and it also lets your program access subtrees of the partially parsed tree. These subtrees can be manipulated as a whole, and then purged from memory. This makes it possible to keep only a small part of the tree in memory, as with a purely event-driven parser such as XML::Parser, while still manipulating entire chunks of the tree (possibly the whole tree) as you could with a tree-building parser.
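To make the difference concrete, here is a minimal sketch of the same task done both ways with XML::Twig. The sample document and its element names (catalog, item, name) are made up for illustration and have nothing to do with your data; treat it as a rough outline rather than a recommended implementation.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document, invented for this illustration.
my $xml = <<'XML';
<catalog>
  <item id="1"><name>alpha</name></item>
  <item id="2"><name>beta</name></item>
  <item id="3"><name>gamma</name></item>
</catalog>
XML

# Tree mode: parse the whole document first, then walk the finished tree.
# Memory use grows with the size of the document.
sub names_tree_mode {
    my ($doc) = @_;
    my $twig = XML::Twig->new;
    $twig->parse($doc);
    return map { $_->first_child_text('name') } $twig->root->children('item');
}

# Stream mode: a handler runs each time an <item> has been fully parsed,
# and purge() then discards what has been parsed so far.
sub names_stream_mode {
    my ($doc) = @_;
    my @names;
    my $twig = XML::Twig->new(
        twig_handlers => {
            item => sub {
                my ( $t, $item ) = @_;
                push @names, $item->first_child_text('name');
                $t->purge;
            },
        },
    );
    $twig->parse($doc);
    return @names;
}

print join( ', ', names_tree_mode($xml) ),   "\n";
print join( ', ', names_stream_mode($xml) ), "\n";
```

Both subs return the same list; the only difference is how much of the document is held in memory while they work. For a 2GB file you would call parsefile instead of parse, and only the second style is likely to be practical.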

Be that as it may, below is my attempt to re-cast your code in terms of XML::Twig handlers. See the docs for XML::Twig for more details. I could not understand the point of changing directories, so that part of the code may be messed up; I commented it out so that I could run the code. The program installs two handlers when it invokes the parser's constructor, one for Topic elements and one for ExternalPage elements. The handlers communicate via a shared variable, the %links hash.

Let me know how it goes.

#!/usr/bin/perl
use warnings;
use strict;
use XML::Twig;

# Install one handler per element type of interest (stream mode).
my $twig = XML::Twig->new(
    twig_handlers => {
        'RDF/Topic'        => \&topic,
        'RDF/ExternalPage' => \&extpage,
    }
);
$twig->parsefile('./sample.xml');

# my $base_dir = 'F:/httpserv';
# chdir $base_dir or die "Failed to chdir to $base_dir: $!\n";

{
    # Shared between the two handlers: the links of the current Topic.
    my %links;

    sub topic {
        my ( $twig, $topic ) = @_;
        if ( $topic->children('link') ) {
            # my $dir = $topic->att('r:id');
            # chdir $dir or die "Failed to chdir to $dir: $!\n";
            $links{ $_->att('r:resource') } = $_ for $topic->children('link');
        }
        else {
            %links = ();
        }
        $twig->purge;    # discard what has been parsed so far
    }

    sub extpage {
        my ( $twig, $extpage ) = @_;
        # Print only the pages that the current Topic links to.
        if ( exists $links{ $extpage->att('about') } ) {
            print $extpage->first_child_text('d:Title'),       "\n";
            print $extpage->first_child_text('d:Description'), "\n";
        }
        $twig->purge;
        # chdir $base_dir or die "Failed to chdir to $base_dir: $!";
    }
}

__END__
British Horror Films: 10 Rillington Place
Review which looks at plot especially the shocking features of it.
MMI Movie Review: 10 Rillington Place
Review includes plot, real life story behind the film and realism in the film.

the lowliest monk


In reply to Re^3: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000 by tlm
in thread Memory errors while processing 2GB XML file with XML:Twig on Windows 2000 by nan
