Catharsis has asked for the wisdom of the Perl Monks concerning the following question:

I'm using XML::Twig to process a rather large XML document. Have a look at the example below (moved from the scratchpad, on the advice below). It seems to segfault no matter where or how I force it to exit the loop.

The trouble is that I HAVE to process the archive tag first, because it has some information that is required for the rest of the code to work (shortened for the example).

The alternative is, of course, to change the XML to:

<archive>
    <important_tag info="here"/>
    <directory path="foo/"/>
    <directory path="foo/bar/"/>
    <directory path="foo/wibble/"/>
</archive>

Code follows....

#!/usr/bin/perl

use XML::Twig;

my $twig = XML::Twig->new( TwigHandlers => {
    archive => \&process_archive,
} );

# parenthesised so that "or die" applies to open(), not to the filename string
open( my $fh, '<', 'new.xml' ) or die "arrrrrrrrrrgh: $!";
read( $fh, my $data, -s 'new.xml' );
close $fh;

# doesn't work; note the file is 17MB of XML
#$twig->parse( $data );

# works
$twig->parse( \*DATA );

sub process_archive {
    my ($t, $elt) = @_;
   
    ## grab and process attributes

    my @children = $elt->children;
    my $count = 0;

    foreach ( @children ) {
        $count++;
        print "Processing Directory ".$count." of ".scalar(@children)."\n";
        last if $count == 2;    # numeric comparison
        
    }

}

__DATA__
<archive info="here">
    <directory path="foo/"/>
    <directory path="foo/bar/"/>
    <directory path="foo/wibble/"/>
</archive>

But it only does it with large docs; the small doc in the example works fine!

Some version info, as requested. I know there is a newer version of XML::Twig out there, but I've yet to try it.

OS: Linux - Fedora Core
Architecture : i386
Perl version : 5.8.1
XML::Twig ver : 3.17
XML::Parser ver : the one shipped with 5.8.1

Replies are listed 'Best First'.
Re: XML::Twig segfaulting on large docs
by Tanktalus (Canon) on Sep 27, 2005 at 17:13 UTC

    First off, pointing a node at your scratchpad is considered poor form around here because anyone doing a Super Search later to find your node may not be able to figure out what you're doing if you changed your scratchpad.

    Second, use [pad://Catharsis], e.g., Catharsis's scratchpad, to link to your scratchpad - that way you don't log anyone out.

    Finally, what version of perl, XML::Twig, and XML::Parser do you have installed/are you using? I found that by upgrading from 5.6 to 5.8.1, I got rid of all my XML::Twig crashes. I've also found that 5.8.7 fixes some other crashes I've been experiencing compared to 5.8.6. So it does pay here to upgrade to the latest possible stable version.
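
    A quick way to collect those version numbers is a few lines of Perl. This is only a sketch: it assumes both modules are installed, and it reports whichever copies perl finds first in @INC, which may not be the ones you expect.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
use XML::Parser;

# Build the report as a string so it can be pasted into a post.
my $report = sprintf "Perl        : %vd\n", $^V;
$report .= "XML::Twig   : $XML::Twig::VERSION\n";
$report .= "XML::Parser : $XML::Parser::VERSION\n";

print $report;
```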

Re: XML::Twig segfaulting on large docs
by mirod (Canon) on Sep 27, 2005 at 19:08 UTC

    What is your configuration (Perl, OS, architecture)? A 17MB XML file should only take around 170MB once loaded in memory. It seems, though, that on a 64-bit architecture it could take a lot more. I can't tell, as I don't have a 64-bit machine around.

    That said, XML::Twig was designed just for this type of situation, to avoid having to load the entire document.

    From your example (as Tanktalus mentioned, it would be better to put it in your node, so that at least I could look at it while answering), you do have access to the attributes of the enclosing tag from within the nested elements. In a handler on the directory element, $directory->parent->att('info') is available. Alternatively, you can use the start_tag_handlers option to grab the info from the enclosing tag and do something with it, without having to wait for the entire element to be parsed.
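
    Both approaches can be sketched roughly as follows. This is a minimal illustration, not tested against the 17MB file: the inline document is the small sample from the question, the variable names ($info, @seen) are made up for the example, and the purge call is an assumption about what you want, namely to free each directory element once it has been handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Small stand-in for the real document (same shape as the sample in the question).
my $xml = <<'XML';
<archive info="here">
    <directory path="foo/"/>
    <directory path="foo/bar/"/>
    <directory path="foo/wibble/"/>
</archive>
XML

my $info;    # set as soon as the <archive> start tag has been parsed
my @seen;    # one entry per processed <directory>

my $twig = XML::Twig->new(
    start_tag_handlers => {
        # Called when the opening tag is seen: attributes are available,
        # but the element has no children yet.
        archive => sub {
            my ( $t, $archive ) = @_;
            $info = $archive->att('info');
        },
    },
    twig_handlers => {
        # Called once per completely parsed <directory> element.
        directory => sub {
            my ( $t, $directory ) = @_;
            my $parent_info = $directory->parent->att('info');
            push @seen, $directory->att('path') . " (info=$parent_info)";
            $t->purge;    # discard what has been fully parsed so far
        },
    },
);

$twig->parse($xml);

print "archive info: $info\n";
print "$_\n" for @seen;
```

    For the real document you would call $twig->parsefile('new.xml') instead of parse($xml), so the file never has to be read into a string first.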

      That's what I was thinking...

      On Tanktalus's advice (sorry, it's my first post) I have moved the code into the post and added some more info.

      I was thinking about going the whole start_tag_handlers route. I've been put off because they (as in the company I work for) don't deem the problem critical, and I'm not sure I have the time to play with it.

      I kinda have a free day today so I'll give that a go in a small script.