Putting XML::Twig to the test

dHarry has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I am doing a small proof of concept to see if XML::Twig will work for me, i.e. whether it can process large XML files in a reasonable amount of time. I use the script below to generate simple XML files for testing. I vary the value of $num_departments to obtain different sizes. A value of 1 million produces an XML file of about 688 MB.

use strict;
use warnings;

my $file_name = "dharry.xml";
my $num_departments = 1000;

open (XML_OUT_FILE, ">$file_name")
or die "Could not open $file_name\n";

    print XML_OUT_FILE "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    print XML_OUT_FILE "<Company>\n";
    for (my $i=0; $i<$num_departments; $i++) {
        print XML_OUT_FILE "  <Department>\n";
        print XML_OUT_FILE "  <Name>Bla$i</Name>\n";
        for (my $j=0; $j<5; $j++) {
            print XML_OUT_FILE "  <Person id=\"$i$j\">\n";
            print XML_OUT_FILE "    <First>John$i$j</First>\n";
            print XML_OUT_FILE "    <Last>Doe$i$j</Last>\n";
            my $phone_ext = int(rand(10000000));
            print XML_OUT_FILE "    <PhoneExt>$phone_ext</PhoneExt>\n";
            print XML_OUT_FILE "  </Person>\n";
        }        
        print XML_OUT_FILE "  </Department>\n";
    }
    print XML_OUT_FILE "</Company>\n";
    
close XML_OUT_FILE
    or die "Could not close $file_name\n";    

print "Done...\n";

I have run many different tests; this is one of those that fail. I attempt a targeted update: only one twig is changed, selected by a specific value of the id attribute on the Person element.

use strict;
use warnings;
use XML::Twig;

# Select Twig based on value of id attribute on Person element
my $twig= new XML::Twig( 
                twig_handlers =>                  
                  { 'Person[@id="50000"]' => \&Person } 
                       );  
$twig->set_pretty_print ('record'); # Human readable output please
$twig->parsefile( "dharry.xml");             
$twig->flush;

sub Person { 

    my( $twig, $person)= @_;
    
    my $name = $person->first_child("First");
    $name->set_text("dHarry");
    $twig->flush;

}

Results

$num_departments	xml file size before	xml file size after	Time usage
1000	657 KB	708 KB	seconds
10_000	6.57 MB	7.07 MB	1 minute
100_000	67.2 MB	n/a	n/a

I was a bit surprised that the program crashed. I tried different XPath expressions and reran the test. Sometimes the 67.2 MB file was processed successfully, but bigger files could not be handled. Note that the resulting XML files get a bit bigger because of the pretty_print option. Any ideas why it isn’t working? Is my code wrong?

XML::Twig v3.32 running on Windows, ActiveState Perl 5.8.8.
I have also run tests with parsefile_inplace which does not seem to make a difference.


Re: Putting XML::Twig to the test by mirod (Canon) on Aug 18, 2008 at 12:58 UTC
The problem here is that you `flush` the in-memory tree exactly twice: once when you hit the record you want to update, and once at the very end, when the file is all parsed.

In your case, what you need is to use the `twig_roots` option, which will only process the 1 record you want to update, with the (awfully named) `twig_print_outside_roots` option, that will output, untouched, everything else:

my $twig= new XML::Twig(
                twig_roots =>
                  { 'Person[@id="50000"]' => \&Person },
                twig_print_outside_roots => 1,
                       );

You don't need the last flush, at the end of the code. And I thought that you would have to replace the call to `flush` in the sub body by a call to `$person->print`, but it looks like `flush` is smart enough to do what you mean, so no change needed there.

Does this help?
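For reference, here is a minimal, self-contained sketch of the `twig_roots` / `twig_print_outside_roots` combination described above. It is an illustration under assumptions, not the code anyone in this thread ran: the helper name `update_person`, the small inline sample document, and capturing the output through an in-memory filehandle are all hypothetical choices made so the example can run on its own. It uses `$person->print` followed by `purge` in the handler, the alternative to `flush` that is mentioned above.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not from the thread): changes the <First> name of the
# one <Person> whose id matches and returns the whole document as a string.
sub update_person {
    my ($xml, $id, $new_first) = @_;

    # Capture everything XML::Twig prints in a scalar. This is an assumption
    # made for the sketch; for a big file you would select a real output
    # file (or just redirect STDOUT) instead.
    my $result = '';
    open my $out, '>', \$result or die "Cannot open in-memory handle: $!";
    my $previous = select $out;

    my $twig = XML::Twig->new(
        # Only the matching element is built into an in-memory tree.
        twig_roots => {
            qq{Person[\@id="$id"]} => sub {
                my ($t, $person) = @_;
                my $first = $person->first_child('First');
                $first->set_text($new_first) if $first;
                $person->print;   # emit the updated record
                $t->purge;        # free the memory used by that record
            },
        },
        # Everything outside the roots is printed untouched while parsing.
        twig_print_outside_roots => 1,
    );

    $twig->parse($xml);

    select $previous;
    close $out or die "Cannot close in-memory handle: $!";
    return $result;
}

# Tiny sample in the shape produced by the generator script.
my $xml = <<'XML';
<Company>
  <Department>
  <Name>Bla5000</Name>
  <Person id="50000">
    <First>John50000</First>
    <Last>Doe50000</Last>
  </Person>
  <Person id="50001">
    <First>John50001</First>
    <Last>Doe50001</Last>
  </Person>
  </Department>
</Company>
XML

print update_person($xml, '50000', 'dHarry');
```

For a real file you would call `parsefile` instead of `parse` and drop the in-memory capture. Because only the matching `Person` is ever turned into a tree, memory use should no longer grow with the size of the file.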
Re^2: Putting XML::Twig to the test by dHarry (Abbot) on Aug 18, 2008 at 13:56 UTC
Thanks for the quick response! The first tests I ran look promising. It will take some time to run them on the really big files. I am gathering some stats on processing big XML files, big meaning up to 688 MB. I already tried the `twig_roots` and `twig_print_outside_roots` options but apparently I made the same flush mistake. Cheers dHarry