dHarry has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I am doing a small proof of concept to see if XML::Twig will work for me, i.e. whether it can process large XML files in a reasonable amount of time. I use the script below to generate simple XML files for testing. I vary the value of $num_departments to obtain different sizes. A value of 1 million produces an XML file of about 688 MB.

    use strict;
    use warnings;

    my $file_name = "dharry.xml";
    my $num_departments = 1000;

    open (XML_OUT_FILE, ">$file_name") or die "Could not open $file_name\n";
    print XML_OUT_FILE "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    print XML_OUT_FILE "<Company>\n";
    for (my $i=0; $i<$num_departments; $i++) {
        print XML_OUT_FILE " <Department>\n";
        print XML_OUT_FILE " <Name>Bla$i</Name>\n";
        for (my $j=0; $j<5; $j++) {
            print XML_OUT_FILE " <Person id=\"$i$j\">\n";
            print XML_OUT_FILE " <First>John$i$j</First>\n";
            print XML_OUT_FILE " <Last>Doe$i$j</Last>\n";
            my $phone_ext = int(rand(10000000));
            print XML_OUT_FILE " <PhoneExt>$phone_ext</PhoneExt>\n";
            print XML_OUT_FILE " </Person>\n";
        }
        print XML_OUT_FILE " </Department>\n";
    }
    print XML_OUT_FILE "</Company>\n";
    close XML_OUT_FILE or die "Could not close $file_name\n";
    print "Done...\n";

I have run many different tests; this is one of those that fail. I attempt to do a smart update: one twig only, selected by a specific value of the id attribute on the Person element.

    use strict;
    use warnings;
    use XML::Twig;

    # Select Twig based on value of id attribute on Person element
    my $twig= new XML::Twig( twig_handlers => { 'Person[@id="50000"]' => \&Person } );
    $twig->set_pretty_print ('record'); # Human readable output please
    $twig->parsefile( "dharry.xml");
    $twig->flush;

    sub Person {
        my( $twig, $person)= @_;
        my $name = $person->first_child("First");
        $name->set_text("dHarry");
        $twig->flush;
    }

Results

$num_departments   XML file size before   XML file size after   Time usage
1000               657 KB                 708 KB                seconds
10_000             6.57 MB                7.07 MB               1 minute
100_000            67.2 MB                n/a                   n/a

I was a bit surprised that the program crashed. I tried different XPath expressions and reran the test. Sometimes the 67.2 MB file was processed successfully, but bigger files could not be handled. Note that the resulting XML files get a bit bigger because of the pretty_print option. Any ideas why it isn't working? Is my code wrong?

NB
  1. XML::Twig v3.32 running on Windows, ActiveState Perl 5.8.8.
  2. I have also run tests with parsefile_inplace which does not seem to make a difference.

Replies are listed 'Best First'.
Re: Putting XML::Twig to the test
by mirod (Canon) on Aug 18, 2008 at 12:58 UTC

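    The problem here is that you flush the in-memory tree exactly twice: once when you hit the record you want to update, and once at the very end, when the file is all parsed. Everything parsed between those two points is kept in memory as a tree, so for a large file most of the document ends up in memory at once.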
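    To illustrate the point about flushing, here is a minimal sketch of the general remedy of flushing more often. This is not what is recommended below for this particular case; it only shows the effect of flush. It assumes a handler on Department is acceptable as the flush point, and it uses a tiny inline document and an in-memory filehandle ($xml, $output and $out are illustrative names, not part of the original script).

```perl
use strict;
use warnings;
use XML::Twig;

# Tiny stand-in for dharry.xml (assumption: same element layout as the generator).
my $xml = join '',
    qq{<?xml version="1.0" encoding="UTF-8"?>\n},
    qq{<Company>\n},
    qq{ <Department>\n},
    qq{  <Name>Bla0</Name>\n},
    qq{  <Person id="00"><First>John00</First><Last>Doe00</Last></Person>\n},
    qq{  <Person id="01"><First>John01</First><Last>Doe01</Last></Person>\n},
    qq{ </Department>\n},
    qq{</Company>\n};

# Collect the output in a string so it can be inspected afterwards.
my $output = '';
open my $out, '>', \$output or die "Could not open in-memory handle: $!\n";

my $twig = XML::Twig->new(
    twig_handlers => {
        # Update the one record of interest.
        'Person[@id="01"]' => sub {
            my ($t, $person) = @_;
            $person->first_child('First')->set_text('dHarry');
        },
        # Flush after every Department so at most one Department
        # is held in memory at a time.
        'Department' => sub {
            my ($t, $department) = @_;
            $t->flush($out);
        },
    },
);

$twig->parse($xml);
$twig->flush($out);    # write whatever is left (the closing </Company>)
close $out or die "Could not close in-memory handle: $!\n";
```

    With this arrangement memory use is bounded by the size of one Department rather than by the size of the file, at the cost of still building a tree for every Department.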

    In your case, what you need is the twig_roots option, which builds a twig only for the one record you want to update, combined with the (awfully named) twig_print_outside_roots option, which outputs everything else untouched:

        my $twig= new XML::Twig(
            twig_roots => { 'Person[@id="50000"]' => \&Person },
            twig_print_outside_roots => 1,
        );

    You don't need the last flush, at the end of the code. And I thought that you would have to replace the call to flush in the sub body by a call to $person->print, but it looks like flush is smart enough to do what you mean, so no change needed there.
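    Put together, the approach above might look like the following self-contained sketch. It makes a few assumptions that are not in the original post: a tiny inline document instead of dharry.xml, an id of "01" instead of "50000", and a filehandle passed to twig_print_outside_roots (which the XML::Twig documentation describes as an alternative to a plain true value) so that the result can be captured in $output. The handler uses $person->print rather than flush, because the output here goes to a filehandle rather than STDOUT.

```perl
use strict;
use warnings;
use XML::Twig;

# Tiny stand-in for dharry.xml (assumption: same element layout as the generator).
my $xml = join '',
    qq{<?xml version="1.0" encoding="UTF-8"?>\n},
    qq{<Company>\n},
    qq{ <Department>\n},
    qq{  <Name>Bla0</Name>\n},
    qq{  <Person id="00"><First>John00</First><Last>Doe00</Last></Person>\n},
    qq{  <Person id="01"><First>John01</First><Last>Doe01</Last></Person>\n},
    qq{ </Department>\n},
    qq{</Company>\n};

# Collect the output in a string so it can be inspected afterwards.
my $output = '';
open my $out, '>', \$output or die "Could not open in-memory handle: $!\n";

my $twig = XML::Twig->new(
    # Only elements matching this expression are built into a twig.
    twig_roots               => { 'Person[@id="01"]' => \&Person },
    # Everything outside the roots is copied to $out as it is parsed.
    twig_print_outside_roots => $out,
);

$twig->parse($xml);    # no final flush needed
close $out or die "Could not close in-memory handle: $!\n";

sub Person {
    my ($twig, $person) = @_;
    $person->first_child('First')->set_text('dHarry');
    $person->print($out);    # write the updated record in place
}
```

    For a real file, parsefile("dharry.xml") would replace parse($xml), and $out would be a handle on the output file.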

    Does this help?

      Thanks for the quick response! The first tests I ran look promising. It will take some time to run them on the really big files. I am gathering some stats on processing big XML files, big meaning up to 688 MB. I had already tried the twig_roots and twig_print_outside_roots options, but apparently I made the same flush mistake.

      Cheers
      dHarry