andergoo has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I have an XML file with this structure:
<document>
  <product>
    <date></date>
    <price></price>
    <content>heinous amount of unwanted text</content>
    <color></color>
  </product>
  <product>
    <date></date>
    <price></price>
    <content>heinous amount of unwanted text</content>
    <color></color>
  </product>
</document>
Those <content> elements contain huge amounts of text, as much as 80 MB each. I need to truncate that text to, say, 1-3 MB; it doesn't have to be exact. So I tried XML::Twig like this:
use strict;
use warnings;
use XML::Twig;

XML::Twig->new(
    twig_roots               => { content => \&content },
    twig_print_outside_roots => 1,
    keep_spaces              => 1,
)->parsefile('ginormous.xml');
exit;

sub content {
    my ($t, $content) = @_;
    my $snipped = substr($content->text, 0, 1000000);
    $content->set_cdata($snipped);
    $t->flush;
}
It kinda worked, but took overnight. So I tried using XML::SAX to extract the one element and do the truncation, which worked great and took only a few minutes (a sketch of that kind of SAX pass is at the end of this post). So now I need to get rid of the <content> element in the original so I can plug the truncated text back into it. I thought this should work:
my $field = 'content';
my $twig = XML::Twig->new(
    twig_roots               => { $field => 1 },
    twig_print_outside_roots => 1,
    twig_handlers            => { $field => \&field },
);
$twig->parsefile('ginormous.xml');

sub field {
    my ($twig, $field) = @_;
    $field->delete;
}
but it also took overnight. How can I ignore <content> completely and just print the rest?
Thanks!
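
A minimal sketch of that kind of SAX truncation pass, assuming XML::SAX::Base, XML::SAX::ParserFactory and XML::SAX::Writer; the handler class name, the 1 MB cutoff, and the output file are illustrative, not the original code:

package TruncateContent;
use strict;
use warnings;
use base 'XML::SAX::Base';

# Pass every event through, but cap the character data inside <content>.
sub start_element {
    my ($self, $el) = @_;
    if ($el->{LocalName} eq 'content') {
        $self->{in_content} = 1;
        $self->{seen}       = 0;
    }
    $self->SUPER::start_element($el);
}

sub characters {
    my ($self, $data) = @_;
    if ($self->{in_content}) {
        my $max = 1_000_000;              # ~1 MB cutoff, adjust to taste
        return if $self->{seen} >= $max;  # past the limit: drop the chunk
        $data->{Data} = substr($data->{Data}, 0, $max - $self->{seen});
        $self->{seen} += length $data->{Data};
    }
    $self->SUPER::characters($data);
}

sub end_element {
    my ($self, $el) = @_;
    $self->{in_content} = 0 if $el->{LocalName} eq 'content';
    $self->SUPER::end_element($el);
}

package main;
use XML::SAX::ParserFactory;
use XML::SAX::Writer;

my $writer = XML::SAX::Writer->new(Output => 'trimmed.xml');
my $parser = XML::SAX::ParserFactory->parser(
    Handler => TruncateContent->new(Handler => $writer),
);
$parser->parse_uri('ginormous.xml');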

Re: Prune Twig From Huge XML File
by mirod (Canon) on Mar 16, 2009 at 20:05 UTC

    You can use the ignore_elts => { content => 1 } option to ignore content.
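
    A minimal sketch of that, combined with streaming output (the per-<product> flush is an assumption about the document shape; it prints each product as soon as it is complete):

    use strict;
    use warnings;
    use XML::Twig;

    # ignore_elts drops <content> without ever loading its text;
    # flushing each completed <product> prints it and keeps memory flat.
    XML::Twig->new(
        ignore_elts   => { content => 1 },
        twig_handlers => { product => sub { $_[0]->flush } },
        keep_spaces   => 1,
    )->parsefile('ginormous.xml');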

    I am a bit surprised by the performance you mention though, especially as my tests (a little dated, but AFAIK not much has changed since) indicated that XML::SAX was quite slow. What Perl version is this on?

      Hi, thanks for the quick response, wow! I'll try that. I had already tried $content->ignore, but it didn't seem to speed things up. I am using 5.8. Perhaps the reason SAX was so quick is that it feeds data to the handler in chunks, which I can ignore as soon as my maximum length is reached...? Dunno, I am a beginner...

        That makes sense; I'll have to look at it tomorrow. It is indeed likely that XML::Twig reads the entire content of the element (worse than that, it probably buffers it, which implies copying it a few times...) before discarding it. I'll see if I can improve this.

       ignore_elts works like a charm, thanks!
Re: Prune Twig From Huge XML File
by Jenda (Abbot) on Mar 16, 2009 at 23:52 UTC

    Assuming you already have the trimmed contents somewhere and you don't mind that the order of the child tags is not kept, this might work. The content of the <content> tag will be skipped by expat, so it should not waste memory:

    use strict;
    use XML::Rules;

    my @contents = ('first trimmed content', 'second trimmed content');

    my $parser = XML::Rules->new(
        style       => 'filter',
        start_rules => {
            content => 'skip',
        },
        rules => {
            _default => 'raw',
            product  => sub {
                my ($tag, $attr, $parser) = @_[0, 1, 4];
                $attr->{content} = [ $contents[ $parser->{pad}++ ] ];
                return $tag => $attr;
            },
        },
    );
    $parser->filter(\*DATA);

    __DATA__
    <document>
     <product>
      <date>2008-10-15</date>
      <price>124</price>
      <content>heinous amount of unwanted text</content>
      <color>red</color>
     </product>
     <product>
      <date>2009/01/30</date>
      <price>10</price>
      <content>heinous amount of unwanted text</content>
      <color>black</color>
     </product>
    </document>

    Or better formatted, but with an even less defined order of the child tags of <product>, and ¡assuming all those child tags have no attributes or children!:

    use strict;
    use XML::Rules;

    my @contents = ('first trimmed content', 'second trimmed content');

    my $parser = XML::Rules->new(
        style       => 'filter',
        ident       => ' ',
        stripspaces => 3,
        start_rules => {
            content => 'skip',
        },
        rules => {
            _default => 'content array',
            product  => sub {
                my ($tag, $attr, $parser) = @_[0, 1, 4];
                $attr->{content} = [ $contents[ $parser->{pad}++ ] ];
                return $tag => $attr;
            },
        },
    );
    $parser->filter(\*DATA);

    __DATA__
    <document>
     <product>
      <date>2008-10-15</date>
      <price>124</price>
      <content>heinous amount of unwanted text</content>
      <color>red</color>
     </product>
     <product>
      <date>2009/01/30</date>
      <price>10</price>
      <content>heinous amount of unwanted text</content>
      <color>black</color>
     </product>
    </document>

    If the order is important, you'd have to take the first version and tweak the handler for the <product> tag to insert the content at the right place in the array @{$attr->{_content}}, along these lines:
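
    For instance (only a sketch; the position right after <price> is an assumption about where <content> originally sat):

    product => sub {
        my ($tag, $attr, $parser) = @_[0, 1, 4];
        my @kids = @{ $attr->{_content} };   # raw child items, in document order
        for my $i (0 .. $#kids) {
            next unless ref $kids[$i] && $kids[$i][0] eq 'price';
            # splice a raw [tag, attributes] item back in right after <price>
            splice @kids, $i + 1, 0,
                [ 'content', { _content => $contents[ $parser->{pad}++ ] } ];
            last;
        }
        $attr->{_content} = \@kids;
        return $tag => $attr;
    },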

      This is great stuff, thanks!
      I haven't used XML::Rules 'til now, looks like it's worth having a look at.
Re: Prune Twig From Huge XML File
by mirod (Canon) on Mar 17, 2009 at 09:50 UTC

    I checked, and when you use ignore_elts, the data in the ignored element is never loaded, so there is no reason why the code shouldn't be fast.

    Indeed, the following code takes 0.2s on my (rather slow) machine to prune a 200 MB document containing 20 content elements, each holding a 10 MB CDATA section:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    XML::Twig->new(
        ignore_elts   => { content => 1 },
        twig_handlers => { _default_ => sub { $_->flush } },
        keep_spaces   => 1,
    )->parsefile('doc_with_big_content.xml');

    Now I have to see if I can improve the "snipping" part. Maybe by giving the option to not buffer the entire text for each element. How big is your file BTW?

      Yes, ignore_elts works perfectly and is very fast at removing the content. I was stupidly trying to use $content->delete in a handler.

      My file is ~250MB, the biggest content chunk is 75MB.