andergoo has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I have an XML file with this structure:
<document>
  <product>
    <date></date>
    <price></price>
    <content>heinous amount of unwanted text</content>
    <color></color>
  </product>
  <product>
    <date></date>
    <price></price>
    <content>heinous amount of unwanted text</content>
    <color></color>
  </product>
</document>
Those <content> elements contain huge amounts of text, as much as 80 MB each. I need to truncate that text to, say, 1-3 MB; it doesn't have to be exact. So I tried XML::Twig like this:
use strict;
use warnings;
use XML::Twig;

XML::Twig->new(
    twig_roots               => { content => \&content },
    twig_print_outside_roots => 1,
    keep_spaces              => 1,
)->parsefile('ginormous.xml');
exit;

sub content {
    my ($t, $content) = @_;
    my $snipped = substr($content->text, 0, 1000000);
    $content->set_cdata($snipped);
    $t->flush;
}
It kinda worked, but took overnight. So I tried using XML::SAX to extract the one element and do the truncation, which worked great and took only a few minutes (a sketch of that kind of SAX pass is at the end of this post). So now I need to get rid of the <content> element in the original so I can plug the truncated text back into it. I thought this should work:
my $field = 'content';
my $twig = XML::Twig->new(
    twig_roots               => { $field => 1 },
    twig_print_outside_roots => 1,
    twig_handlers            => { $field => \&field },
);
$twig->parsefile('ginormous.xml');

sub field {
    my ($twig, $field) = @_;
    $field->delete;
}
but it also took overnight. How can I ignore <content> completely and just print the rest?
Thanks!
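
A minimal sketch of that kind of SAX truncation pass, assuming XML::SAX::Base, XML::SAX::ParserFactory and XML::SAX::Writer; the handler class name, the 1 MB cutoff, and the output file are illustrative, not the original code:

package TruncateContent;
use strict;
use warnings;
use base 'XML::SAX::Base';

# Pass every event through, but cap the character data inside <content>.
sub start_element {
    my ($self, $el) = @_;
    if ($el->{LocalName} eq 'content') {
        $self->{in_content} = 1;
        $self->{seen}       = 0;
    }
    $self->SUPER::start_element($el);
}

sub characters {
    my ($self, $data) = @_;
    if ($self->{in_content}) {
        my $max = 1_000_000;              # ~1 MB cutoff, adjust to taste
        return if $self->{seen} >= $max;  # past the limit: drop the chunk
        $data->{Data} = substr($data->{Data}, 0, $max - $self->{seen});
        $self->{seen} += length $data->{Data};
    }
    $self->SUPER::characters($data);
}

sub end_element {
    my ($self, $el) = @_;
    $self->{in_content} = 0 if $el->{LocalName} eq 'content';
    $self->SUPER::end_element($el);
}

package main;
use XML::SAX::ParserFactory;
use XML::SAX::Writer;

my $writer = XML::SAX::Writer->new(Output => 'trimmed.xml');
my $parser = XML::SAX::ParserFactory->parser(
    Handler => TruncateContent->new(Handler => $writer),
);
$parser->parse_uri('ginormous.xml');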

Re: Prune Twig From Huge XML File
by mirod (Canon) on Mar 16, 2009 at 20:05 UTC

    You can use the ignore_elts => { content => 1 } option to ignore content.
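
    A minimal sketch of that, combined with streaming output (the per-<product> flush is an assumption about the document shape; it prints each product as soon as it is complete):

    use strict;
    use warnings;
    use XML::Twig;

    # ignore_elts drops <content> without ever loading its text;
    # flushing each completed <product> prints it and keeps memory flat.
    XML::Twig->new(
        ignore_elts   => { content => 1 },
        twig_handlers => { product => sub { $_[0]->flush } },
        keep_spaces   => 1,
    )->parsefile('ginormous.xml');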

    I am a bit surprised by the performance you mention though, especially as my tests (a little dated, but AFAIK not much has changed since) indicated that XML::SAX was quite slow. What Perl version is this on?

      Hi, thanks for the quick response, wow! I'll try that. I had already tried $content->ignore, but it didn't seem to speed things up. I am using 5.8. Perhaps the reason SAX was so quick is that it feeds data to the handler in chunks, which I can ignore as soon as my maximum length is reached...? Dunno, I am a beginner...

        That makes sense; I'll have to look at it tomorrow. It is indeed likely that XML::Twig reads the entire content of the element (worse than that, it probably buffers it, which implies copying it a few times...) before discarding it. I'll see if I can improve this.

       ignore_elts works like a charm, thanks!
Re: Prune Twig From Huge XML File
by Jenda (Abbot) on Mar 16, 2009 at 23:52 UTC

    Assuming you already have the trimmed contents somewhere and you don't mind that the order of the child tags is not kept, this might work. The content of the <content> tag will be skipped by expat, so it should not waste memory:

    use strict;
    use XML::Rules;

    my @contents = ('first trimmed content', 'second trimmed content');

    my $parser = XML::Rules->new(
        style       => 'filter',
        start_rules => {
            content => 'skip',
        },
        rules => {
            _default => 'raw',
            product  => sub {
                my ($tag, $attr, $parser) = @_[0, 1, 4];
                $attr->{content} = [ $contents[ $parser->{pad}++ ] ];
                return $tag => $attr;
            },
        },
    );
    $parser->filter(\*DATA);

    __DATA__
    <document>
     <product>
      <date>2008-10-15</date>
      <price>124</price>
      <content>heinous amount of unwanted text</content>
      <color>red</color>
     </product>
     <product>
      <date>2009/01/30</date>
      <price>10</price>
      <content>heinous amount of unwanted text</content>
      <color>black</color>
     </product>
    </document>

    Or better formatted, but with an even less defined order of the child tags of <product>, and ¡assuming all those child tags have no attributes or children!:

    use strict;
    use XML::Rules;

    my @contents = ('first trimmed content', 'second trimmed content');

    my $parser = XML::Rules->new(
        style       => 'filter',
        ident       => ' ',
        stripspaces => 3,
        start_rules => {
            content => 'skip',
        },
        rules => {
            _default => 'content array',
            product  => sub {
                my ($tag, $attr, $parser) = @_[0, 1, 4];
                $attr->{content} = [ $contents[ $parser->{pad}++ ] ];
                return $tag => $attr;
            },
        },
    );
    $parser->filter(\*DATA);

    __DATA__
    <document>
     <product>
      <date>2008-10-15</date>
      <price>124</price>
      <content>heinous amount of unwanted text</content>
      <color>red</color>
     </product>
     <product>
      <date>2009/01/30</date>
      <price>10</price>
      <content>heinous amount of unwanted text</content>
      <color>black</color>
     </product>
    </document>

    If the order is important, you'd have to take the first version and tweak the handler for the <product> tag to insert the content at the right place in the array @{$attr->{_content}}, along these lines:
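
    For instance (only a sketch; the position right after <price> is an assumption about where <content> originally sat):

    product => sub {
        my ($tag, $attr, $parser) = @_[0, 1, 4];
        my @kids = @{ $attr->{_content} };   # raw child items, in document order
        for my $i (0 .. $#kids) {
            next unless ref $kids[$i] && $kids[$i][0] eq 'price';
            # splice a raw [tag, attributes] item back in right after <price>
            splice @kids, $i + 1, 0,
                [ 'content', { _content => $contents[ $parser->{pad}++ ] } ];
            last;
        }
        $attr->{_content} = \@kids;
        return $tag => $attr;
    },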

      This is great stuff, thanks!
      I haven't used XML::Rules 'til now, looks like it's worth having a look at.
Re: Prune Twig From Huge XML File
by mirod (Canon) on Mar 17, 2009 at 09:50 UTC

    I checked, and when you use ignore_elts, the data in the ignored element is never loaded, so there is no reason why the code shouldn't be fast.

    Indeed, the following code takes 0.2s on my (rather slow) machine to prune a 200 MB document containing 20 content elements, each holding a 10 MB CDATA section:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    XML::Twig->new(
        ignore_elts   => { content => 1 },
        twig_handlers => { _default_ => sub { $_->flush } },
        keep_spaces   => 1,
    )->parsefile('doc_with_big_content.xml');

    Now I have to see if I can improve the "snipping" part. Maybe by giving the option to not buffer the entire text for each element. How big is your file BTW?

      Yes, ignore_elts works perfectly and is very fast at removing the content. I was stupidly trying to use $content->delete in a handler.

      My file is ~250MB, the biggest content chunk is 75MB.