in reply to Re^2: Out of memory with XML::Parser
in thread Out of memory with XML::Parser

Now we are getting somewhere.

Try looking at $expat->recognized_string or $expat->original_string on the first call to the Char handler, and see whether they contain what you are looking for.

Is the data in regular (PCDATA) text, or is it in a CDATA section?

Re^4: Out of memory with XML::Parser
by LukeyBoy (Friar) on Sep 14, 2005 at 20:58 UTC
    OK, I've done that for the problematic element. It seems to just keep firing the character data sub over and over, and the recognized and original strings always match (and always appear to be Base64-encoded data). The data is PCDATA, not CDATA. Should I try using CDATA?

      You don't need a CDATA section, as Base64 encoding uses neither '<' nor '&'.
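The point about the Base64 alphabet can be checked with a minimal sketch, assuming the core MIME::Base64 module is available. The sample data below is made up for illustration and is not taken from the poster's file:

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Illustrative input only: every byte value, repeated a few times.
my $raw     = join '', map { chr($_) } (0 .. 255) x 3;
my $encoded = encode_base64($raw, '');   # '' suppresses line breaks

# Count the characters that would need escaping in XML character data.
my $needs_escaping = () = $encoded =~ /[<&]/g;
print "characters needing escaping: $needs_escaping\n";   # prints 0
```

Since the encoder only ever emits A-Z, a-z, 0-9, '+', '/' and '=', the count is zero whatever the input bytes are.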

      Darn! I don't know what to say, especially as I am not able to reproduce the bug. What versions of perl, XML::Parser and expat are you using? On which OS?

      The code below works just fine for me (I actually get a single call for each long element).

          #!/usr/bin/perl
          use strict;
          use warnings;
          use XML::Parser;

          my $size= 500000;
          my @base64_chars=('a'..'z','A'..'Z','0'..'9','+','/','=');
          my $string= join( '', map { $base64_chars[rand(@base64_chars)] }(1..$size));

          my $doc= qq{<doc>
            <elt>foo</elt>
            <long1>$string</long1>
            <long2>$string</long2>
            <elt>bar</elt>
          </doc>};

          my $p= XML::Parser->new( Handlers => { Char => \&char, });
          $p->parse( $doc);
          exit;

          sub char
            { my( $expat, $char)= @_;
              print "in ", $expat->current_element, " - ",
                    "length char: ", length( $char), " - ",
                    "length recognized: ", length( $expat->recognized_string), " - ",
                    "length original: ", length( $expat->original_string), "\n",
                    ;
            }

      I also tried getting the data from a file, and including "\n" to get several calls to the Char handler, and it all worked nicely:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use XML::Parser;
          use Fatal qw(open);

          my $line_size= 100000;
          my $nb_lines= 5;
          my @base64_chars=('a'..'z','A'..'Z','0'..'9','+','/','=');
          my $line= join( '', map { $base64_chars[rand(@base64_chars)] }(1..$line_size));
          $line.= "\n";
          my $string= $line x $nb_lines;

          my( $long1_length, $long2_length);

          my $doc= qq{<doc><elt>foo</elt><long1>$string</long1><long2>$string</long2><elt>bar</elt></doc>};

          open( my $xml, '>', "$0.xml");
          print {$xml} $doc;
          close $xml;

          my $p= XML::Parser->new( Handlers => { Char => \&char, });
          $p->parsefile( "$0.xml");

          print "long1: $long1_length\n";
          print "long2: $long2_length\n";
          exit;

          sub char
            { my( $expat, $char)= @_;
              print "in ", $expat->current_element, " - ",
                    "length char: ", length( $char), " - ",
                    "length recognized: ", length( $expat->recognized_string), " - ",
                    "length original: ", length( $expat->original_string), "\n",
                    ;
              if( $expat->in_element( 'long1')) { $long1_length+= length( $char); }
              if( $expat->in_element( 'long2')) { $long2_length+= length( $char); }
            }
        I figured it out! After appending the characters from XML::Parser to my string I now undef the expat character variable. Suddenly the whole script moves way faster and uses less memory. This is the new character data handling routine:
            Char => sub {
                my $expat = shift;
                my $chars = shift;
                $cbuffer = $cbuffer . $chars;
                undef $chars;
            }
        I'm using Debian Testing with XML::Parser version 2.34 (Debian revision 4). I'll play with this more tonight...
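For reference, collecting the pieces delivered to the Char handler could be sketched as follows. This is only an illustration, not the poster's actual script: the %text_of hash, the $buffer variable and the sample document are hypothetical, and the sketch only handles elements that contain text and no child elements.

```perl
use strict;
use warnings;
use XML::Parser;

my %text_of;        # element name => accumulated character data (hypothetical)
my $buffer = '';    # text collected for the element currently open

my $parser = XML::Parser->new(
    Handlers => {
        # A new element opens: start from an empty buffer.
        Start => sub { $buffer = ''; },
        # Expat may deliver one text node in several pieces: append each one.
        Char  => sub { my ($expat, $chars) = @_; $buffer .= $chars; },
        # The element is complete: store what was collected, then reset.
        End   => sub {
            my ($expat, $name) = @_;
            $text_of{$name} = $buffer;
            $buffer = '';
        },
    },
);

# Made-up document with one long Base64-looking element.
$parser->parse('<doc><long1>' . ('QUJD' x 50_000) . '</long1></doc>');
print "long1 length: ", length($text_of{long1}), "\n";
```

The `.=` operator appends to the existing scalar in place, and the total collected does not depend on how many times the Char handler fires.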
        Sorry, I made a mistake - the element it's choking on contains 900 kilobytes of data (I had the tracing statements in the wrong section). Since I know the size of the undecoded data in advance (it's one of the element's attributes), is there any way I can get Perl to preallocate a scalar to that size, plus padding for the Base64 encoding? Also, when it's processing the huge file, strace shows massive numbers of mremap calls:
            mremap(0x40617000, 827392, 827392, MREMAP_MAYMOVE) = 0x40617000
            mremap(0x404bd000, 827392, 827392, MREMAP_MAYMOVE) = 0x404bd000
            mremap(0x40617000, 827392, 827392, MREMAP_MAYMOVE) = 0x40617000
            mremap(0x404bd000, 827392, 827392, MREMAP_MAYMOVE) = 0x404bd000
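On the preallocation question, one commonly used idiom is to grow the scalar to the expected size once and then empty it. The sketch below assumes that perl keeps the buffer allocated after the scalar is emptied, which is an implementation detail rather than a documented guarantee. The base64_length helper and the stand-in data are hypothetical, and the 76-character line length is MIME::Base64's default, not something known about the poster's file.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: length of the Base64 text for $raw_len bytes.
# Every group of 3 bytes (the last one padded) becomes 4 characters; if the
# encoder wraps lines, each line of up to $line_len characters gains a "\n".
sub base64_length {
    my ($raw_len, $line_len) = @_;
    my $chars = 4 * int(($raw_len + 2) / 3);
    return $chars unless $line_len;
    my $lines = int(($chars + $line_len - 1) / $line_len);
    return $chars + $lines;
}

my $raw_len  = 900 * 1024;                    # size taken from the attribute
my $expected = base64_length($raw_len, 76);

# Grow the scalar once, then empty it. Assumption: the allocated buffer is
# retained, so later appends need not enlarge it.
my $cbuffer = "\0" x $expected;
$cbuffer = '';

# Stand-in for the data arriving through the Char handler.
$cbuffer .= encode_base64('x' x $raw_len);
print "expected $expected, got ", length($cbuffer), "\n";
```

If the line wrapping in the real file differs from 76 characters, the estimate will be off by the number of line terminators, so a little extra padding is a reasonable precaution.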