in reply to Re: Dangerous XML::Twig (or XML::Parser?) bug. Long text is read incorrectly!

Your fix caused the stuff after newlines to get lost so I tried to fix the fix.

In my first attempt I tried to replace the code on lines 1731-1733 with:

    $t->{twig_chunk_number} = 0 if !defined($elt->{cdata});
    $elt->{cdata}.= $t->{twig_stored_spaces}.$string
      unless( $t->{twig_keep_encoding} && defined($elt->{cdata})
              && length($elt->{cdata})>1024 && ++$t->{twig_chunk_number}==1) ; # fixes a bug in XML::Parser for long CDATA

This helped somewhat: the script in the root node went fine and the text was complete. The problem was that as soon as I copied the long line a second time within the tag, the original problem reappeared: the last several characters of the second line appeared twice.

I'll try something more, but I think the place this should be fixed is XML::Parser, not XML::Twig.

UPDATE: Looks like this works. You'll probably want to tweak it to fit better into the module:

    $elt->{cdata}.= $t->{twig_stored_spaces}.$string unless $t->{twig_skip_next_chunk}; # fixes a bug in XML::Parser for long CDATA
    if ( $t->{twig_keep_encoding} && defined($string) && length($string)>1024) {
        $t->{twig_skip_next_chunk} = 1;
    } else {
        $t->{twig_skip_next_chunk} = 0;
    }

Looks to me like we need to remember whether the last chunk was larger than 1024.
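
To make the idea easier to try outside the module, here is a minimal standalone sketch of the same skip-next-chunk logic. It assumes, as the patch above implies, that the chunk delivered right after an over-long chunk is a duplicate that should be dropped. The name make_appender and the sample chunks are hypothetical, and the twig_keep_encoding check is left out for brevity.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the character handler: it appends chunks to a
# buffer, but drops the chunk that follows any chunk longer than $limit.
sub make_appender {
    my ($limit) = @_;
    my $skip_next = 0;     # plays the role of $t->{twig_skip_next_chunk}
    my $buffer    = '';    # plays the role of $elt->{cdata}
    return sub {
        my ($chunk) = @_;
        $buffer .= $chunk if defined $chunk && !$skip_next;
        # Remember whether THIS chunk was over the limit, for the next call.
        $skip_next = ( defined $chunk && length($chunk) > $limit ) ? 1 : 0;
        return $buffer;
    };
}

my $append = make_appender(1024);
$append->( 'a' x 1030 );          # long chunk: kept, arms the skip flag
$append->('aaaaaa');              # assumed duplicate tail: dropped
my $text = $append->('rest');     # normal chunk: kept, flag already cleared
print length($text), "\n";        # prints 1034
```

Keeping the flag in a closure (or, in the module, on $t) rather than testing the accumulated length is what lets this work for a second long line in the same element, which is where the first attempt failed.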

All tests that ran on my computer passed.

Jenda
XML sucks. Badly. SOAP on the other hand is the most powerful vacuum pump ever invented.

Re^3: Dangerous XML::Twig (or XML::Parser?) bug. Long text is read incorrectly!
by mirod (Canon) on Aug 09, 2005 at 19:27 UTC

    Yes, I had a fix like this... but it broke completely when I tested it with ISO-8859-1 extended characters (such as 'é'): the 1024 figure was not right. So I added a few layers of complex calculations, cursed a lot, and wrote more tests directly on XML::Parser... until I realized that the solution was a lot simpler. For CDATA sections, the string passed to the character handler is already in the original encoding (I had been staring at such strings for over an hour when it hit me!). So there was no need to do all this: just use the usual string within a CDATA section... et voilà!

    Thanks for looking into it though.
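
The reply does not say exactly why the 1024 figure failed with ISO-8859-1 text. One plausible contributor, offered here only as an assumption, is that the same characters take a different number of bytes in different encodings, so a fixed length threshold fires at different points. A small sketch using the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(encode);

# 600 copies of 'é' (U+00E9) as a Perl character string.
my $chars = "\x{e9}" x 600;

# The same text as byte strings in two encodings.
my $latin1 = encode( 'ISO-8859-1', $chars );    # 1 byte per 'é'
my $utf8   = encode( 'UTF-8',      $chars );    # 2 bytes per 'é'

printf "chars=%d latin1_bytes=%d utf8_bytes=%d\n",
    length($chars), length($latin1), length($utf8);
# prints: chars=600 latin1_bytes=600 utf8_bytes=1200
```

Here the UTF-8 form is over 1024 bytes while the ISO-8859-1 form is not, so a test like length($string)>1024 gives different answers for the same text depending on which representation the handler sees.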