in reply to Dangerous XML::Twig (or XML::Parser?) bug. Long text is read incorrectly!

While you were at it, you could also have checked RT and found a curiously similar bug, which is fixed in the new version (3.18), currently on its way to a CPAN mirror near you. ;--)

It looks like there is still a problem with my fix though. If the CDATA section contains a return, then anything after the return seems to be lost, which really is no way to behave. I'll check, test the fix, and update the module, probably tomorrow.

Re^2: Dangerous XML::Twig (or XML::Parser?) bug. Long text is read incorrectly!
by Jenda (Abbot) on Aug 09, 2005 at 17:55 UTC

    Your fix caused the stuff after newlines to get lost, so I tried to fix the fix.

    In the first attempt I tried to replace the code on lines 1731-1733 with

    $t->{twig_chunk_number} = 0 if !defined($elt->{cdata});
    $elt->{cdata}.= $t->{twig_stored_spaces}.$string
      unless( $t->{twig_keep_encoding} && defined($elt->{cdata})
              && length($elt->{cdata})>1024 && ++$t->{twig_chunk_number}==1) ; # fixes a bug in XML::Parser for long CDATA

    This helped somewhat: the script in the root node went fine and the text was complete. But as soon as I copied the long line a second time within the tag, the original problem reappeared: the last several characters of the second line appeared twice.

    I'll try something more, but I think the place this should be fixed is XML::Parser, not XML::Twig.

    UPDATE: Looks like this works. You'll probably want to tweak it to fit better into the module:

    $elt->{cdata}.= $t->{twig_stored_spaces}.$string unless $t->{twig_skip_next_chunk}; # fixes a bug in XML::Parser for long CDATA
    if ( $t->{twig_keep_encoding} && defined($string) && length($string)>1024) {
        $t->{twig_skip_next_chunk} = 1;
    } else {
        $t->{twig_skip_next_chunk} = 0;
    }

    Looks to me like we need to remember whether the last chunk was larger than 1024.
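
    The skip-flag idea can be tried outside the module with a small stand-alone sketch. Everything here is an assumption made for illustration: append_chunks is a hypothetical helper, the plain hash stands in for the twig element, the twig_keep_encoding and twig_stored_spaces parts are left out, and the chunk list only imitates the symptom described above (a long chunk followed by a repeat of its last characters). It does not call XML::Parser or XML::Twig.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the character handler logic: append each chunk
# unless the previous chunk was longer than the threshold.
sub append_chunks {
    my ($threshold, @chunks) = @_;
    my %elt;                  # stands in for the twig element ($elt)
    my $skip_next_chunk = 0;  # stands in for $t->{twig_skip_next_chunk}

    for my $string (@chunks) {
        $elt{cdata} .= $string unless $skip_next_chunk;

        # Remember whether this chunk was "long"; if so, drop the next one.
        $skip_next_chunk =
            ( defined $string && length($string) > $threshold ) ? 1 : 0;
    }
    return $elt{cdata};
}

# Simulated input (an assumption, not real parser output): one long chunk,
# a spurious repeat of its last six characters, then a normal chunk.
my $long   = 'x' x 1030;
my @chunks = ( $long, substr( $long, -6 ), "\nrest" );

my $text = append_chunks( 1024, @chunks );
print length($text), "\n";    # prints 1035 (1030 + 5, repeat dropped)
```

    Note that the flag is reset after every chunk, so only the single chunk right after a long one is dropped; the text after the newline survives in this simulation.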

    All tests that ran on my computer passed.

    Jenda
    XML sucks. Badly. SOAP on the other hand is the most powerful vacuum pump ever invented.

      Yes, I had a fix like this... but it broke completely when I tested it with ISO-8859-1 extended characters (such as 'é'): the 1024 figure was not right. So I added a few layers of complex calculations, cursed a lot, wrote more tests directly on XML::Parser... until I realized that the solution was a lot simpler: for CDATA sections, the string passed to the character handler is in the original encoding (I had been staring at such strings for over an hour when it hit me!). So there was no need for any of this; within a CDATA section, just use the usual string... et voilà!
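
      As a rough illustration of why a fixed length cut-off is fragile once characters like 'é' are involved: the same text has a different length depending on whether it is counted in characters, ISO-8859-1 bytes, or UTF-8 bytes. The sketch below only shows that arithmetic with the core Encode module. Which of these lengths actually matters inside XML::Parser is an assumption here and is not confirmed by this thread; the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: 1000 copies of 'é' (U+00E9) as a Perl character string.
my $chars = "\x{E9}" x 1000;

my $latin1 = encode( 'ISO-8859-1', $chars );    # one byte per 'é'
my $utf8   = encode( 'UTF-8',      $chars );    # two bytes per 'é'

printf "characters:       %d\n", length $chars;
printf "ISO-8859-1 bytes: %d\n", length $latin1;
printf "UTF-8 bytes:      %d\n", length $utf8;
```

      With this sample, a test such as length(...) > 1024 is false for the ISO-8859-1 form and true for the UTF-8 form, so a threshold tuned on plain ASCII input would fire at the wrong point.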

      Thanks for looking into it though.

Re^2: Dangerous XML::Twig (or XML::Parser?) bug. Long text is read incorrectly!
by Jenda (Abbot) on Aug 09, 2005 at 12:05 UTC

    Thanks. Sorry, I forgot to check RT. I tested 3.18 and you are right about the return: the sentence was not duplicated, but everything after the return was lost :-(

    Jenda
    XML sucks. Badly. SOAP on the other hand is the most powerful vacuum pump ever invented.