Hi Monks, I recently came across a weird bug when using the Perl XML parser. I'm not sure whether it's an internal Perl bug or is caused by some unusual characters in my XML file.

Here's the parser I'm using:

my $parser = XML::Parser->new(Handlers => { Start => \&handle_start, Char => \&handle_char, End => \&handle_end });

The XML file has over 300 million lines. The part I'm having trouble with is:

<Rating_Class_Text indicator="_">LT Issuer Rating</Rating_Class_Text>

As I understand it, the Char handler (handle_char) should receive all the non-markup text of an element as one string. For example, here it should read "LT Issuer Rating". The line above is repeated hundreds of times in the XML file, and it works correctly most of the time. However, there is one occurrence where it reads "LT" and " Issuer Rating" as two separate strings.

Here are a few things I tried:

1) I checked that particular line using Emacs hexl mode and didn't find any unusual characters in it.

2) I cut the problematic part out and created a new XML file containing only that part. It works correctly.

3) I created a copy of the whole file with all the non-printable characters removed (which basically removes the leading spaces before XML tags and makes the file shorter). This does solve the problem.

4) I removed the non-printable characters from only that problematic line. This time Perl reads "LT Iss" and "user Rating".

Could anyone help me identify whether this is an internal bug in the Perl XML parser module when handling large amounts of data, or whether something is wrong in my XML file? Thanks a lot!
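
For reference, the XML::Parser documentation notes that a single run of non-markup text may generate multiple calls to the Char handler. If that is what is happening here (for example, because the text happens to straddle one of the parser's internal read buffers, which would fit the way the split point moves when the file length changes), then the usual workaround is to accumulate the text in the Char handler and only use it in the End handler. Below is a minimal sketch of that idea. The handler names match the ones above, but their bodies, the $buffer and @ratings variables, and the one-element sample document are assumptions made for illustration, not the original code.

```perl
use strict;
use warnings;
use XML::Parser;

my $buffer = '';   # hypothetical: text collected since the last Rating_Class_Text start tag
my @ratings;       # hypothetical: completed Rating_Class_Text values

sub handle_start {
    my ($expat, $element, %attrs) = @_;
    # Start a fresh buffer whenever the element of interest opens.
    $buffer = '' if $element eq 'Rating_Class_Text';
}

sub handle_char {
    my ($expat, $text) = @_;
    # May be called more than once for one text node, so append rather than assign.
    $buffer .= $text;
}

sub handle_end {
    my ($expat, $element) = @_;
    # Only here is the element's text known to be complete.
    push @ratings, $buffer if $element eq 'Rating_Class_Text';
}

my $parser = XML::Parser->new(Handlers => {
    Start => \&handle_start,
    Char  => \&handle_char,
    End   => \&handle_end,
});

# Tiny stand-in for the real 300-million-line file.
$parser->parse('<root><Rating_Class_Text indicator="_">LT Issuer Rating</Rating_Class_Text></root>');

print "$_\n" for @ratings;
```

With this arrangement it no longer matters how many pieces the text arrives in, since the pieces are joined before the value is used. For the real file, $parser->parsefile($path) would take the place of the parse call.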


In reply to Am I hitting a Perl XML parser module internal bug when dealing with large amounts of data? by feiiiiiiiiiii
