feiiiiiiiiiii has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I recently came across a weird bug when using the Perl XML parser. I'm not sure whether it's an internal Perl bug or is caused by some unusual characters in my XML file.

Here's the parser I'm using:

my $parser = XML::Parser->new(Handlers => { Start => \&handle_start, Char => \&handle_char, End => \&handle_end });

The XML file has over 300 million lines. The part I'm having trouble with is:

<Rating_Class_Text indicator="_">LT Issuer Rating</Rating_Class_Text>

By definition, the handle_char part should be able to read all the non-markup text into one String. For example, here it should read "LT Issuer Rating". The above line is repeated hundreds of times in the XML file, and it works correctly most of the time. However, in one case it reads "LT" and " Issuer Rating" into two separate Strings.

Here are a few things I tried:

1) I tried checking that particular line using Emacs hexl mode, and didn't find any unusual character in that line.

2) I tried cutting the problematic part out and creating a new XML file containing only that part. It works correctly.

3) I tried creating a copy of the whole file with all the non-printable characters removed (which basically removes the leading spaces before XML tags and makes the file shorter). This does solve the problem.

4) I tried removing the non-printable characters in that problematic line only. This time Perl reads "LT Iss" and "uer Rating".

Could anyone help identify whether this is an internal bug in the Perl XML parser module when handling large amounts of data, or whether something is wrong in my XML file? Thanks a lot!

Re: Am I hitting a Perl XML parser module internal bug when dealing with large amounts of data?
by Corion (Patriarch) on Sep 22, 2014 at 16:45 UTC

    The documentation for XML::Parser says for the Char handler:

    This event is generated when non-markup is recognized. The non-markup sequence of characters is in String. A single non-markup sequence of characters may generate multiple calls to this handler. Whatever the encoding of the string in the original document, this is given to the handler in UTF-8.

    ... which sounds like the thing you're experiencing.

    Most likely, you want to accumulate data in your Char handler and flush it in your End handler and your Start handler.
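
    A minimal sketch of that accumulate-and-flush approach, reusing the handler names from the question. The $buffer and @texts variables, the <Ratings> wrapper element and the sample data are made up for illustration, and the sketch assumes the element holds plain text only (no child elements mixed into it):

```perl
use strict;
use warnings;
use XML::Parser;

my $buffer = '';   # text collected so far for the current element (hypothetical name)
my @texts;         # one complete string per Rating_Class_Text element (hypothetical name)

# Start: a new element begins, so discard whatever was collected before it
# (here that is only the whitespace between tags).
sub handle_start {
    my ($expat, $element, %attrs) = @_;
    $buffer = '';
}

# Char: may be called several times for one run of text, so append
# instead of treating each call as the complete value.
sub handle_char {
    my ($expat, $string) = @_;
    $buffer .= $string;
}

# End: the text of the element is now complete, so use it and reset.
sub handle_end {
    my ($expat, $element) = @_;
    push @texts, $buffer if $element eq 'Rating_Class_Text';
    $buffer = '';
}

my $parser = XML::Parser->new(Handlers => {
    Start => \&handle_start,
    Char  => \&handle_char,
    End   => \&handle_end,
});

# Made-up sample data; a real run on a large file would use
# $parser->parsefile($path) instead.
my $xml = <<'XML';
<Ratings>
  <Rating_Class_Text indicator="_">LT Issuer Rating</Rating_Class_Text>
  <Rating_Class_Text indicator="_">LT &amp; ST Rating</Rating_Class_Text>
</Ratings>
XML

$parser->parse($xml);
print "$_\n" for @texts;
```

    The second sample element contains an entity reference, which is one of the places where the parser commonly hands the text to the Char handler in more than one piece. With the buffer in place, the End handler still sees each value as a single string no matter how the pieces were delivered.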

      Got it. Thanks!
Re: Am I hitting a Perl XML parser module internal bug when dealing with large amounts of data?
by Anonymous Monk on Sep 22, 2014 at 23:08 UTC
      Made my code a lot shorter and clearer by using XML::Rules. Thanks :)