mwinterer has asked for the wisdom of the Perl Monks concerning the following question:

I am seeing strange behavior with XML::Parser (v. 2.40) on a Solaris 10 system. Over the last few days, while parsing a file, there have been a few odd cases where an attribute is parsed twice (or maybe split); I am not sure how to describe it. This only occurs once in a while.

The issue occurs in the "hdl_char" routine. This is a simple sub that just assigns values to a hash keyed by attribute name for later processing.

Here is the output of printing each call to hdl_char with the attrib=value pairs. Note that the "..." is added to check for blank padding. Note the first case, sessionStartDateTime, where the actual value has been split, which results in the last piece, "Z", being assigned to the hash. The second attribute, sessionEndDateTime, is correctly parsed. I have checked the source file and there are no spurious characters. I have also listed part of the source record below.

currattr:GMTSessionStartDateTime = ...
currattr:GMTSessionStartDateTime = ...
currattr:sessionStartDateTime = 2013-09-10T17:15:00.000...
currattr:sessionStartDateTime = Z...
currattr: = ...
currattr: = ...
currattr:timeZoneOffset = -240...
currattr: = ...
currattr: = ...
currattr: = ...
currattr: = ...
currattr:GMTSessionEndDateTime = ...
currattr:GMTSessionEndDateTime = ...
currattr:sessionEndDateTime = 2013-09-10T17:30:00.000Z...
currattr: = ...
currattr: = ...
currattr:timeZoneOffset = -240...

<ns0:GMTSessionStartDateTime>
  <ns0:sessionStartDateTime>2013-09-10T17:15:00.000Z</ns0:sessionStartDateTime>
  <ns0:timeZoneOffset>-240</ns0:timeZoneOffset>
</ns0:GMTSessionStartDateTime>
<ns0:GMTSessionEndDateTime>
  <ns0:sessionEndDateTime>2013-09-10T17:30:00.000Z</ns0:sessionEndDateTime>
  <ns0:timeZoneOffset>-240</ns0:timeZoneOffset>
</ns0:GMTSessionEndDateTime>

This is an intermittent problem, but it is causing real issues for me at this point. Since this handler is called by the parser, I am assuming this is a parser issue. I am also using Expat.

Replies are listed 'Best First'.
Re: XML::Parser error
by runrig (Abbot) on Sep 11, 2013 at 18:23 UTC
    The char handler does not necessarily return all of the contiguous char data in one go. From the docs:
    Char (Expat, String) This event is generated when non-markup is recognized. The non-markup sequence of characters is in String. A single non-markup sequence of characters may generate multiple calls to this handler. Whatever the encoding of the string in the original document, this is given to the handler in UTF-8.
    So you will have to concatenate the char data yourself. That said, I would likely use some higher level library rather than use XML::Parser directly, then you very likely wouldn't have to do the concatenating yourself.
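A minimal sketch of that concatenation, assuming a simplified record modelled on the one in the question (the `<session>` wrapper, the dropped `ns0:` prefix, and the names `%record` and `$buffer` are illustrative, not the poster's actual code). The Char handler only appends to a buffer; the value is stored in the End handler, once the element's text is known to be complete.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample input, shaped like the record in the question.
my $xml = <<'XML';
<session>
  <GMTSessionStartDateTime>
    <sessionStartDateTime>2013-09-10T17:15:00.000Z</sessionStartDateTime>
    <timeZoneOffset>-240</timeZoneOffset>
  </GMTSessionStartDateTime>
</session>
XML

my %record;         # element name => complete text value
my $buffer = '';    # text collected for the element currently open

my $parser = XML::Parser->new(
    Handlers => {
        # A new element starts: discard whatever was collected before it.
        Start => sub { $buffer = '' },

        # May be called several times for one run of text, so append
        # instead of assigning.
        Char => sub {
            my ($expat, $string) = @_;
            $buffer .= $string;
        },

        # The element is closed, so the buffer now holds its full text.
        # Whitespace-only buffers (container elements) are skipped.
        End => sub {
            my ($expat, $element) = @_;
            $record{$element} = $buffer if $buffer =~ /\S/;
            $buffer = '';
        },
    },
);

$parser->parse($xml);

print "$_ = $record{$_}\n" for sort keys %record;
```

This single-buffer approach is only adequate for leaf elements that contain plain text; mixed content or repeated element names would need a stack or a list per name.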

      Thanks. That showed me how to fix it for now. Any suggestions on what higher level module would be good to use? Basically, XML files are delivered to my server every 15 minutes, and I parse them and load them into the DB. I looked around when developing and didn't really see anything that would fit that. Maybe I didn't look hard enough though.

        I'm partial to XML::Rules. Others like XML::Twig. Either is good if the docs are large and you should be processing the doc as you read it. XML::LibXML can be appropriate also (XPath support is great if you need that sort of thing).
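As a rough illustration of the higher level approach, here is a sketch using XML::Twig, assuming the same simplified record as in the question (the `<session>` wrapper, the dropped `ns0:` prefix, and the `@rows` structure are hypothetical). The handler fires only after the whole element has been parsed, so the text arrives already joined.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input, shaped like the record in the question.
my $xml = <<'XML';
<session>
  <GMTSessionStartDateTime>
    <sessionStartDateTime>2013-09-10T17:15:00.000Z</sessionStartDateTime>
    <timeZoneOffset>-240</timeZoneOffset>
  </GMTSessionStartDateTime>
</session>
XML

my @rows;    # one hash per record, ready to be loaded into the DB

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once the closing tag of the element has been seen.
        GMTSessionStartDateTime => sub {
            my ($t, $elt) = @_;
            push @rows, {
                start  => $elt->first_child_text('sessionStartDateTime'),
                offset => $elt->first_child_text('timeZoneOffset'),
            };
            $t->purge;    # release what has been parsed so far
        },
    },
);

$twig->parse($xml);

print "$_->{start} (offset $_->{offset})\n" for @rows;
```

Purging inside the handler is what keeps memory use flat on large files; for a real feed, `parsefile` would replace `parse`.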
Re: XML::Parser error (not)
by Anonymous Monk on Sep 11, 2013 at 22:36 UTC

      Sigh. I did not read the docs closely so it is my fault. I do appreciate your help. I have fixed my code and will look at higher level solutions. At least production is fixed.

      Reminds me of that Dilbert cartoon. I guess I should stand on my chair and shout out to the world "Does anyone know how to read the manual?" :)