vili has asked for the wisdom of the Perl Monks concerning the following question:

Greetings enlightened ones!

I'm experiencing the following difficulty:

Using XML::Parser, when an unescaped single quote is encountered in an
element value, the parser quits, stating:

"not well-formed (invalid token) at line 26, column 33, byte 725 at usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi/XML/Parser.pm line 187"

I have managed to pinpoint this to the single quote in "Captain's Chairs":

<?xml version = "1.0"?>
<newcars>
  <option>
    <seq>20</seq>
    <category>Seats</category>
    <primary>21B</primary>
    <secondary>21Bb</secondary>
    <option_desc>Second Row Captain's Chairs</option_desc>
    <price_desc>Cloth</price_desc>
    <retail>530</retail>
    <invoice>451</invoice>
    <disc_retail>0</disc_retail>
    <disc_invoice>0</disc_invoice>
  </option>
</newcars>

If anyone knows something about this, has dealt with it in the past,
has an idea, thinks XML is retarded, and impractical,
or just wants to let me know what an idiot, and pseudo programmer I am...
Please do, I have been struggling with this for a week.

Mooo chos grassy ass:
 ~ vili

Re: single quotes in XML values
by Tanktalus (Canon) on Mar 07, 2005 at 22:57 UTC

    I am quite sure I have never had any problems putting apostrophes into my XML documents using XML::Twig (which uses XML::Parser under the covers), so I'm not convinced that's the problem yet. Do you have a minimal working program that shows the error? ("Working" has a broad meaning here.) I'd really like to try this out myself.

    Thanks,

    Update: I saved your xml snippet to "a.xml" here, and created the following:

    use strict;
    use XML::Parser;

    my $p = XML::Parser->new();
    $p->parsefile('a.xml');

    No error messages. So that sample code would be incredibly helpful! Thanks.

Re: single quotes in XML values
by mirod (Canon) on Mar 07, 2005 at 23:49 UTC

    The quote in the text should not cause any trouble. A quote could only be a problem if it appears within an attribute value delimited with the same kind of quote (as in <desc long='Second Row Captain's Chairs'...).
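The distinction between text content and attribute values can be checked directly. The following is a minimal sketch, assuming XML::Parser is installed; the helper name is_well_formed and the sample strings are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns 1 if XML::Parser accepts the string,
# 0 if the parser dies with a well-formedness error.
sub is_well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser->new;
    return eval { $parser->parse($xml); 1 } ? 1 : 0;
}

my @samples = (
    # apostrophe in element text
    q{<desc>Second Row Captain's Chairs</desc>},
    # apostrophe inside a double-quoted attribute value
    q{<desc long="Second Row Captain's Chairs"/>},
    # escaped apostrophe inside a single-quoted attribute value
    q{<desc long='Second Row Captain&apos;s Chairs'/>},
    # raw apostrophe inside a single-quoted attribute value
    q{<desc long='Second Row Captain's Chairs'/>},
);

for my $xml (@samples) {
    printf "%-8s %s\n", (is_well_formed($xml) ? 'ok' : 'rejected'), $xml;
}
```

Only the last sample should be rejected, because the apostrophe ends the attribute value early. The first sample has the same shape as the element in the question.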

    This looks very much like an encoding problem to me. If you don't specify the encoding, it is assumed to be UTF-8 (or possibly UTF-16). Are you sure it is? Did you post the original data? Maybe what you think is a ' is in fact some other character, one that is not encoded as UTF-8 in your file? Windows is notorious for using non-standard encodings, of which I know nothing, so if that's where the data is created you might want to look into that. Otherwise, for example, ʼ (Char 00700 | U+02BC | modifier letter apostrophe) looks suspiciously like a single quote but is not. Other suspects: ′ (I am sure you had recognized Char 08242 | U+2032 | prime), ’ (the famous Char 08217 | U+2019 | right single quotation mark).

    BTW, I used Sean Burke's excellent Unicode Sliderule to find this character.
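
One way to test the encoding theory is to look at the raw bytes of the file. The sketch below uses core Perl only and lists every byte outside the ASCII range together with its position, so the result can be compared with the byte offset in the parser's error message. The function name find_non_ascii is hypothetical, and the script assumes it is given the original data file rather than a copy pasted through a browser. Line numbers are counted from 1; columns and byte offsets are counted from 0, which may differ by one from what the parser reports.

```perl
use strict;
use warnings;

# Hypothetical helper: scan a filehandle as raw bytes and return one
# [line, column, byte_offset, hex_value] entry per non-ASCII byte.
sub find_non_ascii {
    my ($fh) = @_;
    binmode $fh;                       # raw bytes, no decoding layer
    my @hits;
    my $line_no = 0;
    my $offset  = 0;                   # byte offset of the current line start
    while (defined(my $line = <$fh>)) {
        $line_no++;
        while ($line =~ /([^\x00-\x7F])/g) {
            my $col = pos($line) - 1;  # position of the matched byte
            push @hits,
                [ $line_no, $col, $offset + $col, sprintf('%02X', ord $1) ];
        }
        $offset += length $line;
    }
    return @hits;
}

# Command-line use: perl scan.pl a.xml
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    for my $hit (find_non_ascii($fh)) {
        printf "line %d, column %d, byte %d: 0x%s\n", @$hit;
    }
    close $fh;
}
```

If the script prints nothing, the file is pure ASCII and the apostrophe is not the culprit. A lone 0x92 would point to a Windows-1252 right single quotation mark, which is not valid UTF-8 on its own. A sequence such as E2 80 99 is the valid UTF-8 encoding of U+2019, so the cause would then lie elsewhere.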