samtregar has asked for the wisdom of the Perl Monks concerning the following question:

Here's the situation: I've got a big ol' CSV file full of data exported from an MS SQL database. I need to massage this data into XML documents that can be parsed with XML::Parser. So far I've been using Text::xSV and XML::Writer to get the job done. All was well until I started getting errors from XML::Parser like this:

not well-formed (invalid token) at line 514, column 188, byte 72499 at /usr/local/lib/perl5/site_perl/5.6.1/i686-linux/XML/Parser.pm line 185

I get a few dozen of these. All the bytes they're pointing to are high-ASCII characters of some sort. I'm guessing this means I need to do something special to output clean UTF-8. Somehow I thought XML::Writer would take care of that, but I guess not.

My first attempt was to use Unicode::Map8 to translate the input data from Latin1 (a guess at the character set) to UTF-8. That didn't work. So I tried the umap utility, which I've used successfully in similar circumstances before:

umap latin1:utf8 < data.old > data.new

But XML::Parser doesn't like data.new any better than data.old.

So I come to the monks, on bended knee. I'd be happy to get anything from a new debugging technique or an RTFM link to an outright solution. Thanks!

Replies are listed 'Best First'.
Re: Generating UTF-8 from nasty high ASCII input
by grantm (Parson) on Jul 10, 2002 at 11:14 UTC

    This subject is discussed in the Perl XML FAQ. Given that you don't know what encoding(s?) the original data used, you might find the 'sanitise' function in the FAQ useful.

    I regularly hit this problem when people paste stuff from MSWord since the 'smart quote' characters are not in the ISO-8859-1 set.
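
    One way to test the smart-quote theory is to look at which high bytes the file actually contains. ISO-8859-1 assigns only control characters to 0x80-0x9F, while Windows-1252 puts its curly quotes and similar punctuation there, so bytes in that range suggest the data is not really Latin-1. The following is only a rough sketch: classify_high_bytes is a made-up helper name and the sample string is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count the bytes >= 0x80 in a byte string and
# group them into the C1 range (0x80-0x9F) and the rest (0xA0-0xFF).
sub classify_high_bytes {
    my ($octets) = @_;
    my %count = (c1 => 0, upper => 0);
    for my $byte (unpack 'C*', $octets) {
        next if $byte < 0x80;
        $count{ $byte <= 0x9F ? 'c1' : 'upper' }++;
    }
    return \%count;
}

# Invented sample: 0x93 and 0x94 are curly quotes in Windows-1252,
# 0xE9 is e-acute in both ISO-8859-1 and Windows-1252.
my $sample = "He said \x93caf\xE9\x94";
my $counts = classify_high_bytes($sample);
printf "C1 range: %d, 0xA0-0xFF: %d\n", $counts->{c1}, $counts->{upper};
# prints "C1 range: 2, 0xA0-0xFF: 1"
```

    A non-zero count in the C1 range would point towards Windows-1252 (or something else entirely) rather than Latin-1 as the source encoding.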

      Thanks, that looks like it may be the solution. I'll try it this afternoon.

      -sam

Re: Generating UTF-8 from nasty high ASCII input
by IlyaM (Parson) on Jul 10, 2002 at 10:24 UTC
    Have you tried specifying the encoding in the generated XML files? I.e., <?xml version="1.0" encoding="ISO-8859-1"?>, or <?xml version="1.0" encoding="UTF-8"?> for the translated XML file?
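
    The declaration only helps if it matches the bytes that are actually written. As a minimal sketch of that idea, assuming Perl 5.8 or later (for PerlIO encoding layers) and text that contains no XML markup characters, the same encoding name can drive both the declaration and the output layer. write_xml_bytes and the <name> element are hypothetical, not part of XML::Writer or any other module.

```perl
use strict;
use warnings;

# Hypothetical helper: serialize a one-element document into a byte
# string, using the same encoding name for the XML declaration and for
# the output layer so the two cannot disagree.
# Assumes $text contains no characters that need XML escaping.
sub write_xml_bytes {
    my ($encoding, $text) = @_;
    my $buffer = '';
    open my $fh, ">:encoding($encoding)", \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} qq{<?xml version="1.0" encoding="$encoding"?>\n};
    print {$fh} "<name>$text</name>\n";
    close $fh or die "Cannot close in-memory handle: $!";
    return $buffer;
}

my $text   = "caf\x{E9}";    # character string ending in e-acute
my $latin1 = write_xml_bytes('ISO-8859-1', $text);
my $utf8   = write_xml_bytes('UTF-8', $text);

# The e-acute is one byte (E9) in the first document and two bytes
# (C3 A9) in the second; each declaration names the matching encoding.
print $latin1 =~ /caf\xE9</     ? "latin1 ok\n" : "latin1 mismatch\n";
print $utf8   =~ /caf\xC3\xA9</ ? "utf8 ok\n"   : "utf8 mismatch\n";
```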

    --
    Ilya Martynov (http://martynov.org/)

      No, I haven't. The XML system on the other end is spec'd to only process UTF-8, although I don't know how strict a rule that is. I'll give it a try if sanitise() doesn't work out.

      -sam

Re: Generating UTF-8 from nasty high ASCII input
by Joost (Canon) on Jul 10, 2002 at 13:38 UTC
    If you really need to convert to UTF-8 (and I would try IlyaM's suggestion first), you might be interested in the Encode module. It's in the 5.8.0 distribution.

    from the manpage:

    $string = decode(ENCODING, $octets [, CHECK])

    Decodes a sequence of octets assumed to be in ENCODING into Perl's internal form and returns the resulting string. As in encode(), ENCODING can be either a canonical name or an alias. For encoding names and aliases, see "Defining Aliases". For CHECK, see "Handling Malformed Data".

    I've only played with this a little, but 5.8 (5.8.0 RC2, that is) seems to be a lot more stable when you use utf8, so it might be your best bet.
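
    As a rough illustration of the decode/encode round trip, here is a minimal sketch using the Encode module. It assumes the input is really Windows-1252 (Encode's cp1252) rather than Latin-1, which is only a guess based on the smart-quote remark elsewhere in this thread; to_utf8 is a hypothetical helper name and the sample bytes are invented.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: reinterpret raw octets from the source encoding
# as characters, then serialize those characters as UTF-8 octets.
sub to_utf8 {
    my ($octets, $source_encoding) = @_;
    my $chars = decode($source_encoding, $octets);
    return encode('UTF-8', $chars);
}

# Invented sample: 0x93 / 0x94 are curly double quotes in cp1252,
# 0xE9 is e-acute.
my $raw  = "\x93caf\xE9\x94";
my $utf8 = to_utf8($raw, 'cp1252');

# Dump the result as hex so the multi-byte sequences are visible.
print join(' ', map { sprintf '%02X', $_ } unpack 'C*', $utf8), "\n";
# prints "E2 80 9C 63 61 66 C3 A9 E2 80 9D"
```

    If the source really is Latin-1, passing 'ISO-8859-1' instead of 'cp1252' is the only change needed; the result differs only for bytes in the 0x80-0x9F range.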

    -- Joost
    downtime n. The period during which a system is error-free and immune from user input.

      Do you know of any reason that this would work where Unicode::Map8 and umap didn't? My impression is that they perform the same task.

      -sam