Does anyone have some code that could guess whether some text (bytes) is Latin1 or UTF-8? These are the only two options I need to distinguish, so a regexp or something that would say "this can't be UTF-8" would be just fine.

We get XML to import from several different companies (new ones are added from time to time). Quite often I find out later that even though the XML either doesn't specify the encoding or claims to be UTF-8, it's actually Latin1. That means that as soon as there are some accented or fancy characters, the XML is rejected with a "not well-formed (invalid token)" message. (MS Word loves to convert quotes, ampersands and dashes to some extended chars.)

Of course the proper solution is to get the other side to either convert the data to UTF-8 or change the XML header, but that often takes some time on their end and the clients are not happy in the meantime.

I know I can catch the "invalid token" error, tweak the XML header and try to parse the XML again. I'd rather detect the wrong encoding before I start parsing.
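To make the kind of check I have in mind concrete, here is a minimal sketch, written in Python purely for illustration. It relies on the fact that Latin1 text containing bytes above 0x7F is very unlikely to form valid UTF-8 sequences, so a strict decode that fails means "this can't be UTF-8". The names `cannot_be_utf8` and `prepare_xml` and the `fallback` parameter are made up for this example. It assumes the XML declaration, if present, sits at the very first byte (no BOM), and that plain ISO-8859-1 is the right fallback; if the fancy quotes come from Word, the bytes may really be Windows-1252, which would need checking.

```python
import re

# Encoding pseudo-attribute inside an XML declaration at the very start of the data.
_ENCODING_ATTR = re.compile(
    rb'^(<\?xml[^>]*?\bencoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)
# Version pseudo-attribute, used when the declaration carries no encoding at all.
_VERSION_ATTR = re.compile(rb'^(<\?xml\s+version\s*=\s*(["\'])[^"\']*\2)')


def cannot_be_utf8(data: bytes) -> bool:
    """Return True if the bytes are not well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return True
    return False


def prepare_xml(data: bytes, fallback: str = "ISO-8859-1") -> bytes:
    """Return data unchanged if it is valid UTF-8; otherwise make the
    XML declaration name the fallback encoding (an assumption, see above)."""
    if not cannot_be_utf8(data):
        return data
    enc = fallback.encode("ascii")
    if _ENCODING_ATTR.search(data):
        # Declaration names an encoding: replace only the encoding name.
        return _ENCODING_ATTR.sub(
            lambda m: m.group(1) + m.group(2) + enc + m.group(2), data, count=1
        )
    if _VERSION_ATTR.search(data):
        # Declaration without an encoding: add one right after the version.
        return _VERSION_ATTR.sub(
            lambda m: m.group(1) + b' encoding="' + enc + b'"', data, count=1
        )
    # No declaration at all: prepend one.
    return b'<?xml version="1.0" encoding="' + enc + b'"?>\n' + data
```

Pure ASCII input is valid in both encodings and passes through untouched, which is harmless. The check is a guess in one direction only: a failed decode proves the data is not UTF-8, while a successful decode does not strictly prove it isn't Latin1.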

Thanks, Jenda
Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
   -- Rick Osborne


In reply to Guess between UTF8 and Latin1/ISO-8859-1 by Jenda
