Does anyone have some code that could guess whether some text (bytes) is Latin1 or UTF8? These are the only two options I need to distinguish, so a regexp or something that says "this can't be UTF8" would be just fine.
We get XML to import from several different companies (new ones are added from time to time). Quite often I find out later that even though the XML either doesn't specify the encoding or claims to be UTF-8, it's actually Latin1. This means that as soon as there are some accented or fancy characters, the XML is rejected with a "not well-formed (invalid token)" message. (MS Word loves to convert quotes, ampersands and dashes to extended characters.)
Of course the proper solution is to force the other side to either convert the stuff to UTF-8 or change the XML header, but that often takes some time on their end and the clients are not happy in the meantime.
I know I can catch the "invalid token" error, tweak the XML header and try to parse the XML again. I'd like to try to find out before I start the parsing.
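To make the kind of up-front check I mean concrete, here is a minimal sketch in Python, assuming the XML is available as raw bytes before parsing. The function name guess_encoding is hypothetical, not from any library. It relies on the observation that Latin1 text containing bytes above 0x7F is very unlikely to also form valid UTF-8 sequences, so a failed strict decode is a strong hint that the data is not UTF-8.

```python
def guess_encoding(data: bytes) -> str:
    """Guess between UTF-8 and Latin1 for a byte string.

    Returns "utf-8" if the bytes decode cleanly as strict UTF-8,
    otherwise "latin-1". Pure ASCII is valid in both encodings and
    is reported as "utf-8".
    """
    try:
        # bytes.decode() is strict by default: any malformed sequence
        # raises UnicodeDecodeError instead of being replaced.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"


if __name__ == "__main__":
    print(guess_encoding(b"caf\xc3\xa9"))  # e-acute encoded as UTF-8
    print(guess_encoding(b"caf\xe9"))      # e-acute encoded as Latin1
```

This is only a guess: every byte string is valid Latin1, so the check can only rule UTF-8 out, never confirm Latin1. If the text came out of MS Word, the fallback may really be Windows-1252 rather than strict ISO-8859-1, since that is where the curly quotes live.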
Thanks, Jenda
Always code as if the guy who ends up maintaining your code
will be a violent psychopath who knows where you live.
-- Rick Osborne