Re: Encoding problem

You're saying that someone took some Arabic text encoded in some suitable code page (encoding), and saved it as a stream of bytes. Then later, someone labeled that stream of bytes incorrectly as ISO-8859-1.

Is the mixed-in Latin text just the common ASCII subset? If so, you have it easy: ignore the ISO-8859-1 label and declare the encoding the data is really in. Either read the data with that encoding, or explicitly decode it from that encoding into Perl's internal (UTF-8 based) string representation.
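To make the "ignore the label" idea concrete, here is a minimal sketch in Python (the same idea applies in Perl through its Encode module). It assumes the real encoding turns out to be Windows-1256, which is only a guess until you have identified the code page; the sample string is made up.

```python
# Assumption: the data is really Windows-1256. Substitute whatever
# code page you actually identify.
ARABIC_ENCODING = "cp1256"

original = "مرحبا world"                  # made-up sample text
raw = original.encode(ARABIC_ENCODING)    # the bytes as they were saved

# What a reader sees when it trusts the wrong ISO-8859-1 label.
mislabeled = raw.decode("iso-8859-1")

# Case 1: you still have the raw bytes, so just decode them correctly.
fixed_from_bytes = raw.decode(ARABIC_ENCODING)

# Case 2: you only have the wrongly decoded string. ISO-8859-1 maps every
# byte to exactly one code point, so encoding back recovers the bytes.
fixed_from_text = mislabeled.encode("iso-8859-1").decode(ARABIC_ENCODING)

print(fixed_from_bytes == original, fixed_from_text == original)
```

The second case only works because nothing lossy happened in between. If the mislabeled text was later converted again or had characters replaced, the round trip will not be clean.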

If the Latin (non-Arabic) text is stored in some other code page that conflicts with the first, then you have to figure out how to separate the two again. I assume that would still be some 8-bit character set for Western languages, that is, ASCII plus a few extra symbols and accented letters.

The first step, no matter what, is to determine which Arabic code page was used. There are a few to choose from. Is it single-byte or multi-byte? If it is multi-byte, you can check whether a byte sequence is syntactically valid in that code page.
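The syntactic check can be sketched as a strict trial decode against a list of candidates. The candidate list below is my assumption (Arabic-capable encodings that Python ships codecs for), not something taken from the original data, and the function name is made up for illustration.

```python
# Assumed candidate list; extend or trim it for your data.
CANDIDATES = ["utf-8", "cp1256", "iso8859_6", "cp720", "cp864"]

def plausible_encodings(raw, candidates=CANDIDATES):
    """Return the candidates under which raw decodes without error."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok

sample = "مرحبا".encode("cp1256")   # made-up sample bytes
print(plausible_encodings(sample))
```

Keep in mind that this only rules encodings out. A single-byte code page accepts nearly any byte string, so passing there proves little, whereas high bytes that decode cleanly as strict UTF-8 are strong evidence for UTF-8.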

If they are both single byte, and the Latin is not just plain ASCII but uses all 8 bits, you have to determine what Latin code page was being used.

In any case (for two single-byte character sets), characters below 0x80 are unambiguous, since ASCII is the common subset of all of them (a quibble: dollar sign and backslash aside). So the question is whether the characters above 127 that are mixed in belong to some Western language or to Arabic. You might be able to tell by context: different fields, or different places in the text. Or you might find that non-English but still Western text is mostly ASCII with an occasional accented letter thrown in, while the Arabic words are all in G1, er, I mean made up entirely of characters in the range A1-FE. That's because a single-byte Arabic character set still has Western letters and digits in G0 (the common ASCII subset) and uses the high half for its own language (e.g. http://en.wikipedia.org/wiki/Code_page_1256).
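A rough sketch of that heuristic: split the bytes on ASCII whitespace and look at how much of each token lies above 0x7F. The 0.5 threshold and the function names are my own illustrative choices, not tuned values, so treat the labels as hints for a human to confirm.

```python
def classify_token(token):
    """Guess what a whitespace-delimited run of bytes contains."""
    high = sum(1 for b in token if b >= 0x80)
    if high == 0:
        return "ascii"
    # 0.5 is an arbitrary illustrative threshold, not a tuned value.
    return "arabic?" if high / len(token) >= 0.5 else "western?"

def classify_text(raw):
    # bytes.split() with no argument splits on ASCII whitespace only.
    return [(tok, classify_token(tok)) for tok in raw.split()]

# Made-up sample: plain ASCII, a Latin-1 word, and a cp1256 word.
raw = b"hello " + "café".encode("iso-8859-1") + b" " + "مرحبا".encode("cp1256")
for tok, label in classify_text(raw):
    print(tok, label)
```

Very short Western words that are mostly accented letters will be misclassified by a ratio like this, which is one more reason to lean on field or positional context where you have it.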

I'm pretty good with that in general. I wanted to find a job being an expert in just that, but no takers. I've successfully figured out multi-re-encoding munges on numerous occasions.

So, feel free to discuss concepts and details, and PM me if I don't see the thread. But as of yet, insufficient data.

If the different encoding is per-field, you might end up dumping every field with Encoding A and asking someone who knows the language which are sense and which are nonsense. Repeat with Encoding B and that language. I just can't imagine mixing words in a paragraph -- there must be some natural boundaries between differently-encoded regions.
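The per-field dump could look something like this. The record, the field names, and the two encodings are all hypothetical; plug in whatever Encoding A and Encoding B turn out to be.

```python
def dump_fields(fields, encodings=("cp1256", "iso-8859-1")):
    """Decode every field under every candidate so a reader can compare."""
    rows = []
    for name, raw in fields.items():
        row = {"field": name}
        for enc in encodings:
            # errors="replace" keeps the dump going on undecodable bytes.
            row[enc] = raw.decode(enc, errors="replace")
        rows.append(row)
    return rows

# Hypothetical record with one Arabic field and one Western field.
record = {
    "name": "مرحبا".encode("cp1256"),
    "city": "Málaga".encode("iso-8859-1"),
}
for row in dump_fields(record):
    print(row)
```

Someone who reads the language then marks, for each field, which column is sense and which is nonsense.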

—John
