You're saying that someone took some Arabic text encoded in some suitable code page (encoding), and saved it as a stream of bytes. Then later, someone labeled that stream of bytes incorrectly as ISO-8859-1.

Is the mixed-in Latin text just the common ASCII subset? If so, then you have it easy. Just ignore the 8859-1 label and specify the encoding the bytes are really in. Read the data as that, or otherwise decode from that into Perl's internal (UTF-8-based) representation.
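Here is a minimal sketch of that, assuming the real encoding turns out to be cp1256 (that is only my guess for illustration; substitute whatever you determine). It covers both the case where you still have the raw bytes and the case where something has already decoded them as ISO-8859-1.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the bytes are really Windows-1256. Swap in the code page
# you determine was actually used.
my $real_encoding = 'cp1256';

# Made-up sample: "abc " followed by ALEF and BEH as cp1256 bytes.
my $bytes = "abc \xC7\xC8";

# Case 1: you still have the raw bytes. Ignore the label and decode
# them with the real encoding.
my $text = decode($real_encoding, $bytes);

# Case 2: something already decoded the bytes as ISO-8859-1. Latin-1 maps
# each byte to the code point of the same number, so encoding back to
# ISO-8859-1 recovers the original bytes, which can then be decoded properly.
my $mislabeled = decode('iso-8859-1', $bytes);
my $recovered  = decode($real_encoding, encode('iso-8859-1', $mislabeled));

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n$recovered\n";
```

Case 2 only works if nothing lossy happened in between, such as a conversion that replaced unknown characters with question marks.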

If the Latin (non-Arabic) text is stored in some other code page that conflicts with the first, then you have to figure out how to separate them back out. I assume that would still be some 8-bit character set for Western languages, just with a few extras and accent marks.

The first step, no matter what, is to determine which Arabic code page was used. There are a few to choose from. Is it single-byte or multi-byte? If multi-byte, you can check whether a byte sequence is syntactically correct in that code page.
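A rough sketch of such a syntax check, using UTF-8 as the multi-byte candidate (my choice for the example; the helper name is_well_formed is made up). A single-byte code page accepts nearly any byte, so a check like this only helps to rule multi-byte candidates in or out.

```perl
use strict;
use warnings;
use Encode ();

# Returns 1 if $bytes is a well-formed sequence in $encoding, else 0.
# decode() with FB_CROAK dies on the first malformed sequence.
sub is_well_formed {
    my ($encoding, $bytes) = @_;
    my $copy = $bytes;    # FB_CROAK may modify its argument, so use a copy
    return eval { Encode::decode($encoding, $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $utf8_sample   = "\xD8\xA7\xD8\xA8";   # ALEF, BEH encoded as UTF-8
my $cp1256_sample = "\xC7\xC8";           # the same two letters as cp1256 bytes

for my $sample ($utf8_sample, $cp1256_sample) {
    printf "%s: %s\n", unpack('H*', $sample),
        is_well_formed('UTF-8', $sample) ? 'valid UTF-8' : 'not valid UTF-8';
}
```

Passing the check does not prove the guess is right, since short or mostly-ASCII data can be well-formed in several encodings at once.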

If they are both single byte, and the Latin is not just plain ASCII but uses all 8 bits, you have to determine what Latin code page was being used.

In any case (for two single-byte character sets), for chars < 0x80 the character is clear, as ASCII is the common subset of all of them (I quibble: dollar sign and backslash aside). So, are the characters above 127 that are mixed in with that in some Western language or in Arabic? You might be able to tell by context: different fields, or different places in the text. Or you might find that non-English but still Western text is mostly ASCII with an occasional accented letter thrown in, while the Arabic words are all in G1, er, I mean taken from characters in the range A1-FE. That's because the single-byte Arabic character set still has Western letters and numbers in G0 (the common ASCII subset) and uses the high half for its own language (e.g. http://en.wikipedia.org/wiki/Code_page_1256).
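That "mostly ASCII with the odd accent" versus "all high bytes" observation can be turned into a crude per-word guess. This is only a heuristic sketch: the function names and the 0.5 threshold are my own arbitrary assumptions, and real data will have exceptions.

```perl
use strict;
use warnings;

# Fraction of the bytes in $word that are above 0x7F.
sub high_byte_ratio {
    my ($word) = @_;
    return 0 unless length $word;
    my $high = () = $word =~ /[\x80-\xFF]/g;   # count the matches
    return $high / length $word;
}

# Guess 'ascii', 'western' or 'arabic' for one whitespace-delimited word.
# The 0.5 threshold is an arbitrary assumption, not a measured value.
sub guess_script {
    my ($word) = @_;
    my $ratio = high_byte_ratio($word);
    return 'ascii'  if $ratio == 0;
    return 'arabic' if $ratio > 0.5;
    return 'western';
}

# Made-up sample line: an accented Western word, three high bytes, plain ASCII.
my $line = "caf\xE9 \xC7\xC8\xCA hello";
print join(', ', map { guess_script($_) } split ' ', $line), "\n";
```

Digits and punctuation inside Arabic text are still in the ASCII range, so treat the result as a hint for where to look, not as a decision.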

I'm pretty good with that in general. I wanted to find a job being an expert in just that, but no takers. I've successfully figured out multi-re-encoding munges on numerous occasions.

So, feel free to discuss concepts and details, and PM me if I don't see the thread. But as of yet, insufficient data.

If the encoding differs per field, you might end up dumping every field with Encoding A and asking someone who knows the language which ones make sense and which are nonsense. Repeat with Encoding B and that language. I just can't imagine mixing words in a paragraph -- there must be some natural boundaries between differently-encoded regions.
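Dumping every field under each candidate could look something like this. The candidate list and the sample fields are invented for the example; use the encodings you actually suspect.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Candidate encodings are assumptions; substitute the ones you suspect.
my @candidates = ('cp1256', 'cp1252');

# Decode one field under every candidate, returning encoding => text pairs.
sub decode_all_ways {
    my ($bytes) = @_;
    return map { my $copy = $bytes; $_ => decode($_, $copy) } @candidates;
}

binmode STDOUT, ':encoding(UTF-8)';
my @fields = ("caf\xE9", "\xC7\xC8\xCA");   # made-up sample fields
for my $i (0 .. $#fields) {
    my %readings = decode_all_ways($fields[$i]);
    print "field $i as $_: $readings{$_}\n" for @candidates;
}
```

The output goes to a reader who knows each language; the code cannot tell sense from nonsense by itself.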

—John


In reply to Re: Encoding problem by John M. Dlugosz
in thread Encoding problem by grscott
