grscott has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I wonder if someone can offer some wisdom, please, regarding a problem that has come my way:

I have a UTF-8 file that contains mixed Latin and Arabic text, extracted from a database that was encoded as ISO-8859-1 - so, essentially, as I understand it, the file is a UTF-8 encoding of an ISO-8859-1 encoding of Latin and non-Latin text (don't ask! :-).

I am trying to parse this stuff into a format suitable for insertion into a UTF-8 database, but am having trouble with the non-Latin stuff. The Latin stuff is fine.

I have not needed to mess around with encodings much in the past, so I am rather out of my depth. I have tried various alternative 'layerings' of 'raw', 'utf8', and 'iso-8859-1' to try to get back to something that will go into the database and display correctly, but so far with no luck - different combinations produce different output, of course, but so far none of it is any more helpful than what I started with!

Is it even possible to achieve what I am attempting?? If so, any pointers would be most gratefully received!

Thanks in advance!

GRS

Re: Encoding problem
by ikegami (Patriarch) on May 08, 2009 at 18:26 UTC

    Could you provide a sample of the file (as seen by a hex/oct dumper, preferably)? It sounds like your file contains strings of text encoded using more than one encoding, with no indication of which encoding is used for which string.

    If so, your file is messed up. Do you have the data needed to rebuild a sane file?

    If not, it may still be possible to make a fairly accurate guess of the encoding used for a span of text given the information you gave. It would help if we saw a sample of this file.

      Agreed, the file IS messed up - not my idea, honest! :-) And I hope that I can get something less bizarre in the course of time, but that probably won't be for a while.

      Can't append a sample, sadly, as the file is at work, and I'm not.

      Basically, I have been working along the lines of trying to hit on a 'use open' line that would figure out the weird input format, and let me output to something more sensible; something like:

      use open "IN" => ":encoding(iso-8895-1):encoding(utf8)", "OUT" => ":encoding(utf8)";
      But, as I say, I'm really not sure what I am doing with this - are they the right values? Are they in the right sequence? Do I need anything else?? Not a clue, quite frankly! It would be something just to know whether or not I am on the right lines, at least.

      Cheers,

      GRS

        It depends whether the data is double encoded, or whether different encodings are used for different parts of the file. Thus my request for a sample of the file. I suspect the latter.

        Using :encoding twice (assuming it works at all) would only help in the former case. The order for decoding would be the opposite of the order used for encoding.

        The latter case would involve looking at each byte or group of bytes and making guesses.

        PS — Don't use UTF8 (an encoding known only to Perl) when decoding. That leaves you open to a vulnerability. Use UTF-8 instead.

        Update: Using :encoding twice doesn't always work, if it ever does. You'll need to use decode($enc1, decode($enc2, $_)) if your text is double-encoded.
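
        For what it's worth, here is one way to spell that out by hand: decode the outer UTF-8 layer, re-encode the resulting characters back to ISO-8859-1 octets, then decode those octets with whatever encoding the text was really written in. This is only a sketch -- the file names and the cp1256 guess for the Arabic are illustrative, not confirmed:

        use strict;
        use warnings;
        use Encode qw(decode encode);

        open my $in,  '<:raw',             'mixed.txt' or die "open: $!";
        open my $out, '>:encoding(UTF-8)', 'fixed.txt' or die "open: $!";

        while (my $line = <$in>) {
            # undo the outer layer: the file's bytes are UTF-8
            my $chars  = decode('UTF-8', $line);
            # those characters are really ISO-8859-1 byte values,
            # so turn them back into the original octets...
            my $octets = encode('iso-8859-1', $chars);
            # ...and decode those octets with the encoding the text
            # was actually written in (cp1256 is an assumption)
            print $out decode('cp1256', $octets);
        }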

Re: Encoding problem
by John M. Dlugosz (Monsignor) on May 09, 2009 at 00:00 UTC
    You're saying that someone took some Arabic text encoded in some suitable code page (encoding), and saved it as a stream of bytes. Then later, someone labeled that stream of bytes incorrectly as ISO-8859-1.

    The mixed Latin text, is that the common ASCII subset? If so, then you have it easy. Just ignore the 8859-1 indication and state the correct encoding that it is. Read as that, or otherwise convert from that to Perl's internal representation of UTF-8.
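
    In Perl terms that is just a matter of naming the real encoding on the input layer. A tiny illustrative sketch (cp1256 and the file names are placeholders for whatever the data turns out to be):

    use strict;
    use warnings;
    open my $in,  '<:encoding(cp1256)', 'arabic.txt' or die "open: $!";
    open my $out, '>:encoding(UTF-8)',  'out.txt'    or die "open: $!";
    print $out $_ while <$in>;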

    If the Latin (non-Arabic) text is stored in some other code page that conflicts with the first, then you have to figure out how to separate them back out. I assume that would still be some 8-bit character set for Western languages, just with a few extras and accent marks.

    First, no matter what, you need to determine the Arabic code page that was used. There are a few to choose from. Is it single byte or multi-byte? If multi-byte, you can figure out whether a sequence is syntactically correct in that code page.

    If they are both single byte, and the Latin is not just plain ASCII but uses all 8 bits, you have to determine what Latin code page was being used.

    In any case (for two single-byte char sets), for chars < 0x80 the character is clear, as ASCII is the common subset of all of them (I quibble: dollar sign and backslash aside). So are the characters above 127 mixed in with that in whatever Western language, or in Arabic? You might be able to tell by context: different fields, or different places in the text. Or, you might find that non-English but still Western text is mostly ASCII with an occasional accented letter thrown in, while the Arabic words are all in G1, er, I mean taken from characters in the range of A1-FE. That's because the single-byte Arabic character set still has Western letters and numbers in G0 (the common ASCII subset) and uses the high half for its own language (e.g. http://en.wikipedia.org/wiki/Code_page_1256). That difference alone can drive a word-by-word guess, as in the sketch below.
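
    To make that concrete, a rough heuristic sketch along those lines -- it assumes one encoding per whitespace-separated word, cp1256 for the Arabic and iso-8859-1 for the Western text, all of which would need checking against the real data:

    use strict;
    use warnings;
    use Encode qw(decode);

    # Guess a single-byte encoding for one word by counting its
    # high-bit bytes: per the reasoning above, Arabic words are
    # nearly all high bytes, while Western words are mostly ASCII
    # with the odd accented letter.
    sub guess_encoding {
        my ($word) = @_;
        my $high = () = $word =~ /[\x80-\xFF]/g;
        return 'ascii'      if $high == 0;                 # pure ASCII: unambiguous
        return 'cp1256'     if $high > length($word) / 2;  # mostly high bytes
        return 'iso-8859-1';                               # occasional accent
    }

    # Decode word by word (crude: this collapses the original whitespace).
    my $bytes = do { local $/; open my $fh, '<:raw', 'mixed.txt' or die $!; <$fh> };
    my $text  = join ' ', map { decode(guess_encoding($_), $_) } split ' ', $bytes;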

    I'm pretty good with that in general. I wanted to find a job being an expert in just that, but no takers. I've successfully figured out multi-re-encoding munges on numerous occasions.

    So, feel free to discuss concepts and details, and PM me if I don't see the thread. But as of yet, insufficient data.

    If the different encoding is per-field, you might end up dumping every field with Encoding A and asking someone who knows the language which are sense and which are nonsense. Repeat with Encoding B and that language. I just can't imagine mixing words in a paragraph -- there must be some natural boundaries between differently-encoded regions.

    —John

Re: Encoding problem
by Burak (Chaplain) on May 08, 2009 at 18:15 UTC
    I'm not sure what code points Arabic text has, but you can try to decode the text to utf8 if the chars have higher code points. However, most of the latin1 text must be compatible with utf8 IMO, so you shouldn't have corrupt data unless you have some other encodings in there.
Re: Encoding problem
by graff (Chancellor) on May 09, 2009 at 15:50 UTC
    The Unicode range for Arabic is U+0600 - U+06FF, and the first 128 elements of this range are essentially equivalent to the various non-Unicode Arabic code pages (iso-8859-6 and cp1256, which differ from each other only in what they do with the "non-Arabic" code points -- i.e. the gaps around the actual Arabic letters).

    In general, if the Arabic data is in a single-byte encoding (i.e. not in Unicode), then your task is simply to convert from that to Unicode, and all you need to worry about is which non-Unicode code page to use: iso-8859-6 or cp1256. The choice will only make a difference if the source data is actually cp1256 and it uses some of the non-ISO code points that M$ is so fond of (e.g. "smart quotes", "special hyphen", possibly even some accented Latin letters like é): those will show up as "\x{fffd}" when you treat the input as iso-8859-6 while converting to Unicode.

    So the best strategy is: assume the data is really cp1256 (because this will cause no harm to iso-8859-6 data) -- a simple one-liner will do:

    perl -e 'open(I,q/<:encoding(cp1256)/,shift);binmode STDOUT,q/:utf8/;print <I>' ifile > ofile
    (It's very unlikely that the source data involves the old and all-too-clever MacArabic encoding, but if it did, there is a module for handling that as well: Lingua::AR::MacArabic. This encoding stands out in having two versions of various bracketing characters, left-to-right vs right-to-left. Don't worry about it.)
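
    And if you want to confirm the cp1256 guess after the fact, one hedged check (the file handling here is illustrative) is to decode a sample both ways and count replacement characters -- iso-8859-6 will produce U+FFFD wherever the data used a cp1256-only code point:

    use strict;
    use warnings;
    use Encode qw(decode);

    my $octets = do { local $/; open my $fh, '<:raw', shift or die $!; <$fh> };
    for my $enc ('cp1256', 'iso-8859-6') {
        my $chars = decode($enc, $octets);       # unmapped bytes become U+FFFD
        my $bad   = () = $chars =~ /\x{FFFD}/g;  # count them
        printf "%-11s %d replacement character(s)\n", $enc, $bad;
    }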

    (updated to shorten the one-liner a bit (then a bit more), and to put a real accented letter in the second paragraph)

Re: Encoding problem
by grantm (Parson) on May 09, 2009 at 22:54 UTC

    I had a similar problem with a database containing some Latin-1 characters and some UTF-8 characters. I created the CPAN module Encoding::FixLatin to fix this kind of thing. The distribution includes a command line filter that you can pipe a file through.

    It won't be able to unravel things that have been double encoded, but if you can get back to a raw stream of Latin-1 bytes mixed with UTF-8 bytes, then it should convert it all to UTF-8.
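
    A minimal usage sketch (the file names are illustrative; fix_latin returns a Perl character string, so write it out through a UTF-8 layer):

    use strict;
    use warnings;
    use Encoding::FixLatin qw(fix_latin);

    open my $in,  '<:raw',             'mixed.txt' or die "open: $!";
    open my $out, '>:encoding(UTF-8)', 'clean.txt' or die "open: $!";
    print $out fix_latin($_) while <$in>;

    Or just pipe the file through the distribution's command line filter instead.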