
UTF8 Validity

by menolly (Hermit)
on Feb 21, 2008 at 19:12 UTC ( [id://669356] : perlquestion )

menolly has asked for the wisdom of the Perl Monks concerning the following question:

I'm attempting to export data from an old, non-strict database, scrub it, and import it into the stricter postgres 8.3. I've hit a snag with non-ASCII data scattered through the source. The bulk is ASCII, but an occasional field isn't -- not whole rows or whole columns, just a few fields. (The row I'm testing with has a name in Windows-1251 encoding, but there may be other encodings elsewhere.)

Like a good Monk, I searched, and found Is utf8, ascii ?, which seemed like the solution to my dilemma. Alas, my data is passing Perl's utf8::valid, but being rejected by Postgres with ERROR:  invalid byte sequence for encoding "UTF8": 0xcce5.

Obviously, I'm not an expert when it comes to handling non-ASCII data. Any further tips for detecting problem data?

Replies are listed 'Best First'.
Re: UTF8 Validity
by Juerd (Abbot) on Feb 21, 2008 at 19:48 UTC

    utf8::valid checks the internal consistency of a string. It has nothing to do with whether external data is valid UTF-8.

      Ah. I tried it based on this reply, in the other thread. Is there a module/regex/etc. that I can use to detect non-utf8 data in a string?

        Just try to decode something as UTF-8. If that fails, fall back to something else.

        For example:

        use Encode qw(decode);
        # Note: "||" falls back to CP1252 even when the decode succeeds but
        # yields "" or "0"; use "//" if those values matter in your data.
        $foo = eval { decode("UTF-8", $foo, Encode::FB_CROAK) } || decode("CP1252", $foo);

Re: UTF8 Validity
by graff (Chancellor) on Feb 21, 2008 at 22:47 UTC
    As your "next first step", I would strongly recommend some detailed diagnosis of the non-ASCII content in your data. It seems pretty clear that the stuff from your "old, non-strict database" is not utf8, and you seem to expect that there might be a mixture of different encodings being used for the characters that are not ASCII.

    So, locate the rows that contain non-ASCII characters in one or more fields, isolate those fields, and look at them in a way that shows what the non-ASCII characters are, and where they are in the string. From that, you might be able to figure out (based on the ASCII characters in the context, if any) what each non-ASCII character should be (that is, which character of which character set).

    Then all you need to do is to create edited versions of the affected rows, replacing the non-ASCII characters with their correct utf8 equivalents.

    Here is the code to locate and print (in human-readable form) the affected rows:

    #!/usr/bin/perl -n
    print "$.:\t$_" if ( s/([^\x00-\x7f])/sprintf("\\x{%02x}",ord($1))/eg );
    If your data contains, e.g., a row with the single-byte à (cp1252 or iso-8859-1 "letter a with grave accent") between spaces, the program above will print the row with that letter being shown as follows:
    NNN: .... \x{e0} ...
    (where "NNN" is the line number in the input file, and "..." is whatever comes before and/or after " à ", and "e0" is the hex numeric value of that byte/character) Note that this script treats the input as raw binary (or at least, it should, unless your shell environment is messing that up). If there are any multi-byte characters in the data, they will appear as sequences of two (or more) consecutive "\x{hh}" strings.

    If you find that all the rows with non-ascii data are using the same encoding, then the job is easy: use Encode (as suggested above) to convert the whole data stream from that encoding to utf8. If different encodings are used in different rows, you'll need to create some sort of mapping table, keyed by row number or something, to associate the various rows with their various appropriate encodings.
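
    A minimal sketch of that single-encoding conversion, assuming the source rows are cp1251 (the sample uses the byte pair 0xCC 0xE5 that Postgres rejected; everything else is illustrative):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(decode encode);

    # The byte pair Postgres rejected; in cp1251 it reads as "Ме".
    my $bytes = "\xCC\xE5";

    my $chars = decode('cp1251', $bytes);   # raw bytes -> Perl characters
    my $utf8  = encode('UTF-8', $chars);    # characters -> valid UTF-8 bytes

    printf "%v02X\n", $utf8;                # D0.9C.D0.B5
    ```

    Decoding from the correct source encoding first, then encoding to UTF-8, is what turns the rejected bytes into a sequence Postgres will accept.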

      Thanks; that's the kind of pointer I need. Most of my non-ASCII/non-UTF8 data is either in contact data or easily connected to contact data, so I've been trying to guess the charset based on the geographic origin, with mixed results. I definitely have multiple encodings present -- so far, there's cp1251 (Cyrillic), latin1, some form of Japanese, and something I can't identify but have scrubbed out in the source DB.
        Encode::Guess is likely to be helpful for figuring out the source encodings for many of the Asian (multi-byte-char) strings, though it might not help much for distinguishing among single-byte encodings. Worth a try.
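            A hedged sketch of how that might look; the EUC-JP sample bytes are illustrative, and the suspect list should name only encodings you actually expect, or the guess becomes ambiguous:

            ```perl
            #!/usr/bin/perl
            use strict;
            use warnings;
            use Encode::Guess;   # ships with the Encode distribution

            my $bytes = "\xB4\xC1\xBB\xFA";   # "kanji" (U+6F22 U+5B57) in EUC-JP

            # guess_encoding() returns an Encode::Encoding object on success,
            # or an error message string when it cannot decide.
            my $enc = guess_encoding($bytes, qw(euc-jp shiftjis 7bit-jis));
            if (ref $enc) {
                print "guessed: ", $enc->name, "\n";
                my $chars = $enc->decode($bytes);
            } else {
                warn "could not guess: $enc\n";
            }
            ```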
Re: UTF8 Validity
by ikegami (Patriarch) on Feb 21, 2008 at 20:10 UTC

    For all my encoding needs, I use Encode. Encode concurs that "\xCC\xE5" is invalid UTF-8.

    use Encode qw( decode FB_CROAK );

    my $s_utf8 = "\xCC\xE5";
    print("utf8::valid = ", utf8::valid($s_utf8)?1:0, "\n");

    my $s_chars = eval { decode('UTF-8', $s_utf8, FB_CROAK) };
    print("valid decode = ", defined($s_chars)?1:0, "\n");
    utf8::valid = 1
    valid decode = 0

    I'm having problems locating where utf8::valid is implemented, so I don't know why it considers the string valid. I suspect it returns true for all strings of bytes (aka binary strings, strings with UTF8 flag off) such as $s_utf8 in my code.

Re: UTF8 Validity
by MZSanford (Curate) on Feb 21, 2008 at 21:34 UTC
Re: UTF8 Validity
by moritz (Cardinal) on Feb 22, 2008 at 10:34 UTC
    This is slightly off-topic here, but perhaps it's useful for someone some day:

    UTF8 ne UTF-8. A string can be valid utf8 but not valid UTF-8: UTF-8 is stricter, allowing exactly one way to encode each code point, whereas Perl's lax utf8 also accepts non-canonical encodings.

    That was my first thought when I read the title "UTF8 Validity", which is not "UTF-8 Validity" ;-)
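
    A small demonstration of the difference, relying on Encode's documented behaviour that the lax "utf8" encoding accepts surrogate code points while the strict "UTF-8" encoding rejects them:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    no warnings 'surrogate';   # lax utf8 may warn about the surrogate
    use Encode qw(decode FB_CROAK);

    # "\xED\xA0\x80" encodes the lone surrogate U+D800: structurally fine
    # under Perl's lax "utf8", but forbidden by strict "UTF-8".
    my $bytes = "\xED\xA0\x80";

    my $lax    = defined eval { decode('utf8',  $bytes, FB_CROAK) } ? 'ok' : 'rejected';
    my $strict = defined eval { decode('UTF-8', $bytes, FB_CROAK) } ? 'ok' : 'rejected';

    print "utf8: $lax, UTF-8: $strict\n";
    ```

    Run this and the lax decode succeeds while the strict one croaks, which is exactly why a database expecting strict UTF-8 can reject data Perl is happy to handle internally.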