daptal has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl script that copies content into a database.
The database's default encoding is LATIN1.
While copying to the database, I set the client encoding to UTF8 using
$dbh->{pg_enable_utf8}=1;
$dbh->do("SET client_encoding TO 'UTF8'") or die "could not set the client_encoding to UTF8 for $db_host $!";
However, during the copy I get errors like
DBD::Pg::db pg_endcopy failed: ERROR: character 0xd7a4 of encoding "UTF8" has no equivalent in "LATIN1"
DBD::Pg::db pg_endcopy failed: ERROR: character 0xe4bea1 of encoding "UTF8" has no equivalent in "LATIN1"
and the culprit keywords look like this:
ืครืกรืง
ืครืกรืง
ืครืกรืง
รยฉรยงรรืครยจ
รยฉรยงรรืครยจ
รยฉรยงรรืฉร
รยฉรยงรรืฉร
รยฉรยงรรย รยช รร
รยฉรยงรรย ื รยช
รยฉรยงรรย ื รยช

My question is: how can I grep for these kinds of keywords? I copied the same keywords to a database set to SQL_ASCII and it works fine.
Can you please suggest how I can grep for those characters?
Thanks heaps

Re: encoding issues
by graff (Chancellor) on Aug 31, 2010 at 02:29 UTC
    Please pardon the Shameless Plug for My Own Nodes, but I hope these will be useful...

    First of all, it'll help a lot to get a look at your data in terms of hex code-point numbers -- here's a tool you can use for that: tlu -- TransLiterate Unicode
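
As a rough stand-in for that kind of inspection (this is not the tlu tool itself, just a minimal sketch), the following decodes a UTF-8 byte string and lists its code points in hex. The helper name `codepoints` is hypothetical; the sample bytes are the ones from the first error message in the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw UTF-8 bytes into a list of "U+XXXX" labels.
sub codepoints {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # bytes -> Perl characters
    return map { sprintf 'U+%04X', ord($_) } split //, $chars;
}

# 0xd7a4 is the byte sequence reported in the first pg_endcopy error.
print join(' ', codepoints("\xd7\xa4")), "\n";    # prints "U+05E4"
```

Any code point above U+00FF, such as this one, lies outside what LATIN1 can represent.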

    Next, in terms of grepping for particular Unicode characters in data, there's this: grepp -- Perl version of grep
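
For the specific case in the question, a grep along these lines might be enough. This is a minimal sketch, not the grepp tool: it assumes the input lines are UTF-8 bytes and treats "LATIN1" as ISO-8859-1, so anything above U+00FF counts as a culprit. The helper name `non_latin1_lines` and the sample data are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: given lines as raw UTF-8 bytes, return those that
# contain at least one character outside the range U+0000..U+00FF.
sub non_latin1_lines {
    my (@byte_lines) = @_;
    my @culprits;
    for my $line (@byte_lines) {
        my $chars = decode('UTF-8', $line);    # work on characters, not bytes
        push @culprits, $line if $chars =~ /[^\x{00}-\x{FF}]/;
    }
    return @culprits;
}

# Illustrative data: "cafe" with e-acute (fits in Latin-1), two Hebrew
# letters (do not fit), and plain ASCII.
my @sample = ("caf\xc3\xa9", "\xd7\xa4\xd7\xa1", "plain ascii");
print "$_\n" for non_latin1_lines(@sample);
```

To run it over a real file, read the lines from a filehandle opened without an encoding layer, chomp them, and pass them to the helper; the lines it returns are the ones a LATIN1 database would reject.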

    Apart from that, in terms of getting things into the database properly, do you have the ability to create or alter tables? If so, you should be able to find the means to (re)define tables or columns to use utf8 encoding rather than the server's default latin1 encoding; that way, you won't need to worry about whether your data contains anything outside the latin1 range.

Re: encoding issues
by moritz (Cardinal) on Aug 31, 2010 at 07:31 UTC
    "The db is set to default encoding LATIN1. While doing a copy to the db i set the client encoding to UTF8"

    Why? Why introduce a mismatch by using a different client encoding than the database wants?

    It sounds a bit like "I have this airport, but when I try to steer ships to it, they hit a channel wall."

    Perl 6 - links to (nearly) everything that is Perl 6.
Re: encoding issues
by aquarium (Curate) on Aug 31, 2010 at 05:07 UTC
    I agree with graff that you need to rework your approach: you will never reliably wrestle UTF-8 into Latin-1. Switching the database to UTF-8 is likely the way to go.
    As for why the SQL_ASCII setting doesn't give errors: it probably takes the input one byte at a time and shamelessly shoves it into the varchar/LOB field without any conversion. But that shouldn't really enter the equation, as you're trying to stuff a wider set (UTF-8) into a narrower set (Latin-1, ASCII, or whatever), which is inherently non-trivial or even nonsensical.
    the hardest line to type correctly is: stty erase ^H
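
The "wider set into a narrower set" point above can be reproduced outside the database. The sketch below is only an illustration under assumptions: it uses Perl's Encode module with ISO-8859-1 standing in for the server's LATIN1, and the helper name `utf8_to_latin1` is hypothetical. Conversion succeeds for text that fits and fails for text that does not, which is roughly what the server reports during the copy.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert UTF-8 bytes to Latin-1 bytes.
# Returns undef when some character has no Latin-1 equivalent.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);
    # FB_CROAK makes encode() die on an unmappable character; eval traps it.
    return eval { encode('ISO-8859-1', $chars, Encode::FB_CROAK) };
}

# "cafe" with e-acute converts; the Hebrew letter from the error does not.
for my $sample ("caf\xc3\xa9", "\xd7\xa4") {
    my $out = utf8_to_latin1($sample);
    printf "%s -> %s\n", unpack('H*', $sample),
        defined $out ? unpack('H*', $out) : 'no Latin-1 equivalent';
}
```

Storing the bytes untouched, as SQL_ASCII appears to do, sidesteps this check entirely, which would explain why the same data loads there without errors.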