Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
In the code snippet below, why isn't $a marked as utf-8? My database expects me to send it valid UTF-8. If the string I send contains wide characters, as $b does, everything is fine. If I send it the data in $a, it gets truncated because it is not valid UTF-8. (I guess the connection to the DB socket is raw, so no automatic conversion to UTF-8 is done.)
Must I check everything I write to the DB with Encode::is_utf8 and decode anything that isn't marked as utf8?
I have read all the Perl Unicode-related docs but haven't reached enlightenment. Any help understanding this issue is greatly appreciated. (This is perl 5.8.6.)
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use Data::Dumper;
use Test::More qw(no_plan);

binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

my $a = "\x{ae}";
my $b = "\x{ae}\x{2122}";

ok( ! Encode::is_utf8($a), '\x{ae} is not marked as utf-8' );
ok( Encode::is_utf8($b), '\x{ae}\x{2122} is marked as utf-8' );
ok( length $a == 1, '\x{ae} character length 1' );
ok( length $b == 2, '\x{ae}\x{2122} character length 2' );
ok( do { use bytes; length $a } == 1, '\x{ae} byte length 1' );
ok( do { use bytes; length $b } == 5, '\x{ae}\x{2122} byte length 5');

# drop the wide character from $b
chop $b;
ok( Encode::is_utf8($b), '$b still utf-8 after chop' );
ok( length $b == 1, '$b character length 1 after chop' );
ok( do { use bytes; length $b } == 2, '$b byte length 2 after chop' );
is( $a, $b, '$a eq $b after utf8 upgrade of $a' );

# decode $a as Latin-1 into $x
my $x = decode('iso-8859-1', $a );
ok( Encode::is_utf8($x), '$x is utf-8' );
ok( $x eq $b );

# switch the utf8 flag off and compare again
Encode::_utf8_off( $b );
Encode::_utf8_off( $x );
ok( $x eq $b );
is( $a, $b, '$a and $b are byte same' );
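For the "must I check everything with Encode::is_utf8" part, here is a minimal sketch of one alternative, assuming the database wants UTF-8 encoded octets on the wire: encode every outgoing string explicitly instead of inspecting the internal flag. The helper name to_utf8_octets and the variables $plain and $wide are made up for illustration and are not from the original post; Encode::encode is the only library call used.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (an assumption, not part of the original post):
# turn a Perl character string into UTF-8 octets. encode() works on the
# characters of the string, so the result does not depend on whether the
# internal utf8 flag happens to be on or off.
sub to_utf8_octets {
    my ($chars) = @_;
    return encode('UTF-8', $chars);
}

my $plain = "\x{ae}";            # same value as $a in the question
my $wide  = "\x{ae}\x{2122}";    # same value as the original $b

for my $str ($plain, $wide) {
    my $octets = to_utf8_octets($str);
    # Print the octets as hex so the encoded form is visible.
    print join(' ', map { sprintf '%02X', ord($_) } split //, $octets), "\n";
}
```

With this approach both strings go through the same path before being written to the socket, so no per-string is_utf8 check is needed. Whether it suits a particular DB driver is an open question, since some drivers do their own encoding.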
Re: Why does perl not mark variable as utf-8?
by Joost (Canon) on Aug 29, 2007 at 12:10 UTC
by Anonymous Monk on Aug 29, 2007 at 13:51 UTC
by Joost (Canon) on Aug 29, 2007 at 14:20 UTC
by Anonymous Monk on Aug 29, 2007 at 14:59 UTC
by Joost (Canon) on Aug 29, 2007 at 15:14 UTC
by almut (Canon) on Aug 29, 2007 at 17:13 UTC
Re: Why does perl not mark variable as utf-8?
by Gangabass (Vicar) on Aug 29, 2007 at 11:50 UTC