Dear Monks,

In the code snippet below, why isn't $a marked as utf-8? My problem is that my database expects me to send it valid utf-8. If the string I send contains wide characters, as $b does, everything is fine. If I send the data in $a, it gets truncated because it is not valid utf-8. (The connection to the DB socket is raw, I guess, so no automatic conversion to utf-8 is done.)

Must I check everything I write to the DB with Encode::is_utf8 and decode anything that isn't marked as utf8?

I have read all the Perl Unicode-related docs but haven't reached enlightenment. Any help understanding this issue is greatly appreciated. (This is perl 5.8.6.)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use Data::Dumper;
    use Test::More qw(no_plan);

    binmode(STDOUT, ":utf8");
    binmode(STDERR, ":utf8");

    my $a = "\x{ae}";
    my $b = "\x{ae}\x{2122}";

    ok( ! Encode::is_utf8($a), '\x{ae} is not marked as utf-8' );
    ok( Encode::is_utf8($b), '\x{ae}\x{2122} is marked as utf-8' );

    ok( length $a == 1, '\x{ae} character length 1' );
    ok( length $b == 2, '\x{ae}\x{2122} character length 2' );

    ok( do { use bytes; length $a } == 1, '\x{ae} byte length 1' );
    ok( do { use bytes; length $b } == 5, '\x{ae}\x{2122} byte length 5' );

    chop $b;
    ok( Encode::is_utf8($b), '$b still utf-8 after chop' );
    ok( length $b == 1, '$b character length 1 after chop' );
    ok( do { use bytes; length $b } == 2, '$b byte length 2 after chop' );

    is( $a, $b, '$a eq $b after utf8 upgrade of $a' );

    my $x = decode('iso-8859-1', $a);
    ok( Encode::is_utf8($x), '$x is utf-8' );
    ok( $x eq $b, '$x eq $b' );

    Encode::_utf8_off( $b );
    Encode::_utf8_off( $x );
    ok( $x eq $b, '$x eq $b with the utf8 flag off' );

    # This last test fails: $a is the single byte 0xAE,
    # while $b is now the two bytes 0xC2 0xAE.
    is( $a, $b, '$a and $b are byte same' );
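For comparison, here is a minimal sketch of encoding explicitly at the DB boundary instead of inspecting the flag. It assumes the database simply wants UTF-8 octets; the helper name to_db_octets and the variables $narrow and $wide are hypothetical and only there for illustration. It shows Encode::encode giving the same octets for U+00AE whether or not the string is internally marked as utf-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from any DB driver): turn a Perl character
# string into UTF-8 octets before writing it to a raw socket.
sub to_db_octets {
    my ($chars) = @_;
    return Encode::encode('UTF-8', $chars);
}

# Render a string of octets as space-separated hex, for inspection.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $octets;
}

my $narrow = "\x{ae}";            # one character, not marked as utf-8
my $wide   = "\x{ae}\x{2122}";    # the wide character turns the flag on
chop $wide;                       # now also just U+00AE, still flagged

print hex_octets( to_db_octets($narrow) ), "\n";   # c2 ae
print hex_octets( to_db_octets($wide) ),   "\n";   # c2 ae
```

If that assumption holds, the flag would not need checking at all: the same two octets come out for both strings.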

In reply to Why does perl not mark variable as utf-8? by Anonymous Monk
