Re: Why does perl's internal utf8 seem to allow single-byte latin1?

What I expected from the script was the two byte sequence in all cases.

What I don't understand is that you go to the trouble of writing a script that shows your problem (good!), but then fail to 1) show the output of the script, and 2) write down what you expected it to print.

Not having people have to cut, paste and run your script greatly improves the chances of getting a useful answer.

Replies are listed 'Best First'.
Re^2: Why does perl's internal utf8 seem to allow single-byte latin1? by brycen (Monk) on Mar 24, 2010 at 18:13 UTC
The output of the script is in the original post:

```
Ein <#c3><#96>konomisches Modell
Ein <#d6>konomisches Modell 1
Ein <#d6>konomisches Modell 1
```

I expected to see <#c3><#96> in all three cases.

The actual confusion on the test script turns out to be a bug in the "print_chrcodes" function. I either should have used Devel::Peek, or ensured the function used "use bytes":

```perl
sub print_chrcodes
{
    my $str = shift;
    my $ret;
    use bytes;  #<<<<<<<<<here
    foreach my $ascval (unpack("C*", $str)) {
        if($ascval == 13) {
            $ret .= '<cr>';
            next;
        }
        if($ascval == 10) {
            $ret .= '<nl>';
            next;
        }
        if($ascval <  128) {
            $ret .= chr($ascval);
            next;
        }
        $ret .= sprintf ("<#%x>",$ascval) if($ascval >= 128);
    }
    return $ret;
}
```

What I was diagnosing was a situation where Unicode did not cleanly go through Storable::nfreeze, a database, then Storable::thaw:

```perl
use Storable;

$hash->{TITLE}="f\xc3\xa4ce=\xe2\x98\xbb";
print "TITLE=$hash->{TITLE}\n";
my $nfreeze = Storable::nfreeze($hash);
my $obj     = Storable::thaw($nfreeze);
print '${^UNICODE}='.${^UNICODE}."\n";
print "TITLE=$obj->{TITLE}\n";
```

That eventually resolved itself. The setting of PERL_UNICODE=SAD vs. PERL_UNICODE=0 was masking the true problem, as it allowed certain sequences perl thought of as latin1 to be seen on the terminal as their equivalent utf-8 character (in my test case, black smiling face ☻).

So in the end, with `use encoding "utf-8-strict";` added to the script, PERL_UNICODE=SAD, proper terminal settings, Apache Accept-Charset, and SET CLIENT-ENCODING UTF8, I'm round-tripping Unicode through the entire Browser->Apache->Perl->Database->Perl->Apache->Browser system.

Thanks for the help.
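For anyone who lands on this thread with the same confusion, here is a minimal sketch of the two effects discussed above. It is not the poster's code: `internal_bytes` is a hypothetical helper written for this illustration (a cut-down variant of the idea behind `print_chrcodes`), and the variable names are made up. Part 1 shows that one and the same character string can sit in Perl's buffer either as a single latin1 byte or as a two-byte UTF-8 sequence. Part 2 assumes you decode and encode explicitly with the core Encode module at the program's boundaries, which is an alternative to the `use encoding` plus PERL_UNICODE combination described above, and pushes the decoded string through Storable.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Storable qw(nfreeze thaw);

# Hypothetical helper (not the original print_chrcodes): render the bytes of
# Perl's internal buffer, showing anything >= 0x80 as <#xx>.
sub internal_bytes {
    my ($str) = @_;
    my $out = '';
    use bytes;    # from here to the end of the sub, unpack sees raw bytes
    for my $byte (unpack 'C*', $str) {
        $out .= $byte < 128 ? chr($byte) : sprintf('<#%x>', $byte);
    }
    return $out;
}

# Part 1: one character string, two internal representations.
my $native = "\xd6konomisch";    # U+00D6 held as a single byte, UTF8 flag off
my $wide   = $native;
utf8::upgrade($wide);            # same characters, buffer is now UTF-8

print "same string:   ", ($native eq $wide ? 'yes' : 'no'), "\n";
print "native buffer: ", internal_bytes($native), "\n";
print "wide buffer:   ", internal_bytes($wide), "\n";

# Part 2: decode on input, encode on output, Storable in between.
my $octets = "f\xc3\xa4ce=\xe2\x98\xbb";    # UTF-8 bytes as read from outside
my $title  = decode('UTF-8', $octets);      # now a 6-character string
my $copy   = thaw(nfreeze({ TITLE => $title }))->{TITLE};
my $back   = encode('UTF-8', $copy);        # bytes again, ready for output

print "characters:    ", length($title), "\n";
print "round trip ok: ", ($back eq $octets ? 'yes' : 'no'), "\n";
```

The two strings in part 1 compare equal with `eq` even though their buffers differ, which is why a byte-dumping function gives different answers depending on whether `use bytes` is in effect. Which representation Perl picks is an internal detail; the sketch only makes it visible.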
