
I am getting some interesting results while trying to discern the differences between Encode::decode("utf8", $var) and utf8::decode($var). I've already discovered that calling the former multiple times on a variable will eventually result in the error "Cannot decode string with wide characters at..." whereas the latter will happily run as many times as you want, simply returning false.
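The difference described above can be reproduced with a short script. This is only a sketch: the helper names encode_second_call_dies and utf8_decode_twice are made up for illustration, and the exact wording of the error raised by Encode::decode may vary between Encode versions, so the sketch only checks whether the call dies.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode once, then report whether a second
# Encode::decode call on the already-decoded string dies.
sub encode_second_call_dies {
    my ($octets) = @_;
    my $chars = Encode::decode("utf8", $octets);
    my $ok = eval { Encode::decode("utf8", $chars); 1 };
    return $ok ? 0 : 1;
}

# Hypothetical helper: call utf8::decode twice in place and return
# both results as 1 (true) or 0 (false).
sub utf8_decode_twice {
    my ($octets) = @_;
    my $first  = utf8::decode($octets) ? 1 : 0;
    my $second = utf8::decode($octets) ? 1 : 0;
    return ($first, $second);
}

my $bytes = "\xe8\xab\x8b\x0a";    # UTF-8 octets for U+8ACB plus a newline
printf "Encode::decode dies on second call: %d\n", encode_second_call_dies($bytes);
printf "utf8::decode results: %d %d\n", utf8_decode_twice($bytes);
```

Here the input is singly encoded, so the first decode already produces a wide character and the second call is the one that fails.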

What I'm having trouble understanding is why the length function returns different results depending on which method you use to decode. The problem arises because I am dealing with "doubly encoded" UTF-8 text from an outside file. To demonstrate the issue, I created a text file "test.txt" containing the following Unicode characters on one line: U+00e8, U+00ab, U+008b, U+000a. These are the double-encoding of the Unicode character U+8acb, followed by a newline character. The file was saved to disk as UTF-8. I then run the following Perl script:

    #!/usr/bin/perl

    use strict;
    use warnings;
    require "Encode.pm";
    require "utf8.pm";

    open FILE, "test.txt" or die $!;
    my @lines = <FILE>;
    my $test = $lines[0];

    print "Length: " . (length $test) . "\n";
    print "utf8 flag: " . utf8::is_utf8($test) . "\n";
    my @unicode = (unpack('U*', $test));
    print "Unicode:\n@unicode\n";
    my @hex = (unpack('H*', $test));
    print "Hex:\n@hex\n";

    print "==============\n";

    $test = Encode::decode("utf8", $test);
    print "Length: " . (length $test) . "\n";
    print "utf8 flag: " . utf8::is_utf8($test) . "\n";
    @unicode = (unpack('U*', $test));
    print "Unicode:\n@unicode\n";
    @hex = (unpack('H*', $test));
    print "Hex:\n@hex\n";

    print "==============\n";

    $test = Encode::decode("utf8", $test);
    print "Length: " . (length $test) . "\n";
    print "utf8 flag: " . utf8::is_utf8($test) . "\n";
    @unicode = (unpack('U*', $test));
    print "Unicode:\n@unicode\n";
    @hex = (unpack('H*', $test));
    print "Hex:\n@hex\n";

This gives the following output:

    Length: 7
    utf8 flag: 
    Unicode:
    195 168 194 171 194 139 10
    Hex:
    c3a8c2abc28b0a
    ==============
    Length: 4
    utf8 flag: 1
    Unicode:
    232 171 139 10
    Hex:
    c3a8c2abc28b0a
    ==============
    Length: 2
    utf8 flag: 1
    Unicode:
    35531 10
    Hex:
    e8ab8b0a
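For reference, the two hex dumps in this output are related by one extra round of UTF-8 encoding. The sketch below starts from U+8ACB plus a newline and encodes it twice with Encode::encode; the helper name double_encode is hypothetical, and the second call relies on Encode treating each octet of a byte string as a code point in the range U+0000 to U+00FF.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the singly and doubly UTF-8-encoded
# forms of a character string.
sub double_encode {
    my ($chars) = @_;
    # Characters -> UTF-8 octets.
    my $once = Encode::encode("UTF-8", $chars);
    # Each octet is now treated as a character and encoded again.
    my $twice = Encode::encode("UTF-8", $once);
    return ($once, $twice);
}

my ($once, $twice) = double_encode("\x{8acb}\n");
printf "once:  %s\n", unpack("H*", $once);     # e8ab8b0a
printf "twice: %s\n", unpack("H*", $twice);    # c3a8c2abc28b0a
```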

This is what I would expect. The length is originally 7 because perl treats $test as just a series of bytes. After decoding once, perl knows that $test is a series of UTF-8-encoded characters (i.e. instead of returning a length of 7 bytes, perl returns a length of 4 characters, even though $test still occupies 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is also what I would expect, since Encode::decode took the 4 code points and interpreted them as UTF-8-encoded bytes, resulting in 2 characters. The strange thing happens when I modify the code to call utf8::decode instead:

    #!/usr/bin/perl

    use strict;
    use warnings;
    require "Encode.pm";
    require "utf8.pm";

    open FILE, "test.txt" or die $!;
    my @lines = <FILE>;
    my $test = $lines[0];

    print "Length: " . (length $test) . "\n";
    print "utf8 flag: " . utf8::is_utf8($test) . "\n";
    my @unicode = (unpack('U*', $test));
    print "Unicode:\n@unicode\n";
    my @hex = (unpack('H*', $test));
    print "Hex:\n@hex\n";

    print "==============\n";

    utf8::decode($test);
    print "Length: " . (length $test) . "\n";
    print "utf8 flag: " . utf8::is_utf8($test) . "\n";
    @unicode = (unpack('U*', $test));
    print "Unicode:\n@unicode\n";
    @hex = (unpack('H*', $test));
    print "Hex:\n@hex\n";

    print "==============\n";

    utf8::decode($test);
    print "Length: " . (length $test) . "\n";
    print "utf8 flag: " . utf8::is_utf8($test) . "\n";
    @unicode = (unpack('U*', $test));
    print "Unicode:\n@unicode\n";
    @hex = (unpack('H*', $test));
    print "Hex:\n@hex\n";

This gives almost identical output; only the result of length differs:

    Length: 7
    utf8 flag: 
    Unicode:
    195 168 194 171 194 139 10
    Hex:
    c3a8c2abc28b0a
    ==============
    Length: 4
    utf8 flag: 1
    Unicode:
    232 171 139 10
    Hex:
    c3a8c2abc28b0a
    ==============
    Length: 4
    utf8 flag: 1
    Unicode:
    35531 10
    Hex:
    e8ab8b0a
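One way to cross-check the final "Length: 4" is to count the characters by a route that does not call length on the original variable. The sketch below does this with a regex match on a copy; count_chars_by_regex is a hypothetical helper, and the snippet only compares the two numbers without claiming why they might disagree on a given perl.

```perl
use strict;
use warnings;

# Hypothetical helper: count characters by matching every character
# (including newlines) in a copy of the string.
sub count_chars_by_regex {
    my ($s) = @_;
    my $n = () = $s =~ /./gs;
    return $n;
}

my $test = "\xc3\xa8\xc2\xab\xc2\x8b\x0a";    # the bytes read from test.txt
utf8::decode($test);
utf8::decode($test);
printf "length: %d, regex count: %d\n", length($test), count_chars_by_regex($test);
```

If the two numbers differ, the string itself holds 2 characters and only the value reported by length is off.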

It seems like perl first counts the bytes before decoding (as expected), then counts the characters after the first decoding, but then counts the bytes again after the second decoding (not expected). Why would this switch happen? Is there a lapse in my understanding of how these decoding functions work?
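To make the bytes-versus-characters distinction concrete, the character count and the internal byte count can be printed side by side at each stage. This is a debugging sketch only: char_and_byte_length is a hypothetical helper, and it uses the bytes pragma to look at the internal representation, which is not something to rely on in production code.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return (character count, internal byte count).
sub char_and_byte_length {
    my ($s) = @_;
    my $chars = length $s;
    my $bytes = do { use bytes; length $s };    # internal storage size
    return ($chars, $bytes);
}

my $raw   = "\xc3\xa8\xc2\xab\xc2\x8b\x0a";     # the bytes read from test.txt
my $once  = Encode::decode("utf8", $raw);
my $twice = Encode::decode("utf8", $once);

printf "raw:   %d chars, %d bytes\n", char_and_byte_length($raw);      # 7, 7
printf "once:  %d chars, %d bytes\n", char_and_byte_length($once);     # 4, 7
printf "twice: %d chars, %d bytes\n", char_and_byte_length($twice);    # 2, 4
```

Note that after the second decoding the byte count (4) happens to equal the character count from the previous stage (4), so the number alone does not show which of the two length is reporting.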

Thanks,
Matt


In reply to utf8::decode vs. Encode::decode with regard to the length function by Anonymous Monk
