perlboy_emeritus has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I believe I’ve found a bug in chr.

perl -v (v5.18.2) built for darwin-thread-multi-2level (with 2 registered patches

Running in macOS 10.13.3

I’m working with external geodesic data in UTF-8, so my code includes lines such as:

    use utf8; # Required if using Unicode strings.
    ...
    open(FH, "<:encoding(UTF-8)", "$fileName") || die "Can't open $fileName: $!\n";

When I inspect some of my variables, either in the debugger or with simple print statements such as:

    print "Three required utf8 chars:\n \x{B0}\n \x{2032}\n \x{2033}\n";
    print chr(0xB0), "\n";

the chr statement prints ‘?’ while the first print statement prints the expected Unicode characters:

    Three required utf8 chars:
     °
     ′
     ″
    ?

The print statement is correct, the chr is not. Please advise?

Replies are listed 'Best First'.
Re: Potential bug in chr
by roboticus (Chancellor) on Feb 05, 2018 at 00:58 UTC

    perlboy_emeritus:

    The use utf8; directive only tells perl that your source code file itself is encoded in UTF-8. It doesn't tell perl to perform any automatic conversions on input or output.

    The documentation (perldoc -f chr) explicitly states that chr doesn't encode the characters 128..255 (which includes 0xB0) as UTF-8 internally for backward compatibility reasons.

    However, if you tell perl to add the utf8 encoding to the output stream, then the 0xb0 will be encoded on output as you want:

    $ perl -e 'binmode STDOUT,":utf8"; print chr(0xb0),"\n"'
    °

    Update: I'm not really all that comfortable with Unicode stuff, so reaching for Devel::Peek, I fabricobbled this little thing together:

    $ cat pm1208450.pl
    use strict;
    use warnings;
    use Devel::Peek;

    my $a = chr(0xb0);
    my $b = chr(0x2032);
    Dump($a);
    Dump($b);

    # Combining a byte string and a unicode string converts to unicode
    my $c = $a . $b;
    Dump($c);

    $ perl pm1208450.pl
    SV = PV(0x60002c270) at 0x600079168
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x600069e70 "\260"\0
      CUR = 1
      LEN = 10
    SV = PV(0x60002c310) at 0x600079048
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x60008f670 "\342\200\262"\0 [UTF8 "\x{2032}"]
      CUR = 3
      LEN = 10
    SV = PV(0x60002c340) at 0x6000ed1c8
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x600093a10 "\302\260\342\200\262"\0 [UTF8 "\x{b0}\x{2032}"]
      CUR = 5
      LEN = 10

    This shows that if you happen to join a byte-oriented string with a unicode string in perl, the result will be a unicode string.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Thanks. Got it. I'll include that binmode expression whenever I'm working with external UTF-8 data and need an accurate debugging environment. I already know to assert "<:encoding(UTF-8)" on input file handles but overlooked STDOUT.
        use open ':std', ':encoding(UTF-8)';
        makes far more sense than
        binmode STDOUT, ':utf8';

        It applies binmode to STDIN, STDOUT and STDERR (with the safer :encoding(UTF-8)). It also sets the default encoding for every open in its lexical scope, making the explicit :encoding(UTF-8) in the open redundant.
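
A minimal sketch of that suggestion, assuming a scratch file created with File::Temp (the file and the variable names are illustrative, not from the original post). It writes through an open that names no layer, then reads the file back as raw bytes to show that the pragma supplied the UTF-8 encoding:

```perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';   # STDIN/STDOUT/STDERR plus default layers for open
use File::Temp qw(tempfile);

# Scratch file for the demonstration (an assumption of this sketch).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No layer is named here; the pragma supplies :encoding(UTF-8).
open(my $out, '>', $path) or die "Can't open $path: $!\n";
print {$out} chr(0xB0), "\x{2032}\x{2033}";
close $out or die "Can't close $path: $!\n";

# An explicit :raw overrides the pragma, so we see the bytes on disk.
open(my $raw, '<:raw', $path) or die "Can't open $path: $!\n";
my $bytes = do { local $/; <$raw> };
close $raw;

my $hex = join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
print "$hex\n";   # C2 B0 E2 80 B2 E2 80 B3
```

Each of the three characters, including chr(0xB0), reaches the file as a well-formed UTF-8 sequence.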


        This shows that if you happen to join a byte-oriented string with a unicode string in perl, the result will be a unicode string.

        Which is irrelevant to the question at hand.

        The first print worked because the string contained non-bytes (chars outside of 0..255), which can't be printed without encoding. perl guessed that you meant to encode them using UTF-8 (and warned you about this ("Wide character in...")).

        perl had no way of knowing the second print was wrong because it only contained bytes (chars in 0..255), so it printed the string unaltered.
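
The difference between the two prints can be seen by looking at the bytes that actually reach a handle with no encoding layer. This is only a sketch: bytes_written is a hypothetical helper, and the scratch file from File::Temp is an assumption of the example, not part of the original code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: print $string to a handle that has no encoding
# layer, then return the bytes that reached the file as hex.
sub bytes_written {
    my ($string) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh;                         # raw bytes, no :encoding layer
    {
        no warnings 'utf8';              # silence "Wide character in print"
        print {$fh} $string;
    }
    close $fh or die "Can't close $path: $!\n";

    open(my $in, '<:raw', $path) or die "Can't open $path: $!\n";
    my $bytes = do { local $/; <$in> };
    close $in;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

my $wide   = bytes_written("\x{B0}\x{2032}");   # contains a char above 255
my $narrow = bytes_written(chr(0xB0));          # only chars in 0..255

print "wide:   $wide\n";     # C2 B0 E2 80 B2
print "narrow: $narrow\n";   # B0
```

In the wide case the whole string goes out as UTF-8, so the degree sign becomes C2 B0. In the narrow case the lone byte B0 is written as-is, which is not valid UTF-8 on its own and is why a UTF-8 terminal shows '?'.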