PerlMonks  

Re: The Queensrÿche Situation

by ikegami (Patriarch)
on Oct 19, 2014 at 22:49 UTC ( [id://1104351]=note )


in reply to The Queensrÿche Situation

Why is the "ÿ" not printing correctly here in my terminal?

Your terminal expects UTF-8. You printed chr(0xFF), which is not the UTF-8 encoding of "ÿ".

You can encode it yourself, or you can ask Perl to do it using the following:

use open ':std', ':encoding(UTF-8)';
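
For the "encode it yourself" route, a minimal sketch using the core Encode module might look like the following. The variable names are only illustrative, and the binmode call is there on the assumption that STDOUT has no encoding layer that would encode the bytes a second time.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $text  = "Queensr\x{FF}che";        # 11 code points; \x{FF} is U+00FF
my $bytes = encode('UTF-8', $text);    # 12 bytes; U+00FF becomes C3 BF

binmode(STDOUT);                       # plain bytes out, no extra encoding layer
print $bytes, "\n";
```

With use open ':std', ':encoding(UTF-8)'; in effect you would print $text directly instead, since the layer does the encoding.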

ord() returns 255 for ÿ, a single byte. Encode thinks this is utf-8, but isn't this actually utf-16?

It's not UTF-8 (which would be C3 BF). is_utf8($string) does not indicate whether $string contains UTF-8.

It's not UTF-16 (which would be 00 FF or FF 00 depending on endianness).

Decoding a string (as use utf8; does for literals) results in a string of Unicode Code Points ("ÿ" is U+00FF).
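
To make the byte sequences above concrete, here is a small sketch that encodes U+00FF in each encoding and then decodes the UTF-8 form back to a code point. hex_bytes is a hypothetical helper written for this example, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: format a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', $_ } unpack 'C*', $_[0] }

my $ch = chr(0xFF);                                  # code point U+00FF

my $utf8    = hex_bytes(encode('UTF-8',    $ch));    # C3 BF
my $utf16be = hex_bytes(encode('UTF-16BE', $ch));    # 00 FF
my $utf16le = hex_bytes(encode('UTF-16LE', $ch));    # FF 00

# Decoding the two UTF-8 bytes gives back a single code point.
my $octets  = "\xC3\xBF";
my $decoded = decode('UTF-8', $octets);

print "UTF-8:    $utf8\n";
print "UTF-16BE: $utf16be\n";
print "UTF-16LE: $utf16le\n";
printf "decoded:  U+%04X (length %d)\n", ord($decoded), length($decoded);
```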

This actually looks like valid UTF-8 to me and Encode agrees. Is that correct?

That is the UTF-8 encoding of "Queensrÿche", though it is incorrect to say that is_utf8 signifies that Encode agrees.

Text::Unaccent::PurePerl does not "unaccent" it properly. Why not?

Tools that work with text (such as regular expressions and Text::Unaccent::PurePerl) usually expect the text to be provided as strings of Unicode Code Points, not encoded using UTF-8.
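
As a rough illustration of why the input form matters, the sketch below uses a stand-in for an unaccenting tool rather than Text::Unaccent::PurePerl itself: it decomposes the text with the core Unicode::Normalize module and drops combining marks. strip_marks is a hypothetical helper, and the assumption is that a real unaccenting module is sensitive to its input in a similar way.

```perl
use strict;
use warnings;
use Encode qw( encode );
use Unicode::Normalize qw( NFD );

# Stand-in for an unaccenting tool: decompose, then drop combining marks.
sub strip_marks {
    my ($text) = @_;
    my $nfd = NFD($text);
    $nfd =~ s/\p{Mn}//g;
    return $nfd;
}

my $chars = "Queensr\x{FF}che";         # string of Unicode code points
my $bytes = encode('UTF-8', $chars);    # same text, encoded using UTF-8

my $from_chars = strip_marks($chars);   # the tool sees U+00FF
my $from_bytes = strip_marks($bytes);   # the tool sees U+00C3 U+00BF instead

printf "from code points: %s\n", $from_chars;
printf "from UTF-8 bytes: %s\n",
    $from_bytes eq 'Queensryche' ? 'unaccented' : 'not unaccented properly';
```

Given the encoded bytes, the tool has no way to know that C3 BF was meant as a single "ÿ", so it processes two unrelated characters.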

Is there a way to safely convert them to the same encoding?

The aforementioned

use open ':std', ':encoding(UTF-8)';

will also tell Perl to decode bytes read from file handles.

use strict;
use warnings;
use utf8;                               # Source code is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';    # Default layers for STD* and for open().
use JSON::XS qw( decode_json encode_json );

my $s = "Queensrÿche";
printf("U+%v04X %s\n", $s, $s);

{
    # Uses encoding specified by "use open".
    open(my $fh, '>', 'foo.txt') or die $!;
    print($fh "$s\n");
}

{
    # Uses encoding specified by "use open".
    open(my $fh, '<', 'foo.txt') or die $!;
    chomp( my $got = <$fh> );
    printf("U+%v04X %s\n", $got, $got);
}

{
    # :raw overrides default encoding specified above
    # since encode_json already encodes using UTF-8
    open(my $fh, '>:raw', 'foo.json') or die $!;
    print($fh encode_json( { text => $s } ));
}

{
    my $json = do {
        # Similarly, decode_json expects UTF-8.
        open(my $fh, '<:raw', 'foo.json') or die $!;
        local $/;
        <$fh>
    };
    my $got = decode_json($json)->{text};
    printf("U+%v04X %s\n", $got, $got);
}

Replies are listed 'Best First'.
Re^2: The Queensrÿche Situation
by Rodster001 (Pilgrim) on Oct 19, 2014 at 23:16 UTC
    Got it. So, "is_utf8" just tells us that the utf-8 flag is set?

      Exactly. It merely says which internal storage format is used. It's only useful for debugging XS modules, if at all.

      (Added plain text example to the program in my earlier post.)
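
      A small sketch of the "internal storage format" point, using only the built-in utf8:: functions: the same one-character string is stored both ways, the two copies compare equal, and only is_utf8 can tell them apart. The variable names are just for illustration.

```perl
use strict;
use warnings;

my $downgraded = chr(0xFF);
my $upgraded   = chr(0xFF);

utf8::downgrade($downgraded);   # store as one byte per character
utf8::upgrade($upgraded);       # store using Perl's internal UTF-8 format

# Same string as far as Perl code is concerned.
printf "equal: %s\n", $downgraded eq $upgraded ? 'yes' : 'no';
printf "ord:   %d and %d\n", ord($downgraded), ord($upgraded);

# Only the internal flag differs.
printf "is_utf8: %d and %d\n",
    utf8::is_utf8($downgraded) ? 1 : 0,
    utf8::is_utf8($upgraded)   ? 1 : 0;
```

      Neither copy is "UTF-8" in the sense of bytes ready for your terminal; both still need encoding on output.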
