Backstory: I have a Perl site that stores all its "database" tables as flat JSON documents on disk. I wanted to be sure Unicode is supported throughout the system, so I ran some experiments to verify that I'm reading and writing the files properly.
#!/usr/bin/perl -w

use 5.14.0;
use utf8;
use Encode;
use JSON;

my $json = JSON->new->utf8->pretty();
STDOUT->binmode(":utf8");

my $umbreon = "ブラッキー";
my $test    = "\x{100}\x{2764}";

say "Name: $umbreon (is utf8: " . utf8::is_utf8($umbreon) . ")";
say "Test: $test (is utf8: " . utf8::is_utf8($test) . ")";

my $data = {
    name => $umbreon,
    test => $test,
};

open (my $fh, ">", "utf8-test.txt");
binmode($fh, ":utf8");
print {$fh} "Name: $umbreon\nTest: $test\n";
close($fh);

my $encoded = $json->encode($data);
print "encoded (original): $encoded (is utf8: " . utf8::is_utf8($encoded) . ")\n";

$encoded = Encode::decode("utf-8", $encoded);
print "encoded: $encoded (utf8: " . utf8::is_utf8($encoded) . ")\n";

open (my $fh2, ">", "utf8-test.json");
binmode($fh2, ":utf8");
print {$fh2} $encoded;
close($fh2);

__END__
[kirsle@vostro ~]$ perl utf8-test.pl
Name: ブラッキー (is utf8: 1)
Test: Ā❤ (is utf8: 1)
encoded (original): {
   "test" : "Ä€â¤",
   "name" : "ãƒ–ãƒ©ãƒƒã‚­ãƒ¼"
}
 (is utf8: )
encoded: {
   "test" : "Ā❤",
   "name" : "ブラッキー"
}
 (utf8: 1)
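In other words, JSON->new->utf8 already hands back UTF-8 encoded bytes, so pushing those bytes through a :utf8 filehandle encodes them a second time and you get mojibake on disk. A rough sketch of the two usual ways to keep it to exactly one layer of encoding (this isn't part of the experiment above, and utf8-test.json is just the example file name):

use utf8;
use JSON;

my $data = { name => "ブラッキー" };

# Option 1: let JSON produce the UTF-8 bytes and write them through a raw handle.
my $json_bytes = JSON->new->utf8->pretty;
open(my $raw, ">", "utf8-test.json") or die $!;
print {$raw} $json_bytes->encode($data);
close($raw);

# Option 2: keep everything as characters and let the :utf8 layer encode on write.
my $json_chars = JSON->new->pretty;      # note: no ->utf8 here
open(my $chr, ">:utf8", "utf8-test.json") or die $!;
print {$chr} $json_chars->encode($data);
close($chr);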
From tinkering with this code, here is the helper I ended up with: it recursively decodes (or encodes) an entire data structure.
=head2 data utf8_decode (data)

Recursively UTF8-decode a data structure. Any data structure.

Decoding turns on the UTF-8 flag in Perl and makes Perl treat the data
as a string (so functions like C<length()> are accurate). If you want to
print the data over the network, you need to B<encode> it into bytes.

=cut

sub utf8_decode {
    my $data   = shift;
    my $encode = shift || 0;

    # If it's a data structure (hash or array), recurse over its contents.
    if (ref($data) eq "HASH") {
        foreach my $key (keys %{$data}) {
            # Another data structure?
            if (ref($data->{$key})) {
                # Recurse.
                $data->{$key} = utf8_decode($data->{$key}, $encode);
                next;
            }

            # Encode the scalar.
            $data->{$key} = utf8_decode($data->{$key}, $encode);
        }
    }
    elsif (ref($data) eq "ARRAY") {
        foreach my $key (@{$data}) {
            # Another data structure?
            if (ref($key)) {
                # Recurse.
                $key = utf8_decode($key, $encode);
                next;
            }

            # Encode the scalar.
            $key = utf8_decode($key, $encode);
        }
    }
    else {
        # This is a leaf node in our data (a scalar). Encode UTF-8!
        my $is_utf8 = utf8::is_utf8($data);

        # Are they *encoding* (turning into bytes) instead of *decoding*?
        if ($encode) {
            # Encoding (making a byte stream): only encode if it is currently UTF-8.
            return $data unless $is_utf8;
            $data = Encode::encode("UTF-8", $data);
        }
        else {
            # Decoding. If it's ALREADY UTF-8, do not decode it again.
            return $data if $is_utf8;
            $data = Encode::decode("UTF-8", $data);
        }
    }

    return $data;
}

=head2 data utf8_encode (data)

Recursively UTF8-encode a data structure. B<Encoding> means turning the data
into a byte stream (so string operators like C<length()> will be inaccurate).
Encoding is necessary to transmit a Unicode string over a network.

=cut

sub utf8_encode {
    my $data = shift;
    return utf8_decode($data, 1);
}
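A quick sketch of how these behave on a real structure (the hash here is only for illustration; note that for hashes and arrays the helpers modify the structure in place and also return the same reference):

use utf8;
use Encode;

my $record = { name => "ブラッキー" };
print length($record->{name}), "\n";   # 5  (characters)

utf8_encode($record);                  # leaves are now UTF-8 byte strings
print length($record->{name}), "\n";   # 15 (bytes)

utf8_decode($record);                  # back to character strings
print length($record->{name}), "\n";   # 5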
This is handy when you need to convert strings to or from byte streams. I apply a decode to all the incoming CGI params (just to be sure they're all UTF-8), and when I need to take an md5_hex of a string, I encode it to bytes first (because Digest::MD5 will croak if you try to md5_hex a string with wide characters).
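Roughly what those two cases look like in practice, using the helpers above (a sketch only; the parameter name is made up for the example):

use CGI;
use Digest::MD5 qw(md5_hex);

my $cgi = CGI->new;

# Decode every incoming parameter so the rest of the code sees character strings.
my %param;
foreach my $name ($cgi->param) {
    $param{$name} = utf8_decode(scalar $cgi->param($name));
}

# md5_hex() croaks on wide characters, so turn the string back into bytes first.
my $digest = md5_hex(utf8_encode($param{password}));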