Re: JSON, Data::Dumper and accented chars in utf-8

how do I make the records output through Data::Dumper come out as "Particípio Passado" instead of "Partic\x{c3}\x{ad}pio Passado"?

The string isn't "Particípio Passado"; it's the UTF-8 encoding of "Particípio Passado". If you had "Particípio Passado", Data::Dumper would print one of the following (depending on a couple of factors):

$VAR1 = "Particípio Passado";
$VAR1 = "Partic\x{ed}pio Passado";
$VAR1 = "Partic\355pio Passado";

(The first would only show up correctly if you properly encode your output. We'll get back to that.)

If you're using JSON::XS, this problem arises when you decode UTF-8 encoded JSON without first calling ->utf8 on the parser.

my $json_parser = JSON::XS->new->utf8;
my $data = $json_parser->decode($json_string);
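To make the difference concrete, here is a minimal sketch. It assumes JSON::PP (a core module with the same ->utf8 and ->decode methods) as a stand-in for JSON::XS, and a made-up key name `k`; the JSON is written with explicit byte escapes so the result does not depend on how the source file is encoded.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: core stand-in for JSON::XS, same ->utf8/->decode API

# UTF-8 encoded bytes of {"k":"Particípio"}; the key "k" is hypothetical.
my $json_bytes = qq/{"k":"Partic\xC3\xADpio"}/;

# Without ->utf8 the parser treats every input byte as one character,
# so the two bytes of "í" come back as two separate characters.
my $wrong = JSON::PP->new->decode($json_bytes)->{k};

# With ->utf8 the parser decodes the UTF-8 input first.
my $right = JSON::PP->new->utf8->decode($json_bytes)->{k};

printf "without ->utf8: %d characters\n", length $wrong;   # 11
printf "with ->utf8:    %d characters\n", length $right;   # 10
```

The first result is the "Partic\x{c3}\x{ad}pio" kind of string from the question; the second is the real character string.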

If I do a normal print of the keys, they come right

Given that the string is wrong, that indicates there are other problems (i.e., two wrongs made a right).

First, I bet your Perl source file is encoded using UTF-8, but you didn't tell Perl that using

use utf8;

Secondly, I bet you have a UTF-8 terminal, but you didn't tell Perl that using one of

use open ':std', 'locale';
use open ':std', ':encoding(UTF-8)';
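A rough sketch of why the two omissions cancel out, using Encode directly instead of the pragmas so that it runs the same way regardless of file encoding or terminal. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# What Perl sees for a literal "í" in a UTF-8 source file WITHOUT `use utf8`:
# two unrelated bytes.
my $bytes = "\xC3\xAD";

# What it sees WITH `use utf8`: a single character, U+00ED.
my $chars = decode('UTF-8', $bytes);

# What an :encoding(UTF-8) layer on STDOUT would send for each of them.
my $sent_for_chars = encode('UTF-8', $chars);   # the original two bytes
my $sent_for_bytes = encode('UTF-8', $bytes);   # four bytes, double-encoded

printf "%d %d %d %d\n",
    length $bytes, length $chars,
    length $sent_for_chars, length $sent_for_bytes;   # 2 1 2 4
```

With neither `use utf8` nor an output layer, the two raw bytes pass through untouched and a UTF-8 terminal happens to render them as "í", even though Perl never had the right string. Adding only one of the two fixes breaks that accident; adding both makes it correct on purpose.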


Re^2: JSON, Data::Dumper and accented chars in utf-8 by Ralesk (Pilgrim) on Jan 21, 2012 at 22:08 UTC
I have a feeling that Dumper shouldn’t actually produce ambiguous output, so escaped something is what we should expect from it.

So unless I did something horribly wrong, this is the input (sprinkled with a Unicode character outside Latin-1 range for the sake of example; converted to a UTF-8 encoded byte stream for JSON), the input parsed as JSON, printed to STDOUT on a UTF-8 terminal with your lines added.

use Data::Dumper;
use Encode qw(encode);
use JSON;
use utf8;
use open ":std", ":encoding(UTF-8)";
#use open ":std", ":locale"; ## totally didn't do anything for me

my $j = qq/{ "Particípio passadő": 1 }/;
my $jp = JSON->new->utf8;
my $d = $jp->decode(encode("UTF-8", $j));

print "$j\n";
print Dumper($d);
print Dumper($j);
print Dumper(encode("UTF-8", $j));

The output:

{ "Particípio passadő": 1 }
$VAR1 = {
          "Partic\x{ed}pio passad\x{151}" => 1
        };
$VAR1 = "{ \"Partic\x{ed}pio passad\x{151}\": 1 }";
$VAR1 = '{ "ParticÃpio passadÅ": 1 }';

According to the manual, `eval`ing the Dumper output should give us back the original data, so the escaped wide characters in the string seem right to me. Peculiar is how, when given a UTF-8 byte stream, it will not escape things and dump something awkward instead (last line). With `$Data::Dumper::Useqq` set it produces a better-looking string:

$VAR1 = "{ \"Partic\303\255pio passad\305\221\": 1 }";
Re^3: JSON, Data::Dumper and accented chars in utf-8 by ikegami (Patriarch) on Jan 22, 2012 at 06:21 UTC
I have a feeling that Dumper shouldn’t actually produce ambiguous output, so escaped something is what we should expect from it.

That's why the following should be the default:

local $Data::Dumper::Useqq = 1;
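As a small illustration of that setting (a sketch, not output copied from any particular Data::Dumper version): with Useqq enabled, both a character string and its UTF-8 encoded bytes are dumped as plain ASCII with escapes, and `eval` gives each back unchanged. Terse is also switched on here, purely so the dump is a bare literal that can be `eval`ed under `use strict`.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # escape anything non-printable or non-ASCII
$Data::Dumper::Terse = 1;   # emit just the value, without "$VAR1 = "

my $chars = "Partic\x{ed}pio passad\x{151}";      # character string
my $bytes = "Partic\xC3\xADpio passad\xC5\x91";   # its UTF-8 encoding

my $dump_chars = Dumper($chars);
my $dump_bytes = Dumper($bytes);
print $dump_chars, $dump_bytes;

# The dumps are unambiguous: evaluating them restores the exact input.
my $round_chars = eval $dump_chars;
my $round_bytes = eval $dump_bytes;
print "round trip ok\n" if $round_chars eq $chars && $round_bytes eq $bytes;
```

The exact escape style (octal versus \x{...}) may differ between versions, which is why no literal output is shown.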
Re^3: JSON, Data::Dumper and accented chars in utf-8 [OFF/Gripe] by Ralesk (Pilgrim) on Jan 21, 2012 at 22:11 UTC
Oh, I do highly disapprove of the site mangling my Unicode character ő inside a code block.
Re^4: JSON, Data::Dumper and accented chars in utf-8 [OFF/Gripe] by silentius (Monk) on Jan 21, 2012 at 22:42 UTC
Thank you both for your replies, although they did not solve my problem. I kept searching and solved it simply like this:

use Encode;
use Encode::Escape;
use Data::Dumper;
use JSON;
...
while ($line = <IN>) {
    $strut = from_json($line);
    print decode('unicode-escape', Dumper($strut)) . "\n";
}

This now gives me the output I need, which is the accented chars displayed as they are, since the output is to be redirected to a text file and read by humans in a regular text editor. Thank you all once again.
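For readers without Encode::Escape installed, a similar effect can be approximated with core modules only. This is a sketch under assumptions, not the poster's method: `unescape_dump` is a hypothetical helper, and it only handles the `\x{...}` escapes that Dumper emits for character strings.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: turn \x{...} escapes in Dumper output back into
# the characters they stand for. The result is for human eyes only.
sub unescape_dump {
    my ($dump) = @_;
    $dump =~ s/\\x\{([0-9a-fA-F]+)\}/chr hex $1/ge;
    return $dump;
}

my $data     = { "Partic\x{ed}pio passad\x{151}" => 1 };
my $readable = unescape_dump(Dumper($data));

# Encode the characters as UTF-8 on the way out (file or terminal).
binmode STDOUT, ':encoding(UTF-8)';
print $readable;
```

The trade-off is that the result is no longer safe to `eval`, and a literal backslash followed by `x{...}` in the data would be converted by mistake, so this suits human-readable reports rather than serialization.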
Re^5: JSON, Data::Dumper and accented chars in utf-8 by Ralesk (Pilgrim) on Jan 21, 2012 at 23:04 UTC
Re^4: JSON, Data::Dumper and accented chars in utf-8 [OFF/Gripe] by ikegami (Patriarch) on Jan 22, 2012 at 06:24 UTC
It's not the site that did that; it's your browser. "ő" doesn't exist in Windows-1252, so your browser decided to send "`&#337;`" instead. PerlMonks is displaying "`<code>&#337;</code>`" as "`&#337;`" as it should.
Re^5: JSON, Data::Dumper and accented chars in utf-8 [OFF/Gripe] by Ralesk (Pilgrim) on Jan 22, 2012 at 16:23 UTC