In one of my modules (Date::Manip) I store a bunch of UTF-8 data in a YAML file, which I then load into a Perl data structure. The basic form looks like this:

#!/usr/bin/perl
use strict;
use warnings;
use YAML::Syck;

my @in = <DATA>;
my $in = join("",@in);
my $dat = Load($in);
1;

__DATA__
---
x : &#259;

Note: the &#259; was entered in the question as the UTF-8 character ă, but inside the code block it's displayed as above. There's probably some markup I could use to get it to display properly, but I didn't want to get sidetracked from the problem, so just pretend that &#259; and ă are the same.

YAML::Syck has one property that I haven't found in any of the other YAML (or JSON) modules: it doesn't do any UTF-8 handling, i.e. it doesn't decode the data into Perl's internal character strings. What you put in is what you get out, so if you run the above script in the debugger and dump the value of $dat, you get:

DB<1> p Dumper $dat
$VAR1 = {
          'x' => '&#259;'
        };

Unfortunately, YAML::Syck is perhaps the least supported of the YAML modules and I'd like to switch to one of the more recent modules. If I change the above script to use YAML or YAML::XS (my preferred module), and then run it in the debugger, I get:

DB<1> p Dumper $dat
$VAR1 = {
          'x' => "\x{103}"
        };

That is, the string comes back as a decoded Perl character string (the single code point U+0103) rather than as the UTF-8 encoded bytes I put in. I'm completely open to the option of converting the YAML to JSON, but the JSON and JSON::XS modules do the same thing. I've tried the following script with similar results:

#!/usr/bin/perl
use strict;
use warnings;
use JSON::XS;

my @in = <DATA>;
my $in = join("",@in);
my $dat  = JSON::XS->new->decode($in);
my $dat2 = JSON::XS->new->utf8(0)->decode($in);
my $dat3 = JSON::XS->new->utf8(1)->decode($in);
1;

__DATA__
{ "x" : "&#259;" }
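To make the two representations concrete, here is a minimal sketch using only the core Encode module. It is not part of the scripts above; the variable names are just for illustration. It compares the one-character string that the decoding modules return with the two-byte UTF-8 form of the same text.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# One Perl character, U+0103: the form Dumper prints as "\x{103}".
my $chars = "\x{103}";

# The same text as raw UTF-8 bytes: the form that is stored in the file.
my $bytes = encode_utf8($chars);

# Hex dump of the byte string, for display only.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

printf "chars: length %d\n", length($chars);
printf "bytes: length %d (%s)\n", length($bytes), $hex;
print  "round trip ok\n" if decode_utf8($bytes) eq $chars;
```

Both strings describe the same letter; they differ only in whether Perl holds it as one character or as the two bytes C4 83.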

Obviously, once the data structure is produced, I could recurse through it and encode the Perl character strings back to UTF-8 bytes, but rather than do that, I'll probably just stick with YAML::Syck.
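For reference, the recursive conversion mentioned above could look roughly like the sketch below. The helper name to_utf8_bytes is hypothetical, and the sketch assumes the loaded data contains only hashes, arrays and plain scalars. It uses only the core Encode module and returns a copy rather than modifying the input.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: return a copy of a nested structure in which every
# string (hash keys included) is UTF-8 encoded bytes instead of characters.
sub to_utf8_bytes {
    my ($node) = @_;
    my $type = ref $node;

    if ($type eq 'HASH') {
        my %copy;
        for my $key (keys %$node) {
            $copy{ encode_utf8($key) } = to_utf8_bytes($node->{$key});
        }
        return \%copy;
    }
    if ($type eq 'ARRAY') {
        return [ map { to_utf8_bytes($_) } @$node ];
    }

    # Leave undef and any other kind of reference untouched.
    return $node if $type || !defined $node;

    # Plain scalar: note that numbers are stringified by the encoding step.
    return encode_utf8($node);
}

# Stand-in for what YAML::XS or JSON::XS would hand back.
my $decoded = { x => "\x{103}", list => [ "\x{103}b", 'plain' ] };
my $bytes   = to_utf8_bytes($decoded);
```

After this, $bytes->{x} holds the two bytes C4 83, which matches what YAML::Syck returns for the same input.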

Any suggestions, or do I just stick to YAML::Syck?


In reply to UTF8 with YAML or JSON by SBECK
