Bod has asked for the wisdom of the Perl Monks concerning the following question:

Could you please help me with this little script?

I am trying to pre-parse (i.e. check that I can get the data I need) a JSON file from the UK Charity Commission. The data source is not known for being well-formatted, but I am pretty sure the issue here is with my code and my (lack of) understanding of character encoding!

use strict;
use warnings;
use Data::Dumper;
use utf8;
use JSON;

$/ = undef;

open my $fh, '<:encoding(UTF-8)', 'publicextract.charity.json' or die $!;
my $data = <$fh>;
close $fh;

$data =~ s/^\x{feff}//; # Strip off BOM

my $json = decode_json $data; # <-- Wide character in subroutine entry at json.pl line 15.

foreach my $j(@$json) {
    print "$j->{'charity_contact_phone'},$j->{'charity_contact_email'}\n";
}

print "\nComplete!\n\n";

The documentation for decode_json says it takes a UTF8 encoded string. So I have saved the JSON file as UTF8 using TextPad and opened the file with the same encoding. But decode_json croaks with "Wide character in subroutine entry".

What have I overlooked?

Re: Wide characters and UTF8
by afoken (Chancellor) on Nov 08, 2023 at 16:21 UTC
    The documentation for decode_json says it takes a UTF8 encoded string.

    This is the doc:

    decode_json
    $perl_scalar = decode_json $json_text

    The opposite of encode_json: expects an UTF-8 (binary) string and tries to parse that as an UTF-8 encoded JSON text, returning the resulting reference. Croaks on error.

    (Emphasis mine)

    decode_json expects BYTES, not already-decoded CHARACTERS. Feed it the non-decoded file contents (i.e. open the file raw, not with :encoding) and everything should work.
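A minimal sketch of that fix, not a tested drop-in for the original script. Assumptions: the core module JSON::PP stands in for JSON (it offers the same decode_json interface), and decode_json_file is a hypothetical helper name. Note that when the file is read raw, a UTF-8 BOM arrives as the three bytes EF BB BF, so the \x{feff} substitution from the question would no longer match.

```perl
use strict;
use warnings;
use JSON::PP;    # core module, assumed here as a stand-in for JSON

# Hypothetical helper: read a UTF-8 JSON file as raw bytes and parse it.
sub decode_json_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };    # slurp the whole file, undecoded
    close $fh;
    $bytes =~ s/^\xEF\xBB\xBF//;           # a UTF-8 BOM is three bytes when read raw
    return decode_json($bytes);            # decodes UTF-8 and parses in one step
}

my $file = 'publicextract.charity.json';
if (-e $file) {
    my $records = decode_json_file($file);
    # The parsed strings are characters, so encode them again on output.
    binmode STDOUT, ':encoding(UTF-8)';
    for my $j (@$records) {
        my @fields = map { $_ // '' }
            @{$j}{qw(charity_contact_phone charity_contact_email)};
        print join(',', @fields), "\n";
    }
}
```

The binmode on STDOUT is there because printing decoded characters above 255 to an unencoded handle produces a "Wide character in print" warning.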

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      Thanks - that has got me much further...

      So I am clear, is the :encoding in open telling Perl how the file is currently encoded or is it instructing Perl to encode the data?

        The important thing to know about Perl's Unicode support is that Perl does not track whether a scalar holds bytes or characters. You, the programmer, need to keep track of whether you have a string of bytes or a string of Unicode characters. The easiest way to do this is to always decode bytes (such as UTF-8 or UTF-16) into characters the moment they enter the program, as your ":encoding(UTF-8)" mode does.
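To make the bytes-versus-characters distinction concrete, here is a small sketch using the core Encode module. The sample word is made up for illustration; the point is only that the same text has a different length depending on whether it is still encoded.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "cafe" with an acute accent on the e, as UTF-8 bytes:
# the accented letter takes two bytes (C3 A9).
my $bytes = "caf\xC3\xA9";

# Decoding turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

printf "%d bytes, %d characters\n", length $bytes, length $chars;

# Encoding goes the other way, for when the text leaves the program.
my $round = encode('UTF-8', $chars);
print "round trip ok\n" if $round eq $bytes;
```

Nothing in $bytes or $chars records which kind of string it is, which is why the bookkeeping falls to the programmer.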

        As it happens, the decode_json function expects bytes as input, assuming you haven't done the decoding yet, and then it both decodes UTF-8 and parses JSON at the same time. On the other hand, if you say JSON->new->decode($string) that assumes you provided it with a unicode string.

        So in summary:

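        # Raw bytes in: decode_json decodes UTF-8 and parses JSON
        open my $fh, '<', $filename or die $!;
        my $bytes = do { local $/; <$fh> };
        my $data  = decode_json($bytes);

        or

        # Characters in: the :encoding layer decodes, JSON->new->decode only parses
        open my $fh, '<:encoding(UTF-8)', $filename or die $!;
        my $chars = do { local $/; <$fh> };
        my $data  = JSON->new->decode($chars);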

        The :encoding tells Perl what encoding the data in the file is in, and Perl will then decode the data and give you Unicode strings.
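As a rough check of the two pairings, the sketch below writes a small UTF-8 file, reads it both ways, and also tries the mismatched combination from the original question. Assumptions: JSON::PP (core) stands in for JSON, slurp is a hypothetical helper, and the charity_name field and its value are invented sample data, chosen because the value contains a character above 255.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use JSON::PP;    # core module, assumed here as a stand-in for JSON

# Invented sample record; \xC5\x82 is the UTF-8 encoding of U+0142.
my ($out, $filename) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} qq([{"charity_name":"Z\xC5\x82ota Trust"}]);
close $out or die $!;

# Hypothetical helper: read a whole file through the given open mode.
sub slurp {
    my ($mode, $file) = @_;
    open my $fh, $mode, $file or die "Cannot open $file: $!";
    local $/;    # slurp mode
    my $content = <$fh>;
    close $fh;
    return $content;
}

my $bytes = slurp('<:raw',             $filename);    # undecoded bytes
my $chars = slurp('<:encoding(UTF-8)', $filename);    # decoded characters

# Matching pairs: bytes with decode_json, characters with ->decode.
my $from_bytes = decode_json($bytes);
my $from_chars = JSON::PP->new->decode($chars);

# Mismatched pair, as in the original script: characters into decode_json.
my $mismatch_ok    = eval { decode_json($chars); 1 };
my $mismatch_error = $mismatch_ok ? '' : $@;

print "same result\n"
    if $from_bytes->[0]{charity_name} eq $from_chars->[0]{charity_name};
print "mismatch croaked\n" unless $mismatch_ok;
```

Both matching pairs should yield the same character string, while the mismatched call dies because decode_json cannot treat a string containing characters above 255 as bytes.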