Wide characters and UTF8

Bod has asked for the wisdom of the Perl Monks concerning the following question:

Could you please help me with this little script?

I am trying to pre-parse (i.e. check that I can get the data I need) a JSON file from the UK Charity Commission. The data source is not known for being well-formatted, but I am pretty sure the issue here is with my code and my (lack of) understanding of character encoding!

use strict;
use warnings;

use Data::Dumper;
use utf8;
use JSON;

$/ = undef;
open my $fh, '<:encoding(UTF-8)', 'publicextract.charity.json' or die $!;
my $data = <$fh>;
close $fh;

$data =~ s/^\x{feff}//;  # Strip off BOM

my $json = decode_json $data;  #  <-- Wide character in subroutine entry at json.pl line 15.

foreach my $j(@$json) {
    print "$j->{'charity_contact_phone'},$j->{'charity_contact_email'}\n";
}

print "\nComplete!\n\n";

The documentation for decode_json says it takes a UTF8 encoded string. So I have saved the JSON file as UTF8 using TextPad and opened the file with the same encoding. But decode_json is croaking with "Wide character in subroutine entry".

What have I overlooked?


Replies are listed 'Best First'.
Re: Wide characters and UTF8 by afoken (Chancellor) on Nov 08, 2023 at 16:21 UTC
"The documentation for decode_json says it takes a UTF8 encoded string."

This is the doc:

`decode_json`
`$perl_scalar = decode_json $json_text`
The opposite of encode_json: expects an UTF-8 (binary) string and tries to parse that as an UTF-8 encoded JSON text, returning the resulting reference. Croaks on error.

(Emphasis mine)

decode_json expects BYTES, not UTF-8 CHARACTERS. Feed it the non-decoded file (i.e. `open` raw, not with `:encoding`) and everything should work.

Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
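For anyone who wants to see the failure without the Charity Commission file, here is a minimal sketch that reproduces it in memory. It assumes the core module JSON::PP as a stand-in for JSON (it exports the same `decode_json`), and the sample JSON text is made up for illustration. The string contains U+20AC, a code point above 255, which is what makes a character string unacceptable as "bytes".

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);   # assumption: core JSON::PP standing in for JSON
use Encode qw(encode);

# A decoded (character) string, as you would get from an :encoding(UTF-8) handle.
# It contains U+00E9 and U+20AC; the sample text itself is hypothetical.
my $chars = qq({"name":"caf\x{e9} \x{20ac}5"});

# Handing characters to decode_json croaks, as in the original script.
my $err = eval { decode_json($chars); 1 } ? '' : $@;

# Encoding the characters back to UTF-8 bytes gives decode_json what it expects.
my $bytes = encode('UTF-8', $chars);
my $data  = decode_json($bytes);

print "error from character input: $err";
printf "decoded name has %d characters\n", length $data->{name};
```

In the original script there is no need to encode anything again: opening the file without the `:encoding` layer already yields the bytes.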
Re^2: Wide characters and UTF8 by Bod (Parson) on Nov 08, 2023 at 17:50 UTC
Thanks - that has got me much further... Just so I am clear, is the `:encoding` in `open` telling Perl how the file is currently encoded, or is it instructing Perl to encode the data?
Re^3: Wide characters and UTF8 by NERDVANA (Priest) on Nov 08, 2023 at 19:31 UTC
The important thing to know about Perl's Unicode support is that Perl does not track whether a scalar holds bytes or characters. You, the programmer, need to keep track of whether you have a string of bytes or a string of Unicode characters. The easiest way to do this is to always decode bytes (like UTF-8 or UTF-16) into characters the moment they enter the program, as with your `":encoding(UTF-8)"` mode.

As it happens, the `decode_json` function expects bytes as input, assuming you haven't done the decoding yet, and then it both decodes UTF-8 and parses JSON at the same time. On the other hand, if you say `JSON->new->decode($string)`, that assumes you provided it with a Unicode string.

So in summary:

`open my $fh, '<', $filename; $bytes= <$fh>; $data= decode_json($bytes);`

or

`open my $fh, '<:encoding(UTF-8)', $filename; $chars= <$fh>; $data= JSON->new->decode($chars);`
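The two routes above can be sketched as a complete script. This is only an illustration under a few assumptions: it uses the core JSON::PP in place of JSON, writes its own small temporary file rather than reading publicextract.charity.json, and the `slurp` helper and the `charity_name` field are hypothetical. It also shows one detail that follows from the bytes/characters split: on the raw route the BOM is the three bytes EF BB BF, while on the decoded route it is the single character U+FEFF.

```perl
use strict;
use warnings;
use JSON::PP;                     # assumption: core JSON::PP standing in for JSON
use File::Temp qw(tempfile);

# Write a small UTF-8 file with a BOM so the sketch is self-contained.
my ($tmp, $filename) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xEF\xBB\xBF",
    qq([{"charity_contact_email":"info\@example.org","charity_name":"Caf\xC3\xA9 \xE2\x82\xAC"}]);
close $tmp or die "close: $!";

# Hypothetical helper: read a whole file through the given open mode.
sub slurp {
    my ($mode, $file) = @_;
    open my $fh, $mode, $file or die "$file: $!";
    local $/;                     # slurp mode, restored when the sub returns
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Route 1: keep bytes, let decode_json do the UTF-8 decoding.
my $bytes = slurp('<:raw', $filename);
$bytes =~ s/^\xEF\xBB\xBF//;      # BOM as three raw bytes
my $from_bytes = decode_json($bytes);

# Route 2: decode on read, then parse a character string.
my $chars = slurp('<:encoding(UTF-8)', $filename);
$chars =~ s/^\x{FEFF}//;          # BOM as one decoded character
my $from_chars = JSON::PP->new->decode($chars);

print "same result either way\n"
    if $from_bytes->[0]{charity_name} eq $from_chars->[0]{charity_name};
```

Either route ends with the same Perl data structure holding character strings; the only rule is not to mix them, i.e. do not decode on read and then call `decode_json`.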
Re^3: Wide characters and UTF8 by Corion (Patriarch) on Nov 08, 2023 at 17:53 UTC
The `:encoding` tells Perl what encoding the data in the file is in, and Perl will then decode the data and give you Unicode strings.
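A small sketch of that behaviour, using in-memory filehandles so no real file is needed (the variable names are just for illustration). On a read handle the layer decodes bytes into characters; on a write handle the same layer encodes characters into bytes.

```perl
use strict;
use warnings;

# The UTF-8 encoding of U+20AC (euro sign): three bytes.
my $bytes = "\xE2\x82\xAC";

# Reading without a decoding layer gives the bytes back unchanged.
open my $raw, '<:raw', \$bytes or die "open: $!";
my $as_bytes = <$raw>;
close $raw;

# Reading through :encoding(UTF-8) decodes them into one character.
open my $dec, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $as_chars = <$dec>;
close $dec;

# Writing through the same layer goes the other way: characters in, bytes out.
my $buf = '';
open my $out, '>:encoding(UTF-8)', \$buf or die "open: $!";
print {$out} "\x{20AC}";
close $out or die "close: $!";

printf "raw read:     length %d\n", length $as_bytes;
printf "decoded read: length %d, code point U+%04X\n", length $as_chars, ord $as_chars;
printf "encoded write: %d bytes\n", length $buf;
```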