cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks. I am trying to read a file full of JSON objects representing Twitter posts like so:
use feature ':5.10';
use JSON::XS;

my $path = 'C:\Downloads';
$path =~ s/\\/\//g;
my $fn = 'twitter_raw.json';
open(IN, "< :encoding(UTF-8)", "$path/$fn") or die "Can't open input: $!\n";
while (<IN>) {
    chop;
    my $j = decode_json($_);
}
The decode fails on the first try with "Wide character in subroutine entry at readtweets.pl line 15, <IN> line 1." Here is the string it is choking on. The error doesn't say which character is the problem, and I don't see anything amiss in the string. If I save it as example.json it opens in Firefox with no parsing errors. How can I debug this?

Replies are listed 'Best First'.
Re: JSON::XS Wide Character Problem
by 1nickt (Canon) on Jun 04, 2022 at 11:14 UTC

    Hi, JSON::XS does UTF8 decoding by default when you use the functional interface (see the documentation), so you don't need the encoding layer on your open call.
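    (As a minimal sketch of the same point, not from the original reply: if you prefer to keep the :encoding(UTF-8) layer on the filehandle, the OO interface with the utf8 flag left off expects already-decoded character strings, so the two approaches don't collide. The example string here is made up.)

```perl
use strict;
use warnings;
use JSON::XS;

# Sketch: JSON::XS->new leaves the utf8 flag off, so decode() expects
# a Perl character string -- exactly what an :encoding(UTF-8) layer yields.
my $coder = JSON::XS->new;                # all flags off by default
my $chars = qq({"text":"caf\x{e9}"});     # an already-decoded character string
my $data  = $coder->decode($chars);
print $data->{text}, "\n";
```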

    In this example I am allowing JSON::XS to decode from UTF8, and then encoding back to UTF8 when I want to print part of the data.

    use strict;
    use warnings;
    use JSON::XS;
    use Path::Tiny;
    use Encode 'encode_utf8';

    my $path = '~/monks/json.txt';
    my $json = path($path)->slurp;
    my $data = decode_json($json);
    my $text = encode_utf8($data->{text});
    print $text, "\n";

    __END__

    Hope this helps!


    The way forward always starts with a minimal test.
      That did it, and it also solved the problem of processing a large file full of those objects (see reply above).
Re: JSON::XS Wide Character Problem
by graff (Chancellor) on Jun 04, 2022 at 00:05 UTC
    Use "slurp" mode to read the entire file content into a single scalar variable, and DO NOT USE "chop" on that string. UPDATE: Also, apparently you don't need to use ":encoding(UTF-8)" when opening the file. This seems to work:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use JSON::XS;

    open(IN, "example.json") or die $!;
    $/ = undef;
    $_ = <IN>;
    my $j = decode_json($_);
    print "ok\n";
    print "keys are: " . join("\n", keys(%$j)) . "\n";
      The thing is, I have a huge file with like a million of those objects like the one I included in the pastebin. I'm afraid trying to read in the whole thing and then split(/\n/) it might lead to memory issues. That would also effectively chop the strings. Is there a way to read it one line (= one JSON object) at a time?
        An important feature of (and reason for using) JSON data and JSON::XS is that you never need to call split() on the input text.

        If the actual size of your "huge file with like a million of those objects" really is known for certain to be problematic for the memory capacity on the machine you're using, read the section of the JSON::XS manual that talks about "INCREMENTAL PARSING". Also learn about the $json->shrink() function.

        UPDATE: In particular, look at the section of the manual that contains this sentence: "Assume that you have a gigantic JSON array-of-objects, many gigabytes in size, and you want to parse it, but you cannot load it into memory fully (this has actually happened in the real world :)."
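        (A minimal sketch of that INCREMENTAL PARSING interface, assuming a file of concatenated JSON objects; the filename and chunk size are hypothetical:)

```perl
use strict;
use warnings;
use JSON::XS;

# Feed the file to the parser in fixed-size chunks; drain complete
# objects as they become available, so memory use stays bounded.
my $json = JSON::XS->new->utf8;            # expects raw UTF-8 bytes
open my $fh, '<:raw', 'twitter_raw.json' or die $!;
while (read $fh, my $buf, 65536) {
    $json->incr_parse($buf);               # append chunk to internal buffer
    while (defined(my $obj = $json->incr_parse)) {
        # process one complete object at a time
    }
}
```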

        Some json files do not have line-breaks at all (and those that do may vary as to "CRLF" vs. "LF" style). Even if you think you're very confident about knowing the format/layout of the json data, I'd say it's virtually never a good idea to treat json data as line-oriented input. Don't do that.

        > Is there a way to successfully read it a line (= one json object) at a time?

        If there is a good separator like blank line(s), you can set the input record separator $/ accordingly.
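        (A minimal sketch of that, assuming the objects happen to be separated by blank lines; the data layout and filename are hypothetical:)

```perl
use strict;
use warnings;
use JSON::XS;

local $/ = "";                             # "paragraph mode": records end at blank lines
open my $fh, '<:raw', 'twitter_raw.json' or die $!;
while (my $record = <$fh>) {
    chomp $record;                         # strip the trailing newlines
    my $obj = decode_json($record);        # each record is one JSON object
    # process $obj ...
}
```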

        Of course, you could also read line by line if you are sure that those JSON strings never include "\n".
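        (Under that assumption -- one complete JSON object per line, no embedded "\n" -- a minimal sketch; the filename is hypothetical:)

```perl
use strict;
use warnings;
use JSON::XS;

open my $fh, '<:raw', 'twitter_raw.json' or die $!;
while (my $line = <$fh>) {
    chomp $line;
    next unless $line =~ /\S/;             # skip blank lines
    my $obj = decode_json($line);          # decode_json wants raw UTF-8 bytes
    # process $obj ...
}
```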

        cheers

        LanX (logged out)