The thing is, I have a huge file with about a million objects like the one I included in the pastebin. I'm afraid that trying to read in the whole thing and then split(/\n/) it might lead to memory issues. That would also effectively chop the strings. Is there a way to successfully read it a line (= one JSON object) at a time?
An important feature of / reason for using json data and JSON::XS is that you never need to use split() on the input text.
If the actual size of your "huge file with like a million of those objects" really is known for certain to be problematic for the memory capacity on the machine you're using, read the section of the JSON::XS manual that talks about "INCREMENTAL PARSING". Also learn about the $json->shrink() function.
UPDATE: In particular, look at the section of the manual that contains this sentence: "Assume that you have a gigantic JSON array-of-objects, many gigabytes in size, and you want to parse it, but you cannot load it into memory fully (this has actually happened in the real world :)."
Some json files do not have line-breaks at all (and those that do may vary as to "CRLF" vs. "LF" style). Even if you think you're very confident about knowing the format/layout of the json data, I'd say it's virtually never a good idea to treat json data as line-oriented input. Don't do that.
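The incremental parsing mentioned above could look roughly like the following. This is only a minimal sketch: the sample data, the in-memory filehandle and the tiny 16-byte chunk size are assumptions made for demonstration (a real run would open the actual file and use a much larger chunk). It relies on incr_parse() returning every completed object when called in list context, so neither line breaks nor chunk boundaries matter.

```perl
use strict;
use warnings;
use JSON::XS;

# Hypothetical sample input standing in for the huge file: three objects,
# one of them spread over two lines, separated by a mix of LF and CRLF.
my $data = qq({"id":1,"text":"first"}\n{"id":2,\n "text":"second"}\r\n{"id":3,"text":"third"}\n);
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my $json = JSON::XS->new;
my @ids;

# Feed the parser fixed-size chunks; a chunk may end in the middle of an object.
while (read $fh, my $buf, 16) {
    # In list context, incr_parse returns all objects completed so far
    # and keeps any unfinished remainder in its internal buffer.
    for my $obj ($json->incr_parse($buf)) {
        push @ids, $obj->{id};
    }
}
close $fh;

print "ids: @ids\n";
```

Only one chunk plus any not-yet-complete object is held in memory at a time, which is the point of this approach for a file with a million objects.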
> Is there a way to successfully read it a line (= one json object) at a time?
If there is a good separator like blank line(s), you can set the input record separator $/ accordingly.
Of course, you could also read line by line if you are sure that those JSON strings never include "\n".
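As a rough illustration of the separator idea, here is a sketch that assumes the objects are separated by one or more blank lines (the sample data is made up, not taken from the original file). Setting $/ to the empty string switches readline into paragraph mode, so each read returns one blank-line-delimited record.

```perl
use strict;
use warnings;
use JSON::XS;

# Hypothetical sample input: objects separated by one or more blank lines.
my $data = qq({"id":1,\n "text":"first"}\n\n{"id":2,"text":"second"}\n\n\n{"id":3,"text":"third"}\n);
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my @texts;
{
    # Paragraph mode: a run of blank lines ends a record.
    local $/ = "";
    while (my $record = <$fh>) {
        # Whitespace around the object is ignored by the decoder.
        my $obj = decode_json($record);
        push @texts, $obj->{text};
    }
}
close $fh;

print join(", ", @texts), "\n";
```

If every object really sits on exactly one line, the same loop works with the default $/ instead of paragraph mode.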
cheers
LanX (logged out)
UPDATE
The suggested use of JSON::XS#INCREMENTAL-PARSING looks like a much better solution for this kind of parsing problem. graff++
> If there is a good separator like blank line(s), you can set the input record separator $/ accordingly.
You didn't show us more than one object and your link is broken, so I had to guess that they all start with '{"created_at":' at the start of a line.
I also shortened your objects for better demonstration.
use v5.12.0;
use warnings;
use Data::Dump;
use JSON::XS;

# every object is assumed to start with this text at the start of a line
my $ident = '{"created_at":';
local $/ = "\n$ident";   # a record ends where the next object begins
my $prefix = "";         # the first record still carries its own $ident

while (<DATA>) {
    chomp;    # removes $/ from the end
    my $obj = "$prefix$_";
    ddx $obj;
    say "-" x 30;
    dd decode_json($obj);
    #say "-" x 72;
    $prefix = $ident;    # later records lost their $ident to the previous chomp
}

__DATA__
{"created_at":"Sat Mar 02 18:45:26 +0000 2013","id":307924626426695681,"id_str":"307924626426695681","etc":"***YADDA YADDA ***","id_str":"2621098741943851970"}
{"created_at":"Sat Mar 02 18:45:26 +0000 2013","id":307924626426695681,"id_str":"307924626426695681","etc":"***YADDA YADDA ***","id_str":"2621098741943851970"}
{"created_at":"Sat Mar 02 18:45:26 +0000 2013","id":307924626426695681,"id_str":"307924626426695681","etc":"***YADDA YADDA ***","id_str":"2621098741943851970"}
Output:

# raw_json.pl:16: "{\"created_at\":\"Sat Mar 02 18:45:26 +0000 2013\",\"id\":307924626426695681,\"id_str\":\"307924626426695681\",\"etc\":\"***YADDA YADDA ***\",\"id_str\":\"2621098741943851970\"}"
------------------------------
{
  created_at => "Sat Mar 02 18:45:26 +0000 2013",
  etc => "***YADDA YADDA ***",
  id => 307924626426695681,
  id_str => 2621098741943851970,
}
# raw_json.pl:16: "{\"created_at\":\"Sat Mar 02 18:45:26 +0000 2013\",\"id\":307924626426695681,\"id_str\":\"307924626426695681\",\"etc\":\"***YADDA YADDA ***\",\"id_str\":\"2621098741943851970\"}"
------------------------------
{
  created_at => "Sat Mar 02 18:45:26 +0000 2013",
  etc => "***YADDA YADDA ***",
  id => 307924626426695681,
  id_str => 2621098741943851970,
}
# raw_json.pl:16: "{\"created_at\":\"Sat Mar 02 18:45:26 +0000 2013\",\"id\":307924626426695681,\"id_str\":\"307924626426695681\",\"etc\":\"***YADDA YADDA ***\",\"id_str\":\"2621098741943851970\"}\n"
------------------------------
{
  created_at => "Sat Mar 02 18:45:26 +0000 2013",
  etc => "***YADDA YADDA ***",
  id => 307924626426695681,
  id_str => 2621098741943851970,
}
> those JSON strings never include "\n"
From what I understand, literal line-break characters (CR, LF and the like) are not allowed inside JSON strings. They must be escaped, e.g. as \n for a newline (or \\n, depending on the escape rules of the source language).
BUT line breaks are allowed outside strings, between the elements of your object. They are just formatting and are ignored (hence you don't need to chomp them either).
So whether you can assume that one object always sits on one line depends on the source.
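A small check of these points, as a sketch with made-up sample strings: the encoder escapes a real line feed inside a string, the decoder accepts line breaks between elements, and a raw line feed inside a string is rejected by JSON::XS with its default settings.

```perl
use strict;
use warnings;
use JSON::XS;

my $json = JSON::XS->new;

# A Perl string containing a real line feed is written with an escaped \n.
my $encoded = $json->encode({ text => "line one\nline two" });
my $has_raw_newline = $encoded =~ /\n/ ? 1 : 0;
print "encoded: $encoded\n";

# Line breaks between elements are plain whitespace and are ignored.
my $pretty = qq({\n  "a": 1,\n  "b": 2\n}\n);
my $obj = $json->decode($pretty);
print "b = $obj->{b}\n";

# A raw line feed inside a string literal is not valid JSON.
my $bad    = qq({"text":"line one\nline two"});
my $parsed = eval { $json->decode($bad) };
print "raw newline inside a string was rejected\n" unless defined $parsed;
```

So a producer that emits compact JSON gives you one object per line, while a pretty-printing producer does not, even though both are valid.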