in reply to Re^3: Text storage/retrieval
in thread Text storage/retrieval

Hence my "at least for simple data". My example uses tabs and newlines as delimiters, so those characters cannot appear in the data.
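
For instance, a value that happens to contain a tab gets silently truncated at the delimiter on the way back in. A minimal illustration (the value is made up, not from the benchmark):

use strict;

my $value = "has\ttab";                  # a value containing the delimiter
my $line  = "key1\t$value\t\n";          # serialized the same way as the TSV example
my ($k, $v) = split /\t/, $line;         # reading it back
print "stored 'has\\ttab', got '$v'\n";  # got 'has' -- the rest is lost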

That said, with something like JSON::XS you can load the data even faster than either of the previous two examples. And of course JSON gives you escaping, multiline strings, and so on.

use strict;
use JSON;

# Slurp and decode the entire __DATA__ section in one go.
my %hash = %{from_json(do{ local $/ = <DATA> })};

__DATA__
{
   "440035528809":"6946395707444",
   "332679554392":"162874763688655",
   "913537320343":"56726180700920"
}

With a hash of 500_000 entries, I get:

Standard Perl hash...
11.04user 0.31system 0:11.58elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41519minor)pagefaults 0swaps
Reading TSV from __DATA__...
6.15user 0.14system 0:06.38elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+13860minor)pagefaults 0swaps
Reading JSON from __DATA__...
4.25user 0.26system 0:04.64elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+38709minor)pagefaults 0swaps

Of course, loading the JSON module introduces some overhead, so on smaller datasets the other techniques beat it. With a hash of 1000 entries, I get:
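(As a rough check on that fixed cost, you can time loading the module by itself; something like:

time perl -MJSON -e1

should show the compile-time overhead that the 1000-entry run is mostly paying for.)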

Standard Perl hash...
0.03user 0.00system 0:00.04elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+629minor)pagefaults 0swaps
Reading TSV from __DATA__...
0.01user 0.00system 0:00.02elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+566minor)pagefaults 0swaps
Reading JSON from __DATA__...
0.10user 0.00system 0:00.11elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+871minor)pagefaults 0swaps

The break-even point seems to be around 5000 hash entries, where JSON::XS starts winning over a hard-coded Perl hash; at around 12000 entries it starts winning over tab-delimited data too.
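
If you want to find the crossover on your own hardware, a rough driver along these lines works (a sketch only; gen_files is a hypothetical sub you'd get by wrapping the generator from the benchmark code below so that it takes the entry count as an argument):

use 5.010;

for my $n (1_000, 5_000, 12_000, 50_000) {
    gen_files($n);    # hypothetical: the generator below, wrapped in a sub
    say "--- $n entries ---";
    system("time perl $_") for qw(perl.pl data.pl json.pl);
    unlink 'perl.pl', 'data.pl', 'json.pl';
}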

My benchmarking code is:

use 5.010;

# Generate the three test scripts.
open my $perl, '>', 'perl.pl';
open my $data, '>', 'data.pl';
open my $json, '>', 'json.pl';

print $perl <<'CODE';
use strict;
my %hash = (
CODE

print $data <<'CODE';
use strict;
my %hash;
while (<DATA>) {
    my ($k, $v) = split /\t/o;
    $hash{$k} = $v;
}
__DATA__
CODE

print $json <<'CODE';
use strict;
use JSON;
my %hash = %{from_json(do{ local $/ = <DATA> })};
__DATA__
{
CODE

# Append the same random key/value pairs to each script.
# ($last was 500_000, 1000, etc. for the figures quoted above.)
my $last = 100_000;
for (1 .. $last) {
    my $k = int rand 1_000_000_000_000;
    my $v = int rand 1_000_000_000_000_000;
    my $comma = $_==$last ? '' : ',';   # JSON forbids a trailing comma
    print $perl "$k=>'$v',\n";
    print $data "$k\t$v\t\n";
    print $json "\"$k\":\"$v\"$comma\n";
}

# Close off the Perl hash and the JSON object.
print $perl <<'CODE';
);
CODE

print $json <<'CODE';
}
CODE

close $perl;
close $json;
close $data;

# Run each script under time(1).
say "Standard Perl hash...";
system("time perl perl.pl");
say "Reading TSV from __DATA__...";
system("time perl data.pl");
say "Reading JSON from __DATA__...";
system("time perl json.pl");

unlink "perl.pl";
unlink "data.pl";
unlink "json.pl";

This example doesn't include any newline characters in the data, but with the JSON::XS approach, embedded newlines in the strings (properly escaped according to JSON syntax) don't seem to make a significant difference to performance.
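
For example (a small sketch with the same JSON module; the string is invented):

use strict;
use JSON;

# to_json escapes the newline as \n, so it can't collide with a record
# separator; from_json restores it intact.
my $json = to_json({ note => "line one\nline two" });
print "$json\n";             # {"note":"line one\nline two"}
my $hash = from_json($json);
print $hash->{note}, "\n";   # prints both lines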