
Why go through the process of having to convert text to a hash at runtime, every time, and load all languages for every run?

Putting each language in its own file would certainly be an improvement - I grant you that.

However, converting text to the hash at run time is actually *faster* than hardcoding the hash, at least for simple data.

Yes, that's right. This:

my %hash;
while (<DATA>) {
    chomp;   # strip the newline so it doesn't end up in $v
    my ($k, $v) = split /\t/o;
    $hash{$k} = $v;
}
__DATA__
440035528809	6946395707444
332679554392	162874763688655
913537320343	56726180700920

is faster than this:

my %hash = (
    440035528809 => '6946395707444',
    332679554392 => '162874763688655',
    913537320343 => '56726180700920',
);

Or at least it is once you've got more than a few hundred entries in the hash.

It seems counter-intuitive, but it makes sense when you think about it. In the first example we're parsing a very simple text format using Perl (and Perl is very fast at text handling!); in the second we're parsing a programming language using C.

If you're interested in benchmarks, the following script generates two Perl scripts called perl.pl and data.pl:

use 5.010;

open my $perl, '>', 'perl.pl';
open my $data, '>', 'data.pl';

print $perl <<'CODE';
use strict;
my %hash = (
CODE

print $data <<'CODE';
use strict;
my %hash;
while (<DATA>) {
    my ($k, $v) = split /\t/o;
    $hash{$k} = $v;
}
__DATA__
CODE

for (0 .. 100_000) {
    my $k = int rand 1_000_000_000_000;
    my $v = int rand 1_000_000_000_000_000;
    print $perl "$k=>'$v',\n";
    print $data "$k\t$v\t\n";   # trailing tab keeps the newline out of $v
}

print $perl <<'CODE';
);
CODE

In my tests, data.pl (which reads data from __DATA__) is about 40% faster than perl.pl.
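For a rough in-process comparison that skips generating files, here's a sketch using the core Benchmark module (the 10_000-pair size is arbitrary, and it times a string eval of the hash literal rather than perl compiling a standalone script, so it only approximates the timings above):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build the same 10_000 random pairs in both representations:
# tab-separated text, and Perl hash-literal source code.
my ($tsv, $code) = ('', '');
for (1 .. 10_000) {
    my ($k, $v) = (int rand 1_000_000_000_000, int rand 1_000_000_000_000_000);
    $tsv  .= "$k\t$v\n";
    $code .= "$k=>'$v',";
}

cmpthese(-3, {
    # Parse the simple text format with split.
    tsv => sub {
        my %hash;
        for (split /\n/, $tsv) {
            my ($k, $v) = split /\t/;
            $hash{$k} = $v;
        }
    },
    # Make perl parse the equivalent hash literal.
    perl => sub {
        my %hash = %{ eval "+{$code}" };
    },
});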

I did quite a bit of benchmarking on this sort of thing for Crypt::XkcdPassword.

Re^3: Text storage/retrieval
by BrowserUk (Patriarch) on Mar 05, 2012 at 14:09 UTC

    Now try it with phrases that can contain spaces, commas, quotes of either form, and even newlines?



      Hence my "at least for simple data". My example uses tabs and newlines as delimiters, so those characters cannot appear in the data.

      That said, using something like JSON::XS you can get even faster than either of the previous two examples. And of course JSON gives you escaping, multiline strings, etc.

      use strict;
      use JSON;   # the JSON module uses JSON::XS under the hood when it's installed

      # "local $/ = <DATA>" localizes $/ (making it undef) before <DATA>
      # is read, so the whole DATA section is slurped in one go.
      my %hash = %{ from_json(do { local $/ = <DATA> }) };

      __DATA__
      {
          "440035528809":"6946395707444",
          "332679554392":"162874763688655",
          "913537320343":"56726180700920"
      }
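      As a quick illustration of that escaping (with a made-up key and value, not part of the benchmark), embedded quotes, commas, tabs and newlines all round-trip cleanly:

      use strict;
      use JSON;

      my %hash = %{ from_json(do { local $/ = <DATA> }) };
      print $hash{greeting};   # prints the string, real newline and tab included

      __DATA__
      { "greeting": "She said: \"hello,\nworld\" - and a tab:\there." }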

      With a hash of 500_000 entries, I get:

      Standard Perl hash...
      11.04user 0.31system 0:11.58elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (0major+41519minor)pagefaults 0swaps
      Reading TSV from __DATA__...
      6.15user 0.14system 0:06.38elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (0major+13860minor)pagefaults 0swaps
      Reading JSON from __DATA__...
      4.25user 0.26system 0:04.64elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (0major+38709minor)pagefaults 0swaps
      

      Of course, loading the JSON module introduces some overhead, so on smaller datasets the other techniques beat it. With a hash of 1000 entries, I get:

      Standard Perl hash...
      0.03user 0.00system 0:00.04elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (0major+629minor)pagefaults 0swaps
      Reading TSV from __DATA__...
      0.01user 0.00system 0:00.02elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (0major+566minor)pagefaults 0swaps
      Reading JSON from __DATA__...
      0.10user 0.00system 0:00.11elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (0major+871minor)pagefaults 0swaps
      

      It seems to be at around the 5,000-entry mark that JSON::XS starts winning over a hard-coded Perl hash, and at around 12,000 entries it starts winning over tab-delimited data.
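      For completeness, here's a sketch of how the generator script above might be extended to also write a json.pl variant (the file name and layout are my assumptions, not the exact script I used):

      use 5.010;

      open my $json, '>', 'json.pl';

      # Header of the generated script; its DATA section holds one JSON object.
      print $json "use strict;\n";
      print $json "use JSON;\n";
      print $json 'my %hash = %{ from_json(do { local $/ = <DATA> }) };', "\n";
      print $json "__DATA__\n";
      print $json "{\n";

      my @pairs;
      for (0 .. 100_000) {
          my $k = int rand 1_000_000_000_000;
          my $v = int rand 1_000_000_000_000_000;
          push @pairs, qq{"$k":"$v"};
      }

      # JSON forbids a trailing comma, so join the pairs rather than
      # printing a comma after each one.
      print $json join(",\n", @pairs), "\n}\n";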