
Why go through the process of having to convert text to a hash at runtime, every time, and load all languages for every run?

Putting each language in its own file would certainly be an improvement - I grant you that.

However, converting text to the hash at run time is actually *faster* than hardcoding the hash, at least for simple data.

Yes, that's right. This:

my %hash;
while (<DATA>) {
    chomp;   # strip the newline so it doesn't end up in $v
    my ($k, $v) = split /\t/o;
    $hash{$k} = $v;
}
__DATA__
440035528809	6946395707444
332679554392	162874763688655
913537320343	56726180700920

is faster than this:

my %hash = (
    440035528809 => '6946395707444',
    332679554392 => '162874763688655',
    913537320343 => '56726180700920',
);

Or at least it is once you've got more than a few hundred entries in the hash.

It seems counter-intuitive, but it makes sense when you think about it. In the first example we're parsing a very simple text format using Perl (and Perl is very fast at text handling!); in the second we're parsing a programming language using C.

If you're interested in benchmarks, the following script generates two Perl scripts called perl.pl and data.pl:

use 5.010;

open my $perl, '>', 'perl.pl';
open my $data, '>', 'data.pl';

print $perl <<'CODE';
use strict;
my %hash = (
CODE

print $data <<'CODE';
use strict;
my %hash;
while (<DATA>) {
    my ($k, $v) = split /\t/o;
    $hash{$k} = $v;
}
__DATA__
CODE

for (0 .. 100_000) {
    my $k = int rand 1_000_000_000_000;
    my $v = int rand 1_000_000_000_000_000;
    print $perl "$k=>'$v',\n";
    print $data "$k\t$v\t\n";   # trailing tab keeps the newline out of $v
}

print $perl <<'CODE';
);
CODE

In my tests, data.pl (which reads data from __DATA__) is about 40% faster than perl.pl.
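For a rough in-process comparison that skips generating files, here's a sketch using the core Benchmark module (the 10_000-pair size is arbitrary, and it times a string eval of the hash literal rather than perl compiling a standalone script, so it only approximates the timings above):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build the same 10_000 random pairs in both representations:
# tab-separated text, and Perl hash-literal source code.
my ($tsv, $code) = ('', '');
for (1 .. 10_000) {
    my ($k, $v) = (int rand 1_000_000_000_000, int rand 1_000_000_000_000_000);
    $tsv  .= "$k\t$v\n";
    $code .= "$k=>'$v',";
}

cmpthese(-3, {
    # Parse the simple text format with split.
    tsv => sub {
        my %hash;
        for (split /\n/, $tsv) {
            my ($k, $v) = split /\t/;
            $hash{$k} = $v;
        }
    },
    # Make perl parse the equivalent hash literal.
    perl => sub {
        my %hash = %{ eval "+{$code}" };
    },
});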

I did quite a bit of benchmarking on this sort of thing for Crypt::XkcdPassword.

Re^3: Text storage/retrieval
by BrowserUk (Patriarch) on Mar 05, 2012 at 14:09 UTC

    Now try it with phrases that can contain spaces, commas, quotes of either form, and even newlines?



      Hence my "at least for simple data". My example uses tabs and newlines as delimiters, so those characters cannot appear in the data.

      That said, using something like JSON::XS you can get even faster than either of the previous two examples. And of course JSON gives you escaping, multiline strings, etc.

      use strict;
      use JSON;   # the JSON module uses JSON::XS under the hood when it's installed

      # "local $/ = <DATA>" localizes $/ (making it undef) before <DATA>
      # is read, so the whole DATA section is slurped in one go.
      my %hash = %{ from_json(do { local $/ = <DATA> }) };

      __DATA__
      {
          "440035528809":"6946395707444",
          "332679554392":"162874763688655",
          "913537320343":"56726180700920"
      }
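      As a quick illustration of that escaping (with a made-up key and value, not part of the benchmark), embedded quotes, commas, tabs and newlines all round-trip cleanly:

      use strict;
      use JSON;

      my %hash = %{ from_json(do { local $/ = <DATA> }) };
      print $hash{greeting};   # prints the string, real newline and tab included

      __DATA__
      { "greeting": "She said: \"hello,\nworld\" - and a tab:\there." }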

      With a hash of 500_000 entries, I get:

      Standard Perl hash...
      11.04user 0.31system 0:11.58elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (0major+41519minor)pagefaults 0swaps
      Reading TSV from __DATA__...
      6.15user 0.14system 0:06.38elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (0major+13860minor)pagefaults 0swaps
      Reading JSON from __DATA__...
      4.25user 0.26system 0:04.64elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (0major+38709minor)pagefaults 0swaps
      

      Of course, loading the JSON module introduces some overhead, so on smaller datasets the other techniques beat it. With a hash of 1000 entries, I get:

      Standard Perl hash...
      0.03user 0.00system 0:00.04elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (0major+629minor)pagefaults 0swaps
      Reading TSV from __DATA__...
      0.01user 0.00system 0:00.02elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (0major+566minor)pagefaults 0swaps
      Reading JSON from __DATA__...
      0.10user 0.00system 0:00.11elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (0major+871minor)pagefaults 0swaps
      

      It seems to be at around the 5,000-entry mark that JSON::XS starts winning over a hard-coded Perl hash, and at around 12,000 entries it starts winning over tab-delimited data.
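      For completeness, here's a sketch of how the generator script above might be extended to also write a json.pl variant (the file name and layout are my assumptions, not the exact script I used):

      use 5.010;

      open my $json, '>', 'json.pl';

      # Header of the generated script; its DATA section holds one JSON object.
      print $json "use strict;\n";
      print $json "use JSON;\n";
      print $json 'my %hash = %{ from_json(do { local $/ = <DATA> }) };', "\n";
      print $json "__DATA__\n";
      print $json "{\n";

      my @pairs;
      for (0 .. 100_000) {
          my $k = int rand 1_000_000_000_000;
          my $v = int rand 1_000_000_000_000_000;
          push @pairs, qq{"$k":"$v"};
      }

      # JSON forbids a trailing comma, so join the pairs rather than
      # printing a comma after each one.
      print $json join(",\n", @pairs), "\n}\n";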