in reply to Text storage/retrieval

One idea is to use the __DATA__ block to store texts

Why go through the process of converting text to a hash at runtime, every time, and load all languages on every run?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^2: Text storage/retrieval
by tobyink (Canon) on Mar 05, 2012 at 13:39 UTC

    Why go through the process of converting text to a hash at runtime, every time, and load all languages on every run?

    Putting each language in its own file would certainly be an improvement - I grant you that.

    However, converting text to the hash at run time is actually *faster* than hardcoding the hash, at least for simple data.

    Yes, that's right. This:

    my %hash;
    while (<DATA>) {
        chomp;                          # strip the newline, or it ends up in the value
        my ($k, $v) = split /\t/;
        $hash{$k} = $v;
    }
    __DATA__
    440035528809	6946395707444
    332679554392	162874763688655
    913537320343	56726180700920

    is faster than this:

    my %hash = (
        440035528809 => '6946395707444',
        332679554392 => '162874763688655',
        913537320343 => '56726180700920',
    );

    Or at least it is once you've got more than a few hundred entries in the hash.

    It seems counter-intuitive, but it makes sense when you think about it. In the first example we're parsing a very simple text format using Perl (and Perl is very fast at text handling!); in the second we're parsing a programming language using C.

    I did quite a bit of benchmarking on this sort of thing for Crypt::XkcdPassword.
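    For a rough sense of how one might measure this, here is a minimal sketch using the core Benchmark module. The hash size, the generated data, and the use of a string eval as a stand-in for compiling a hard-coded hash are my assumptions, not the harness actually used for Crypt::XkcdPassword:

    ```perl
    # Sketch only: string-eval of Perl source stands in for the cost of
    # compiling a hard-coded hash; parse_tsv mimics the __DATA__ loop above.
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $n    = 10_000;
    my $tsv  = join '', map { "$_\tvalue$_\n" } 1 .. $n;
    my $code = 'my %h = (' . join( ',', map { "$_ => 'value$_'" } 1 .. $n ) . ');';

    cmpthese( -3, {
        parse_tsv => sub {
            my %h;
            for ( split /\n/, $tsv ) {
                my ( $k, $v ) = split /\t/;
                $h{$k} = $v;
            }
        },
        eval_perl => sub { eval $code },
    } );
    ```

    Relative numbers will vary with Perl version and data shape, so treat any single run as indicative only.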

      Now try it with phrases that can contain spaces, commas, quotes of either form, and even newlines.



        Hence my "at least for simple data". My example uses tabs and newlines as delimiters, so those characters cannot appear in the data.

        That said, using something like JSON::XS you can get even faster than either of the previous two examples. And of course JSON gives you escaping, multiline strings, etc.

        use strict;
        use JSON;

        # Slurp the whole __DATA__ section (local $/ disables line-at-a-time reading)
        my %hash = %{ from_json( do { local $/; <DATA> } ) };

        __DATA__
        {
            "440035528809":"6946395707444",
            "332679554392":"162874763688655",
            "913537320343":"56726180700920"
        }

        With a hash of 500_000 entries, I get:

        Standard Perl hash...
        11.04user 0.31system 0:11.58elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
        0inputs+0outputs (0major+41519minor)pagefaults 0swaps
        Reading TSV from __DATA__...
        6.15user 0.14system 0:06.38elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
        0inputs+0outputs (0major+13860minor)pagefaults 0swaps
        Reading JSON from __DATA__...
        4.25user 0.26system 0:04.64elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
        0inputs+0outputs (0major+38709minor)pagefaults 0swaps
        

        Of course, loading the JSON module introduces some overhead, so on smaller datasets the other techniques beat it. With a hash of 1000 entries, I get:

        Standard Perl hash...
        0.03user 0.00system 0:00.04elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
        0inputs+0outputs (0major+629minor)pagefaults 0swaps
        Reading TSV from __DATA__...
        0.01user 0.00system 0:00.02elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
        0inputs+0outputs (0major+566minor)pagefaults 0swaps
        Reading JSON from __DATA__...
        0.10user 0.00system 0:00.11elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
        0inputs+0outputs (0major+871minor)pagefaults 0swaps
        

        It seems to be at around the 5000 hash entry mark that JSON::XS starts winning over a hard-coded Perl hash, and around 12000 hash entries it starts winning over tab-delimited data.

Re^2: Text storage/retrieval
by DreamT (Pilgrim) on Mar 05, 2012 at 13:32 UTC
    One aspect is maintainability - it would be great if the data could be stored in CSV files or such. Any ideas on that?

      I see little difference in maintainability between:

      (
          "The quick",
          "brown fox",
          "jumps over",
          "the lazy",
          "dog",
      );

      And:

      "The quick",
      "brown fox",
      "jumps over",
      "the lazy",
      "dog"

      But if you do, you could do the same thing -- put each language into a separate csv file -- and do:

      my @text = someCSVparser( "$lang.csv" );
      ...

      It'll be slower, but for 1500 strings, probably not enough to worry about.
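      One concrete way to flesh out the hypothetical someCSVparser() above is with the CPAN module Text::CSV; the file layout (one or more quoted strings per record) is an assumption:

      ```perl
      use strict;
      use warnings;
      use Text::CSV;

      # Assumed implementation of the someCSVparser() sketched above.
      sub someCSVparser {
          my ($file) = @_;
          my $csv = Text::CSV->new( { binary => 1 } )
              or die "Cannot use Text::CSV: " . Text::CSV->error_diag;
          open my $fh, '<:encoding(UTF-8)', $file or die "$file: $!";
          my @text;
          while ( my $row = $csv->getline($fh) ) {
              push @text, @$row;      # flatten each record's fields
          }
          close $fh;
          return @text;
      }
      ```

      Text::CSV handles embedded commas, quotes, and (with binary => 1) embedded newlines, which is exactly the hard case raised earlier in the thread.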

      If performance is a concern -- as it seemed from your OP -- then you could store the texts in .csv files and use an offline process to create the Storable form from them whenever they change. That also ensures that if the Storable format should ever change in incompatible ways -- it has happened in the past -- you have the sources to fall back on.
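      A minimal sketch of that offline step, using the core Storable module (the file names, and reuse of the hypothetical someCSVparser() from above, are illustrative):

      ```perl
      use strict;
      use warnings;
      use Storable qw(nstore retrieve);

      # Offline, whenever the CSV source changes:
      # parse the CSV once and write a portable Storable image.
      my @text = someCSVparser('en.csv');    # hypothetical parser from above
      nstore( \@text, 'en.bin' );

      # At runtime: a single fast retrieve(), no CSV parsing needed.
      my $text = retrieve('en.bin');
      print $text->[0], "\n";
      ```

      nstore() writes in network byte order, so the image is portable across machines; plain store() is slightly faster if the writer and reader are the same box.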

