in reply to Re^2: Text storage/retrieval
in thread Text storage/retrieval

Now try it with phrases that can contain spaces, commas, quotes of either form, and even newlines?
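For instance, a sketch of the problem: any format that uses tabs or newlines as bare delimiters can't hold a value like this without escaping:

    use strict;
    # A value containing the characters in question; written out as
    # naive tab/newline separated text, the record silently splits.
    my %hash = (
        440035528809 => "a phrase, with 'single' and \"double\" quotes\tand\na newline",
    );
    print "$_\t$hash{$_}\n" for keys %hash;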


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use every day'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^4: Text storage/retrieval
by tobyink (Canon) on Mar 05, 2012 at 14:54 UTC

    Hence my "at least for simple data". My example uses tabs and newlines as delimiters, so those characters cannot appear in the data.
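    (That tab-separated variant isn't repeated here; roughly, and only as a sketch of its shape, it amounts to one key<TAB>value pair per line:)

    use strict;
    # Sketch of a tab-separated loader: one key<TAB>value pair per line
    # in the DATA section, split on the first tab.
    my %hash = map { chomp; split /\t/, $_, 2 } <DATA>;
    __DATA__
    440035528809	6946395707444
    332679554392	162874763688655
    913537320343	56726180700920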

    That said, with something like JSON::XS you can load the data even faster than either of the previous two examples. And of course JSON gives you escaping, multi-line strings, etc.

    use strict;
    use JSON;

    # Slurp the JSON from the DATA section and decode it straight into a hash.
    my %hash = %{ from_json(do { local $/; <DATA> }) };

    __DATA__
    {
        "440035528809":"6946395707444",
        "332679554392":"162874763688655",
        "913537320343":"56726180700920"
    }
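    As a quick check of the escaping claim, here's a round-trip sketch (the value is made up) with an embedded tab, both kinds of quotes and a newline:

    use strict;
    use JSON;

    # Round-trip a value containing a tab, both quote styles and a newline;
    # JSON escapes them on encode and restores them on decode.
    my %hash = (
        phrase => "tabs\there, 'single' and \"double\" quotes,\nand a newline",
    );
    my $json = to_json( \%hash );
    my %back = %{ from_json($json) };
    print $back{phrase} eq $hash{phrase} ? "round-trip ok\n" : "mismatch\n";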

    With a hash of 500_000 entries, I get:

    Standard Perl hash...
    11.04user 0.31system 0:11.58elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+41519minor)pagefaults 0swaps
    Reading TSV from __DATA__...
    6.15user 0.14system 0:06.38elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+13860minor)pagefaults 0swaps
    Reading JSON from __DATA__...
    4.25user 0.26system 0:04.64elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+38709minor)pagefaults 0swaps
    

    Of course, loading the JSON module introduces some overhead, so on smaller datasets the other techniques beat it. With a hash of 1000 entries, I get:

    Standard Perl hash...
    0.03user 0.00system 0:00.04elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+629minor)pagefaults 0swaps
    Reading TSV from __DATA__...
    0.01user 0.00system 0:00.02elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+566minor)pagefaults 0swaps
    Reading JSON from __DATA__...
    0.10user 0.00system 0:00.11elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+871minor)pagefaults 0swaps
    

    It seems to be at around the 5,000-entry mark that JSON::XS starts winning over a hard-coded Perl hash, and at around 12,000 entries that it starts winning over tab-delimited data.
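    For anyone wanting to reproduce the numbers, the actual harness isn't shown here, but a rough generator along these lines (json_test.pl is just a made-up name) will build the JSON test case for a given N:

    use strict;
    # Hypothetical generator: writes a script with N random key/value
    # pairs in its DATA section, ready to be run under time(1).
    my $n = shift || 500_000;
    open my $out, '>', 'json_test.pl' or die $!;
    print {$out} "use strict;\n", "use JSON;\n",
        'my %hash = %{ from_json(do { local $/; <DATA> }) };', "\n",
        "__DATA__\n{\n";
    print {$out} join( ",\n",
        map { sprintf '"%012d":"%d"', int rand 1e12, int rand 1e15 } 1 .. $n ),
        "\n}\n";
    close $out or die $!;

    Then something like "time perl json_test.pl" should produce figures of the same shape as those above.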