in reply to Text storage/retrieval

One idea is to use the __DATA__ block to store texts

Why go through the process of converting text to a hash at runtime, every time, and load all languages on every run?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^2: Text storage/retrieval
by tobyink (Canon) on Mar 05, 2012 at 13:39 UTC

    Why go through the process of converting text to a hash at runtime, every time, and load all languages on every run?

    Putting each language in its own file would certainly be an improvement - I grant you that.

    However, converting text to the hash at run time is actually *faster* than hardcoding the hash, at least for simple data.

    Yes, that's right. This:

    my %hash;
    while (<DATA>) {
        chomp;                          # strip the newline, or it ends up in the value
        my ($k, $v) = split /\t/;
        $hash{$k} = $v;
    }
    __DATA__
    440035528809	6946395707444
    332679554392	162874763688655
    913537320343	56726180700920

    is faster than this:

    my %hash = (
        440035528809 => '6946395707444',
        332679554392 => '162874763688655',
        913537320343 => '56726180700920',
    );

    Or at least it is once you've got more than a few hundred entries in the hash.

    It seems counter-intuitive, but it makes sense when you think about it. In the first example we're parsing a very simple text format using Perl (and Perl is very fast at text handling!); in the second we're parsing a programming language using C.

    I did quite a bit of benchmarking on this sort of thing for Crypt::XkcdPassword.
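    For a rough sense of how one might measure this, here is a minimal sketch using the core Benchmark module. The hash size, the generated data, and the use of a string eval as a stand-in for compiling a hard-coded hash are my assumptions, not the harness actually used for Crypt::XkcdPassword:

    ```perl
    # Sketch only: string-eval of Perl source stands in for the cost of
    # compiling a hard-coded hash; parse_tsv mimics the __DATA__ loop above.
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $n    = 10_000;
    my $tsv  = join '', map { "$_\tvalue$_\n" } 1 .. $n;
    my $code = 'my %h = (' . join( ',', map { "$_ => 'value$_'" } 1 .. $n ) . ');';

    cmpthese( -3, {
        parse_tsv => sub {
            my %h;
            for ( split /\n/, $tsv ) {
                my ( $k, $v ) = split /\t/;
                $h{$k} = $v;
            }
        },
        eval_perl => sub { eval $code },
    } );
    ```

    Relative numbers will vary with Perl version and data shape, so treat any single run as indicative only.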

      Now try it with phrases that can contain spaces, commas, quotes of either form, and even newlines.



        Hence my "at least for simple data". My example uses tabs and newlines as delimiters, so those characters cannot appear in the data.

        That said, using something like JSON::XS you can get even faster than either of the previous two examples. And of course JSON gives you escaping, multiline strings, etc.

        use strict;
        use JSON;

        # Slurp the whole __DATA__ section (local $/ disables line-at-a-time reading)
        my %hash = %{ from_json( do { local $/; <DATA> } ) };

        __DATA__
        {
            "440035528809":"6946395707444",
            "332679554392":"162874763688655",
            "913537320343":"56726180700920"
        }

        With a hash of 500_000 entries, I get:

        Standard Perl hash...
        11.04user 0.31system 0:11.58elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
        0inputs+0outputs (0major+41519minor)pagefaults 0swaps
        Reading TSV from __DATA__...
        6.15user 0.14system 0:06.38elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
        0inputs+0outputs (0major+13860minor)pagefaults 0swaps
        Reading JSON from __DATA__...
        4.25user 0.26system 0:04.64elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
        0inputs+0outputs (0major+38709minor)pagefaults 0swaps
        

        Of course, loading the JSON module introduces some overhead, so on smaller datasets the other techniques beat it. With a hash of 1000 entries, I get:

        Standard Perl hash...
        0.03user 0.00system 0:00.04elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
        0inputs+0outputs (0major+629minor)pagefaults 0swaps
        Reading TSV from __DATA__...
        0.01user 0.00system 0:00.02elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
        0inputs+0outputs (0major+566minor)pagefaults 0swaps
        Reading JSON from __DATA__...
        0.10user 0.00system 0:00.11elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
        0inputs+0outputs (0major+871minor)pagefaults 0swaps
        

        It seems to be at around the 5000 hash entry mark that JSON::XS starts winning over a hard-coded Perl hash, and around 12000 hash entries it starts winning over tab-delimited data.

Re^2: Text storage/retrieval
by DreamT (Pilgrim) on Mar 05, 2012 at 13:32 UTC
    One aspect is maintainability - it would be great if the data could be stored in CSV files or such. Any ideas on that?

      I see little difference in maintainability between:

      (
          "The quick",
          "brown fox",
          "jumps over",
          "the lazy",
          "dog",
      );

      And:

      "The quick",
      "brown fox",
      "jumps over",
      "the lazy",
      "dog"

      But if you do, you could do the same thing -- put each language into a separate csv file -- and do:

      my @text = someCSVparser( "$lang.csv" );
      ...

      It'll be slower, but for 1500 strings, probably not enough to worry about.
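      One concrete way to flesh out the hypothetical someCSVparser() above is with the CPAN module Text::CSV; the file layout (one or more quoted strings per record) is an assumption:

      ```perl
      use strict;
      use warnings;
      use Text::CSV;

      # Assumed implementation of the someCSVparser() sketched above.
      sub someCSVparser {
          my ($file) = @_;
          my $csv = Text::CSV->new( { binary => 1 } )
              or die "Cannot use Text::CSV: " . Text::CSV->error_diag;
          open my $fh, '<:encoding(UTF-8)', $file or die "$file: $!";
          my @text;
          while ( my $row = $csv->getline($fh) ) {
              push @text, @$row;      # flatten each record's fields
          }
          close $fh;
          return @text;
      }
      ```

      Text::CSV handles embedded commas, quotes, and (with binary => 1) embedded newlines, which is exactly the hard case raised earlier in the thread.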

      If performance is a concern -- as it seemed from your OP -- then you could store the texts in .csv files and use an offline process to create the Storable form from them whenever they change. That also ensures that if the Storable format should ever change in incompatible ways -- it has happened in the past -- you have the sources to fall back on.
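      A minimal sketch of that offline step, using the core Storable module (the file names, and reuse of the hypothetical someCSVparser() from above, are illustrative):

      ```perl
      use strict;
      use warnings;
      use Storable qw(nstore retrieve);

      # Offline, whenever the CSV source changes:
      # parse the CSV once and write a portable Storable image.
      my @text = someCSVparser('en.csv');    # hypothetical parser from above
      nstore( \@text, 'en.bin' );

      # At runtime: a single fast retrieve(), no CSV parsing needed.
      my $text = retrieve('en.bin');
      print $text->[0], "\n";
      ```

      nstore() writes in network byte order, so the image is portable across machines; plain store() is slightly faster if the writer and reader are the same box.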

