in reply to Reaped: a large text file into hash

As others have pointed out, and as I tried to bring to your attention in your previous thread, you are simply generating too much data to hope to be able to load it all in memory in a 32-bit process.

In a trivial experiment I conducted before responding to your first thread, I generated a 100MB file consisting of 2 million lines of 'phrases' generated randomly from a dictionary. I then counted the (1-4) n-grams and measured the memory used to hold them in a hash. I used a simple compression algorithm, and it still required 2GB of RAM. I repeated the exercise for a 150MB/3 million line file, and it took 3GB.

C:\test>head -n 2m phrases.txt > 884345.dat

C:\test>884345-buk 884345.dat
words 178691 ngrams 13962318
perl.exe    4564 Console    1    2,102,076 K

C:\test>head -n 3m phrases.txt > 884345.dat

C:\test>884345-buk 884345.dat
words 178691 ngrams 20850624
perl.exe    5724 Console    1    3,185,344 K

If this is in any way representative of your data, your 1GB file will consist of ~20 million lines and require 10GB of ram to hash.
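For reference, the counting itself needn't be anything clever. Something along these lines (a rough sketch only, assuming simple whitespace tokenisation; it is not the 884345-buk script shown above) is enough to reproduce that kind of growth:

use strict;
use warnings;

# Count all 1-4 word n-grams from the input file(s) into a single hash.
my %count;
while( my $line = <> ) {
    my @words = split ' ', $line;
    for my $n ( 1 .. 4 ) {
        for my $i ( 0 .. $#words - $n + 1 ) {
            ++$count{ join ' ', @words[ $i .. $i + $n - 1 ] };
        }
    }
}
printf "ngrams: %d\n", scalar keys %count;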

If you are using a 64-bit Perl and a machine with say 16GB of memory, then building an in-memory hash is a viable option.

Otherwise, you will need to use something like BerkeleyDB or a full RDBMS to hold your derived data.
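To give a feel for the RDBMS route, something like this would let you build the table once and query it cheaply later. It is a sketch only; DBD::SQLite is chosen purely for illustration, and the file and table names are made up:

use strict;
use warnings;
use DBI;

# Sketch of the RDBMS route: one row per (ngram, line number) pair.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=ngrams.sqlite', '', '',
                        { RaiseError => 1, AutoCommit => 0 } );
$dbh->do( 'CREATE TABLE IF NOT EXISTS ngrams ( ngram TEXT, line_no INTEGER )' );

my $ins = $dbh->prepare( 'INSERT INTO ngrams ( ngram, line_no ) VALUES ( ?, ? )' );
$ins->execute( 'some ngram', 42 );    # repeat for every ngram/line pair
$dbh->commit;

# Later, from any process: fetch the line numbers for a given ngram.
my $lines = $dbh->selectcol_arrayref(
    'SELECT line_no FROM ngrams WHERE ngram = ?', undef, 'some ngram' );
print "@$lines\n";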

But the missing information from both your threads is how you are going to use this data. If this is one file that will be hashed once, or once in a blue moon, with the hash being re-used many times by long-running processes, then building the hash and storing it on disk in Storable format may be the way to go.
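In outline, that is just Storable's store and retrieve. This is a sketch; the file name is made up, and the hash here is toy data standing in for your ngram => [ line numbers ] structure:

use strict;
use warnings;
use Storable qw( nstore retrieve );

# Build the hash once ( ngram => [ line numbers ] ); toy data for illustration.
my %hash = ( 'some ngram' => [ 1, 5, 9 ] );

# One-off build run: write it to disk ...
nstore( \%hash, 'ngrams.stor' );

# ... later, the long-running process reads it straight back.
my $href = retrieve( 'ngrams.stor' );
print "@{ $href->{ 'some ngram' } }\n";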

On the other hand, if the hashed data is going to be used by lots of short-lived processes (e.g. web pages), then the load time for a 10GB hash would be prohibitive.

If you need to repeat the hashing process on many different large documents and will only use the hash to generate a few statistics for each, then a multi-pass batch processing chain probably makes more sense.

Finally, if the process must be repeated many times; and you have a pool of servers at your disposal, or are prepared to purchase time on (say) Amazon's EC2, then tilly's map/reduce suggestion makes a lot of sense.

As is often the case with such questions, picking the 'best' solution is very much dependent upon having good information about how the resultant data will be used.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: a large text file into hash
by perl_lover_always (Acolyte) on Jan 28, 2011 at 10:33 UTC
    Good observations, and thanks! In fact, I need to create it once and then I will access it many times, so this one-time processing takes a lot of time for me; however, it is important that it can be accessed easily and quickly later. I can have a big memory of 50 GB, but even that is not enough and it goes out of memory! I tried to create the hash and then tie it, but that was also not possible for me, or maybe I am making some error, because it still goes out of memory!
    my $t = tie( %hash, 'Tie::IxHash' );
    foreach my $line ( @file ) {
        $line_count++;
        my @ngrams = produce_ngrams( $line );
        foreach my $ngram ( @ngrams ) {
            #$t->Push( @{ $hash{$ngram} } => $line_count );
            push( @{ $hash{$ngram} }, $line_count );
        }
    }
    I also have no idea how, if I tie the hash, I can later access it from my hard drive.
      I tried to create the hash and then tie it,

      Tie::IxHash a) doesn't store to disk; b) uses 2 or 3 times as much memory as a standard hash. Its purpose is to remember the order in which the keys of the hash were added, which is unnecessary for your use. You should not be using this module.

      If you are going the tie'd hash route, then you need to use a module that ties the hash to a disk file. Previously I'd have recommended BerkeleyDB, but since Oracle grabbed Sun, you have to sign up and agree to let them do whatever they want before they'll let you download anything.

      There are alternatives but I don't have much experience of them, so I cannot make a recommendation.
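      Purely for illustration (not a recommendation), a disk-tied hash might look something like this with DB_File, which ships with many Perls and uses the old Berkeley DB 1.x library. Note that DB_File values are plain strings, so the line numbers have to be packed into the value (here as a space-separated string) rather than stored as an array reference; the file name is made up:

      use strict;
      use warnings;
      use Fcntl;
      use DB_File;

      # Tie the hash to a file on disk instead of holding it all in memory.
      my %hash;
      tie %hash, 'DB_File', 'ngrams.dbm', O_RDWR|O_CREAT, 0666, $DB_HASH
          or die "Cannot tie ngrams.dbm: $!";

      my $key = 'some ngram';
      $hash{ $key } = defined $hash{ $key } ? "$hash{ $key } 42" : '42';   # append a line number
      my @lines = split ' ', $hash{ $key };                                # read them back

      untie %hash;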

      But, if you have 50GB of ram available, then you ought to be able to hash your 1 GB file in memory with ease.


        Well, since I make the hash once and want to use it several times later, I'd prefer to keep it on the hard disk for later access. What is your suggestion in this scenario?