Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
I apologize if this is an utterly clueless question, but I'm a newbie so here goes:
All I want to do is be able to correlate two entries in a very large text file. For instance, the file is of the format:
    12    4433
    13    4433
    14    4476
    15    4477
    16    4477
...and so on, with tabs separating the columns. I'm going to be accessing this array hundreds of thousands of times. Now the approach that seemed obvious to me was just this:
    while (defined(my $input = $handle->getline)) {
        chomp $input;
        my @temp = split /\t/, $input;   # field 1 is the key, field 2 the value
        $index[$temp[0]] = $temp[1];
    }
Which is great: now we have $index[field 1 value] = field 2 value, which is what I wanted, and it works fine for the original 40 MB text file.
The problem comes with the new text file, which is almost 400 MB (many millions of entries). My system slows to a crawl and eventually seizes up. I assume this is a memory issue (I have only 256 MB of physical RAM due to a RAM blowout, plus another 500 MB of swap), but I'm not sure.
Is there a better way to do this kind of thing for a huge text file like this?
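One way to sidestep the RAM limit is to keep the index on disk rather than in a Perl array, for example by tying a hash to a Berkeley DB file with DB_File. Below is a minimal sketch of that idea, not code from this thread; the file names 'data.txt' and 'index.db' are placeholders:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie a hash to an on-disk Berkeley DB file, so the index never
    # has to fit in RAM. File names here are placeholders.
    tie my %index, 'DB_File', 'index.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot tie index.db: $!";

    open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line;
        $index{$key} = $value;      # written straight to disk
    }
    close $fh;

    print $index{14}, "\n";         # lookups work like a normal hash
    untie %index;

Building the DB file is a one-time cost; after that, each lookup is a couple of disk seeks instead of hundreds of megabytes of resident memory.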
Replies are listed 'Best First'.
Re: Very large text file - simple indexing
by zby (Vicar) on Apr 09, 2003 at 20:13 UTC
by diotalevi (Canon) on Apr 09, 2003 at 20:43 UTC
Re: Very large text file - simple indexing
by BrowserUk (Patriarch) on Apr 09, 2003 at 20:52 UTC
Re: Very large text file - simple indexing
by waxmop (Beadle) on Apr 09, 2003 at 20:06 UTC
Re: Very large text file - simple indexing (seek)
by tye (Sage) on Apr 10, 2003 at 16:31 UTC
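tye's title above points at a seek-based approach. Since the reply body isn't reproduced here, the following is only a rough sketch of that general idea, not tye's actual code: assuming keys are dense non-negative integers and values fit in 32 bits (both true of the sample data), build a fixed-width binary index once, then answer each query with one seek and one 4-byte read. The file names are placeholders.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # One-time build: store the value for key N as a 4-byte
    # big-endian integer at byte offset N * 4.
    open my $in,  '<',  'data.txt'   or die "data.txt: $!";
    open my $idx, '+>', 'data.index' or die "data.index: $!";
    binmode $idx;

    while (my $line = <$in>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line;
        seek $idx, $key * 4, 0 or die "seek: $!";
        print {$idx} pack('N', $value);
    }
    close $in;

    # Each lookup is one seek plus one 4-byte read, so memory use
    # stays constant no matter how large the data file grows.
    sub lookup {
        my ($key) = @_;
        seek $idx, $key * 4, 0 or die "seek: $!";
        read($idx, my $buf, 4) == 4 or return undef;
        return unpack 'N', $buf;
    }

    print lookup(14), "\n";    # prints 4476 for the sample data

One caveat of this layout: a missing key reads back as 0, because unwritten gaps in the index file are NUL-filled; a real implementation would reserve a sentinel value or store a presence flag alongside each entry.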