in reply to fast lookups in files
What would be the best way (defining best as the ratio between efficiency / simplicity of implementation) of doing this?
Is making a DB with the file or writing C subroutines to access the file the only way of improving this task?
The binary-search approach explained elsewhere in the thread seems reasonable, although I would be hesitant to go that route myself. If even loading an index swamps your machine, then it seems we are talking about serious quantities of data.
On that criterion alone, I would load it into a relational database and be done with it. At least then you don't have to ensure that the file is always sorted, which is one less thing to worry about when adding new keys.
Without presuming to know what you need this for, I can't help but think that if it were my baby, sooner or later people would start asking me questions like "how many keys have been used between 215000 and 220000?" In that case the binary-search algorithm doesn't buy you anything, but you get all that for free in a database.
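To make that concrete, here is a minimal sketch using DBI with SQLite; the database file, table, and column names (pairs.db, pairs, k, v) are placeholders of my own, not anything from your post:

    use strict;
    use warnings;
    use DBI;

    # Hypothetical schema: one table of key/value pairs in an SQLite file.
    my $dbh = DBI->connect('dbi:SQLite:dbname=pairs.db', '', '',
        { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS pairs (k INTEGER PRIMARY KEY, v INTEGER)');

    # Point lookup for a single key.
    my ($val) = $dbh->selectrow_array(
        'SELECT v FROM pairs WHERE k = ?', undef, 217042);

    # And the range question comes for free.
    my ($count) = $dbh->selectrow_array(
        'SELECT COUNT(*) FROM pairs WHERE k BETWEEN 215000 AND 220000');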
Be that as it may, if you really don't want to go the DB route, you can always apply a divide-and-conquer strategy to your file: take the first two digits of the key (or the last two, or the first three, or several levels of them), and write the key/value pairs out to a correspondingly named file.
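A rough sketch of that partitioning pass, assuming tab-separated key/value lines and bucket files named after the first two digits of the key (the file names are mine, for illustration only):

    use strict;
    use warnings;

    # Split a large tab-separated key/value file into bucket files,
    # one per leading two digits of the key (e.g. key 215042 lands in 21.dat).
    my %out;
    open my $in, '<', 'pairs.dat' or die "pairs.dat: $!";
    while (my $line = <$in>) {
        my ($key) = split /\t/, $line, 2;
        my $bucket = substr($key, 0, 2);
        unless ($out{$bucket}) {
            open $out{$bucket}, '>>', "$bucket.dat" or die "$bucket.dat: $!";
        }
        print { $out{$bucket} } $line;
    }
    close $_ for $in, values %out;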
When you need to look up a key, figure out what file it would be located in, then slurp the entire file and do a pattern match with
my ($val) = ($_ =~ /\b$key\t(\d+)\b/);
That is, don't even bother reading the file line by line; that's too slow. Again, this is a clumsy approach: if your lookups are scattered all over the search space, you'll spend all your time sucking up files. Then you start thinking about a cache of least-recently-used files... nah, just put it in a database and be done with it.
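For what it's worth, a sketch of that lookup with the same bucket naming as in the partitioning example above (again, the names are mine):

    use strict;
    use warnings;

    # Look up a key by slurping only the bucket file it must live in,
    # then matching against the whole contents in one pass.
    sub lookup {
        my ($key) = @_;
        my $bucket = substr($key, 0, 2);
        open my $fh, '<', "$bucket.dat" or return;    # no bucket file, no key
        my $contents = do { local $/; <$fh> };        # slurp, don't loop
        close $fh;
        my ($val) = ($contents =~ /\b$key\t(\d+)\b/);
        return $val;
    }

    my $val = lookup(215042);
    print defined $val ? $val : 'not found', "\n";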
• another intruder with the mooring in the heart of the Perl