Re: fast lookups in files
by Limbic~Region (Chancellor) on Feb 05, 2008 at 13:23 UTC
citromatik,
As has been mentioned, a binary search is probably your best bet if you are not going to change how your data is stored.
You already mentioned using a database, which is a fine solution. It need not be heavy - I have used DBD::SQLite for problems like this. It would take less time to set up and get working than you have already spent thinking about the problem.
FYI for the future - if keys were strings of letters instead of numbers, breaking the file into 26 smaller files based on starting letter and then doing a binary search would be even better. It may work here too but it would take a bit of analysis to know for sure. You would need to figure out a good way to distribute your keys between N files and a simple operation that could determine which file your desired key was in.
Update: As an afterthought, depending on how sparse your keys are, it may make sense to transform your file into fixed-width records. You determine how many bytes the largest value needs and make every record that size. You no longer need to store keys in the file, because the record number equates to the key, so a lookup becomes a single seek. Unfortunately, you need to pad missing records, which may grow the file to an unmanageable size depending on how sparse things are. Something to think about.
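A rough sketch of what the lookup would then look like, assuming (hypothetically) 10-byte records holding just the space-padded value, with blanks for missing keys; the record width and file name are made up:

use strict;
use warnings;

my $REC_LEN = 10;                              # assumed fixed record width
open my $fh, '<', 'values.fixed' or die "open: $!";

sub value_for {
    my $key = shift;
    seek $fh, $key * $REC_LEN, 0 or return;    # jump straight to record $key
    read $fh, my $rec, $REC_LEN or return;     # nothing there => no such key
    $rec =~ s/\s+//g;                          # strip the padding
    return length $rec ? $rec : undef;         # all-blank record => missing key
}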
Re: fast lookups in files
by grinder (Bishop) on Feb 05, 2008 at 13:24 UTC
What would be the best way (defining best as the ratio between efficiency / simplicity of implementation) of doing this?
Is making a DB with the file or writing C subroutines to access the file the only way of improving this task?
The binary-search approach explained elsewhere in the thread seems reasonable, although I would be hesitant to go that route myself. If even loading an index swamps your machine, then it seems we are talking about serious quantities of data.
On that criterion alone, I would load it into a relational database and be done with it. At least then you don't have to ensure that the file is always sorted, which is one less thing to worry about when adding new keys.
Without presuming to know what you need this for, I can't help thinking that if it were my baby, sooner or later people would ask me questions like "how many keys have been used between 215000 and 220000?". In that case the binary search algorithm doesn't buy you anything, but you get all that for free in a database.
Be that as it may, if you really don't want to go the DB route, you can always apply a divide-and-conquer strategy to your file: take the first two digits of the key (or the last two, or three, or a multi-level scheme), and write the key/value pairs out to a correspondingly named file.
When you need to look up a key, figure out which file it would be located in, then slurp the entire file and do a pattern match with
my ($val) = ($_ =~ /\b$key\t(\d+)\b/);
That is, don't even bother reading the file line by line, that's too slow. Again, this is a clumsy approach: if your lookups are all over the search space, you'll spend all your time sucking up files. Then you start thinking about a cache of least-recently used files... nah, just put it in a database and be done with it.
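If you do go that route anyway, the whole thing is only a couple of subs. A rough, untested sketch, assuming a first-two-digits scheme, tab-separated lines and made-up file names:

use strict;
use warnings;

# One-off split: route each line to a bucket file named after the first
# two digits of its key (single-digit keys simply get a one-digit bucket).
sub split_into_buckets {
    my ( $infile, $dir ) = @_;
    my %out;
    open my $in, '<', $infile or die "open $infile: $!";
    while (<$in>) {
        my ($key) = /^(\d+)/ or next;
        my $bucket = substr $key, 0, 2;
        $out{$bucket} ||= do {
            open my $fh, '>', "$dir/bucket.$bucket" or die "open: $!";
            $fh;
        };
        print { $out{$bucket} } $_;
    }
}

# Lookup: slurp just the bucket the key must live in and pattern-match it.
sub lookup {
    my ( $dir, $key ) = @_;
    my $bucket = substr $key, 0, 2;
    open my $fh, '<', "$dir/bucket.$bucket" or return;
    local $/;                                  # slurp mode
    my $chunk = <$fh>;
    my ($val) = $chunk =~ /^$key\t(\d+)$/m;
    return $val;
}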
• another intruder with the mooring in the heart of the Perl
Re: fast lookups in files
by roboticus (Chancellor) on Feb 05, 2008 at 12:31 UTC
citromatik:
A couple of methods you could try:
- Scan the file and build a hash of (key, file position) pairs so you can immediately seek to the line you want (see the sketch below). However, this requires a complete scan through the file (the first time).
- Alternatively, sort the file and use a binary search to locate the line. This way, you'd need roughly log2(line_count) seek/reads to get the data you're looking for.
In either case, you'll probably want to refer to the functions:
perldoc -f seek
perldoc -f tell
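A minimal sketch of the first approach (a key => byte-offset index built in one pass, then one seek per lookup), assuming tab-separated lines; note the reply below about whether the index itself fits in memory:

use strict;
use warnings;

open my $fh, '<', 'data.txt' or die "open: $!";

# One full pass: remember where each line starts.
my %offset;
my $pos = 0;
while ( my $line = <$fh> ) {
    my ($key) = split /\t/, $line, 2;
    $offset{$key} = $pos;
    $pos = tell $fh;
}

# Afterwards a lookup is a single seek plus one read.
sub lookup {
    my $key = shift;
    return unless exists $offset{$key};
    seek $fh, $offset{$key}, 0 or die "seek: $!";
    my ( undef, $value ) = split /\t/, scalar <$fh>;
    chomp $value;
    return $value;
}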
...roboticus
Scan the file and build a hash of (key,file position) pairs so you can immediately seek to the line you want.
Well, if a hash of key => value pairs doesn't fit into memory, I'm afraid a hash of key => position pairs will not fit either.
Alternatively, sort the file and use a binary search to locate the line. This way, you'd need roughly log2(line_count) seek/reads to get the data you're looking for.
That would be a better solution (the file is already sorted)
Thanks
citromatik
You could have an array where subscript N indicates key N thousand, and the value of the array element would be the disk offset to look at.
Let's assume you have fixed length records of 30 bytes. Element 21 having value 999990 would mean the first key greater than or equal to 21,000 was stored at disk offset 999,990.
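In code, that bucket index might look something like this (a sketch; the thousand-key bucket size and tab-separated layout are assumptions, and the records need not actually be fixed length if you scan forward line by line inside a bucket):

use strict;
use warnings;

open my $fh, '<', 'data.txt' or die "open: $!";

# $bucket[$n] = byte offset of the first line whose key is >= $n * 1000
my @bucket;
my $pos = 0;
while ( my $line = <$fh> ) {
    my ($key) = split /\t/, $line, 2;
    my $n = int( $key / 1000 );
    $bucket[$n] = $pos if !defined $bucket[$n];
    $pos = tell $fh;
}

sub lookup {
    my $target = shift;
    my $n = int( $target / 1000 );
    return unless defined $bucket[$n];
    seek $fh, $bucket[$n], 0 or die "seek: $!";
    while ( my $line = <$fh> ) {               # scans at most one bucket's worth
        my ( $key, $val ) = split /\t/, $line;
        last if $key > $target;
        if ( $key == $target ) { chomp $val; return $val }
    }
    return;
}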
Having said all that, I'd also suggest using a binary search.
Re: fast lookups in files
by dragonchild (Archbishop) on Feb 05, 2008 at 14:30 UTC
DBM::Deep is an ideal solution for this problem. Building the DBM file might take a little time, but lookups would be pretty quick. And, it's about as simple an implementation as you could want.
use DBM::Deep;

my $db = DBM::Deep->new( $filename );
print "$db->{32}\n"; # prints 14456
My criteria for good software:
- Does it work?
- Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: fast lookups in files
by Skeeve (Parson) on Feb 05, 2008 at 13:35 UTC
As others have stated: binary search is the key to speed! I assume you know about binary search algorithms. I don't think you need to pad the keys to get a fixed record length.
As you already have a sorted file, simply seek to the file's center by dividing its size in bytes by two. Then read one line; it is to be discarded, as it is almost surely incomplete. Take the current file position as your corrected value for the file's center and read the next line to find out whether you hit the correct key, or whether you have to advance into the upper half or go back into the lower half.
At every following step you will again read one line and discard it.
You might miss one line using this approach:
1230 value AAA
1235 value AA
1240 value A
Imagine your lower bound is 1230 and your upper bound is 1240. Seeking to the center of those two might land you in the middle of the second line, at "235 value AA". That partial line gets discarded, and advancing one line hits your upper bound, so the 1235 line is never examined.
In such a case you will have to read all the lines from the lower to the upper bound to check for a hit.
This is just a rough idea I've never tested!
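In code, the idea might look something like this (equally untested; it assumes tab-separated lines with numeric keys, and finishes with a short linear scan precisely to avoid the missed-line case described above):

use strict;
use warnings;

open my $fh, '<', 'data.txt' or die "open: $!";
my $size = -s 'data.txt';

sub lookup {
    my $target = shift;
    my ( $lo, $hi ) = ( 0, $size );            # byte offsets bracketing the key

    # Binary chop on byte offsets until the window is small.
    while ( $hi - $lo > 4096 ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        seek $fh, $mid, 0 or die "seek: $!";
        <$fh>;                                 # discard the (partial) line we landed in
        my $line = <$fh>;
        if ( !defined $line ) { $hi = $mid; next }    # fell off the end: go lower
        my ( $key, $val ) = split /\t/, $line;
        if    ( $key < $target ) { $lo = $mid }
        elsif ( $key > $target ) { $hi = $mid }
        else                     { chomp $val; return $val }
    }

    # Scan the remaining small window line by line, so no line can be missed.
    seek $fh, $lo, 0 or die "seek: $!";
    <$fh> if $lo > 0;                          # partial line, already ruled out
    while ( my $line = <$fh> ) {
        my ( $key, $val ) = split /\t/, $line;
        last if $key > $target;
        if ( $key == $target ) { chomp $val; return $val }
    }
    return;                                    # not found
}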
s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: fast lookups in files
by BrowserUk (Patriarch) on Feb 05, 2008 at 14:04 UTC
- How many lines in your file?
- Over what range are the keys and values spread?
- How many lookups per load?
E.g. is this a web app that does half a dozen lookups per load, or a system app that does thousands per load?
- Is concurrency a requirement?
- How often does the dataset change?
- How fast a lookup are you hoping for?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
How many lines in your file?
18885025
Over what range are the keys and values spread?
keys from 6 to 164664533
values from 0 to 494514
How many lookups per load?
This is a stand-alone system application that does up to thousands of lookups per run (although it could be called for just one lookup). This is an intermediate step in the app; the other steps are well optimized for performance.
Is concurrency a requirement?
In principle, no
How often does the dataset change?
More or less once a month.
How fast a lookup are you hoping for?
As fast as I can get. As I said before, the rest of the script is optimized for speed and memory usage, and it is a bit frustrating to see how the other parts of the program (which seem much more complicated) run very fast, while these lookups slow down the overall process so much.
Thanks,
citromatik
Can you afford about 145 MB of RAM?
If so, rewrite your file in binary, packing each key/value pair with 'NN' (your values go up to 494,514, which is too big for a 16-bit 'S'): 18885025 * 8 / 2**20 ≈ 144 MB.
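The one-off conversion could be as simple as this (a sketch, assuming the text file is tab-separated and already numerically sorted):

use strict;
use warnings;

open my $in,  '<',     'data.txt' or die $!;
open my $out, '>:raw', 'data.bin' or die $!;
while (<$in>) {
    chomp;
    my ( $key, $val ) = split /\t/;
    print {$out} pack 'NN', $key, $val;       # 8 bytes per record, order preserved
}
close $out or die $!;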
Slurp the entire file into a single string and then use a binary chop on it, something along the lines of:
open my $fh, '<:raw', $datafile or die $!;
my $data;
sysread( $fh, $data, -s( $datafile ) ) or die $!;
close $fh;

sub lookup {
    my $target = shift;
    my( $left, $right ) = ( 0, length( $data ) / 8 - 1 );   # record indices, 8 bytes per record
    while( $left <= $right ) {
        my $mid = int( ( $left + $right ) / 2 );
        my( $key, $val ) = unpack 'NN', substr $data, $mid * 8, 8;
        if( $key < $target ) {
            $left = $mid + 1;
        }
        elsif( $key > $target ) {
            $right = $mid - 1;
        }
        else {
            return $val;
        }
    }
    return;    # key not found
}
In a quick test this achieved 12,500 lookups per second. (Let's see them do that with an RDBMS :)
Notes: I would not be surprised if the above contains bugs; it was thrown together. My test data was only 10e6 lines, so expect the full dataset to be a little slower.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: fast lookups in files
by moritz (Cardinal) on Feb 05, 2008 at 12:32 UTC
If key and value have a fixed length (for example, both 32-bit integers), you could write them into a binary file and do a binary search on that.
I think there is also a small Unix tool that does line-based binary search in files, but I can't remember the name. Update: It seems look(1) (on Debian, shipped in the bsdmainutils package) does the trick, but I'm not sure whether it works with numerically sorted lines.
...but I'm not sure if it works with numerically sorted lines.
Then make a processed copy of the file, where you left-pad the numbers with zeroes to a fixed width, with sprintf, so they sort the same numerically and asciibetically.
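For example (a sketch; nine digits is enough for keys up to 164,664,533):

use strict;
use warnings;

# Left-pad every key to nine digits so that ASCII order equals numeric order.
open my $in,  '<', 'data.txt'        or die $!;
open my $out, '>', 'data.padded.txt' or die $!;
while (<$in>) {
    my ( $key, $val ) = split /\t/;
    printf {$out} "%09d\t%s", $key, $val;     # $val still carries its newline
}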
Right, but again that requires the maximum size of the keys to be known in advance - in which case you can use the binary format much more efficiently.
Re: fast lookups in files
by downer (Monk) on Feb 05, 2008 at 14:25 UTC
Two things I would try:
1) Compression. If you have any experience with bit coding, try to shrink things down a bit. I guess the best way to do this (the only way I've tried) is with inline C. This is very often a way to make data sets much more manageable, though it requires some low-level tinkering. Many simple compression algorithms like variable-byte and Rice coding are pretty easy to implement (if I can do it, then anyone can!).
2) I don't think binary search is a good solution here. True, lookups are O(log n), which seems good, but if each of those probes is a disk read, then this is very slow, especially if the lookups are spread all over the file and can't benefit from caching. Rather, I would do the following: before your operation, go through the list, calling tell every so often - say, whenever the leading digit of the first column changes, or whatever leaves you with a manageable number of chunks. Now create a hash that stores, for each chunk, its starting byte offset into the file and its length in bytes; you can use Storable to save this. When looking up a key/value pair, query this hash, then slurp up the associated chunk of the file (only one disk op). Since the chunk is still sorted you can binary-search it, or, since it's in memory, a linear scan may be fine too.
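A rough, untested sketch of that chunk-index idea, assuming tab-separated lines and chunking on a fixed line count rather than on the leading digit (the file names and chunk size are made up):

use strict;
use warnings;
use Storable qw(nstore retrieve);

# One-off: record [ byte offset, length ] of each 100_000-line chunk,
# keyed by the first key in the chunk.
sub build_index {
    open my $fh, '<', 'data.txt' or die $!;
    my ( %index, $first_key, $start );
    my ( $count, $pos ) = ( 0, 0 );
    while ( my $line = <$fh> ) {
        if ( $count++ % 100_000 == 0 ) {
            $index{$first_key}[1] = $pos - $start if defined $first_key;
            ($first_key) = split /\t/, $line, 2;
            $index{$first_key} = [$pos];
            $start = $pos;
        }
        $pos = tell $fh;
    }
    $index{$first_key}[1] = $pos - $start if defined $first_key;
    nstore \%index, 'chunk.idx';
}

sub lookup {
    my $target = shift;
    my $index  = retrieve 'chunk.idx';   # in real code, load this once, not per lookup
    # The chunk that can hold $target is the one with the largest first key <= $target.
    my ($chunk_key) = grep { $_ <= $target } sort { $b <=> $a } keys %$index;
    return unless defined $chunk_key;
    my ( $offset, $length ) = @{ $index->{$chunk_key} };
    open my $fh, '<', 'data.txt' or die $!;
    seek $fh, $offset, 0 or die $!;
    read $fh, my $chunk, $length;
    my ($val) = $chunk =~ /^$target\t(\d+)$/m;   # in-memory scan of one chunk
    return $val;
}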
For the simple things, like packing a number into 4 bytes, there is pack, which does that and many more things. The usual approach would be to pack all the offsets of the lines (or a better key) into one long string, four bytes per line:
my $offsets = '';
my $pos = 0;
while (<$fh>) {
    $offsets .= pack 'N', $pos;   # offset at which the line just read starts
    $pos = tell $fh;              # start of the next line
}
Whether to store an offset for every line of the whole file, or only one for every 10 lines, is a matter of (low-level) optimization.
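Fetching line $n later is then just an unpack and a seek, e.g. (sketch):

my $line_start = unpack 'N', substr $offsets, $n * 4, 4;   # offset of line $n (0-based)
seek $fh, $line_start, 0 or die "seek: $!";
my $line = <$fh>;

That is exactly what a binary search over line numbers needs to fetch its probe line.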
Re: fast lookups in files
by Skeeve (Parson) on Feb 05, 2008 at 14:20 UTC
Instead of using a full relational database, how about using dbmopen? It binds a hash to a Berkeley DB file. Unless you use "keys" or "values", you should be fine with that, I guess.
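For example (a sketch; the DBM file name is made up):

my %lookup;
dbmopen( %lookup, 'lookupdb', 0644 ) or die "dbmopen failed: $!";
$lookup{32} = 14456;                # writes go straight to disk
print "$lookup{32}\n";              # reads fetch a single record from disk
dbmclose %lookup;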
s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: fast lookups in files
by mirod (Canon) on Feb 05, 2008 at 14:23 UTC
Well, this looks like a perfect job for GDBM_File: it ties a hash to a GDBM file. So you load the hash into the file, and then you can use it just as a regular hash, except it's on disk. That's probably the solution that will involve the least extra coding on your part.
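Roughly like this (a sketch based on the GDBM_File synopsis; the file names and tab-separated layout are assumptions):

use strict;
use warnings;
use GDBM_File;

tie my %lookup, 'GDBM_File', 'lookup.gdbm', &GDBM_WRCREAT, 0640
    or die "Can't tie GDBM file: $!";

# One-off load from the text file...
open my $in, '<', 'data.txt' or die $!;
while (<$in>) {
    chomp;
    my ( $key, $val ) = split /\t/;
    $lookup{$key} = $val;
}

# ...after which lookups are plain hash accesses backed by disk.
print "$lookup{32}\n";
untie %lookup;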
Re: fast lookups in files
by sundialsvc4 (Abbot) on Feb 06, 2008 at 15:23 UTC
Since you know that the large-file is already sorted, the most efficient processing technique would be to sort the other input file(s) by the same key. Then, you can process the two streams side-by-side sequentially: no “search” is involved.
If it is at all possible to do this, then this is what you should do. “Inconvenience yourself” to do it this way: you'll be glad you did.
You do not have to worry about updating the original file. Write the changes to another file in the same format, then sort that file by the same key, then merge them into the original file. In each case, you're doing the job by means of sorts (which are unexpectedly fast), and sequential reads. When you're finished, you'll have the original input master-file (unchanged), the delta-file (now sorted), and the updated master-file.
Yes, that is exactly how data-processing was done, using punched cards, long before digital computers were invented... And it worked.
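In Perl, the side-by-side walk over two key-sorted streams is only a handful of lines. A sketch, assuming tab-separated master lines, one key per line in the query file, and made-up file names:

use strict;
use warnings;

open my $master,  '<', 'master.sorted'  or die $!;
open my $queries, '<', 'queries.sorted' or die $!;

my $mline = <$master>;
QUERY: while ( my $q = <$queries> ) {
    chomp( my $want = ( split /\t/, $q )[0] );
    while ( defined $mline ) {
        my ( $key, $val ) = split /\t/, $mline;
        if ( $key < $want ) { $mline = <$master>; next }    # advance the master stream
        chomp $val;
        print "$want\t", ( $key == $want ? $val : 'MISSING' ), "\n";
        next QUERY;                               # keep this master line for the next query
    }
    print "$want\tMISSING\n";                     # ran off the end of the master
}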
Failing that, an appropriate strategy would be to use something like a DB_File (e.g. a Berkeley DB) ... but beware. Random seeks are time-consuming in large quantities.
... the most efficient processing technique would be to sort the other input file(s) by the same key.
What other input files?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: fast lookups in files
by Anonymous Monk on Feb 06, 2008 at 04:28 UTC
I have been faced with this problem in the past. What I did was tie a Perl hash to the file system (look up DBI tie).
That way the hash data structure was created on disk as opposed to in memory.
Try that!
Hanif
Re: fast lookups in files
by Vennis (Pilgrim) on Feb 06, 2008 at 17:10 UTC
If the file is sorted you might want to check out the Search::Dict module.
I found it to be amazingly fast.
I used it in my search engine (node 192283).
Basically it works like
use Search::Dict;
my $newpos = look *FILE, $string, 0, 0;   # 0, 0 => plain string order, case-sensitive
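look() positions the filehandle at the first line greater than or equal to $string (stringwise by default, so numeric keys may need the zero-padding trick mentioned earlier in the thread) and returns that position, so the rest of a lookup is just (a sketch, assuming tab-separated lines):

die "look failed" if $newpos < 0;                  # look returns -1 on error
my $line = <FILE>;
my ( $key, $val ) = split /\t/, $line;
print $val if defined $key and $key eq $string;    # a hit only if the keys really match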
Q: Why did the Perlmonk cross the road?
A: He wanted to escape the match.