anson has asked for the wisdom of the Perl Monks concerning the following question:

Say I have a list of 100 million words. I want to QUICKLY query that list to see if a word exists in it. Anyone have some ideas about the best way to do this? A key thing here is speed. A database seems overkill, since all I want to do is a look-up and not manipulate the data or relate it to other things. I probably can't keep the whole thing in memory as the words can be up to 64 characters long. Need some brainstorming!

Replies are listed 'Best First'.
Re: Best way to look-up a string amongst 100 million of its peers
by moritz (Cardinal) on Mar 25, 2008 at 20:36 UTC
    Of course the solution strongly depends on your exact problem (how long are the words? how often do they change? how much time can you invest in building an index? how often do you run queries?).

    One thing you could do is keep a sorted file of fixed-length records and run a binary search on it.
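
    For illustration, a minimal sketch of that approach, assuming the words have already been padded out to fixed-length records and written in sorted order (the file name and record length below are made up for the example):

    use strict;
    use warnings;

    my $RECLEN = 64;              # up to 64 chars per word, NUL-padded (assumed layout)
    my $file   = 'words.dat';     # sorted, fixed-length records (assumed)

    sub word_exists {
        my ($target) = @_;
        open my $fh, '<', $file or die "open $file: $!";
        my ($lo, $hi) = (0, (-s $fh) / $RECLEN - 1);
        while ($lo <= $hi) {
            my $mid = int( ($lo + $hi) / 2 );
            seek $fh, $mid * $RECLEN, 0 or die "seek: $!";
            read $fh, my $rec, $RECLEN;
            (my $word = $rec) =~ s/\0+\z//;        # strip the padding
            if    ($word lt $target) { $lo = $mid + 1 }
            elsif ($word gt $target) { $hi = $mid - 1 }
            else                     { return 1 }
        }
        return 0;
    }

    Each probe is one seek and one small read, so even 100 million records need only about 27 probes.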

    You can use a hash to cache the most frequently used search terms, and an index of the first 2**16 (or whatever) positions in the file that a binary search would visit.

    Or you just use a database engine in the hope that it implements the index very efficiently.

    (I seem to recall that we had a similar question recently, I'll see if I can find the link). Update: here it is: fast lookups in files. It has some very good answers and is worth a read.

Re: Best way to look-up a string amongst 100 million of its peers
by CountZero (Bishop) on Mar 25, 2008 at 20:28 UTC
    For speed nothing will beat a hash, BUT keeping such a big hash in memory seems difficult, and of course loading the hash from disk will take its time as well.

    If you need to check this list repeatedly, in the long run you will be better off with a database (nicely indexed of course!).

    On the other hand, 100 million words? I don't think that many distinct words exist, so you will have repeated words and far fewer unique entries than 100 million, perhaps only a few hundred thousand, in which case a hash will be a good idea.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Best way to look-up a string amongst 100 million of its peers
by BrowserUk (Patriarch) on Mar 26, 2008 at 06:16 UTC

    Not enough information.

    • What range of characters are they constructed from?
    • What are you looking up?

      Just its existence in your list, or is there some associated piece of data?

    • How will the lookups be organised?

      Will they be completely random? Or will they come in related bunches?

    • What kind of app is doing the looking?

      A single instance of a long-running application doing thousands of lookups per run?

      Many concurrent instances doing one or a dozen lookups per invocation; e.g. a CGI?

    • How fast do the lookups need to be?

      The obvious answer is "as fast as possible", but for example, in a CGI app there are inherent network delays of up to a few seconds, so if the (one or dozen) lookups per run take milliseconds rather than microseconds, your users will not notice.

      On the other hand, if the user has to wait for 50,000 lookups, the difference would be significant.

    • What format is the data currently in?

      Can that format change?

    • Is the list static, or nearly static, or changing hourly, daily, weekly?

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks for all the advice.

      I had originally tried MySQL and loading the list was fast enough (load data infile...) but I gave up on the indexing process after 24 hours.

      On your recommendation I have managed to get the list into CDB and the performance of that seems pretty decent! I also love that I don't need to be running a database for such a simple look-up.

      To answer all your questions about the nature of the data... just go to the URL where I just now got this up and running ;)

      https://domize.com

        I've got to say, that's a pretty nifty little tool :)

        One suggestion is to make it only look up after a change if the field hasn't changed again in the last half second or so. I tried a few domains. Some results weren't exactly right, but it's a cool little script nonetheless.

Re: Best way to look-up a string amongst 100 million of its peers
by stiller (Friar) on Mar 25, 2008 at 20:27 UTC
    This is a kind of problem that lends itself to clever hacks, but finding the right hack would require some more info on the nature of the strings. It's not human-readable words, is it?

    The simplest thing that could possibly work would be to create a hash tied to a file and read it all into that. That is so simple that you can very quickly assess whether that solution is fast enough or not.

Re: Best way to look-up a string amongst 100 million of its peers
by kyle (Abbot) on Mar 25, 2008 at 20:32 UTC

    Something like DBM::Deep or even plain old DB_File should be able to handle this pretty well. Read the list in once, create the hash on disk, and then you can query it fairly quickly without ever having to read the original list again.
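
    A rough sketch of that route with DB_File (the file names here are placeholders, and the B-tree/hash choice and flags are only one reasonable configuration):

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie a hash to an on-disk B-tree; created on first run, reused afterwards.
    tie my %words, 'DB_File', 'words.db', O_RDWR | O_CREAT, 0644, $DB_BTREE
        or die "tie words.db: $!";

    # One-time build: read the original list once.
    open my $in, '<', 'wordlist.txt' or die "open wordlist.txt: $!";
    while (my $word = <$in>) {
        chomp $word;
        $words{$word} = 1;
    }
    close $in;

    # Lookups (these also work in later runs without re-reading the list).
    print exists $words{'example'} ? "yes\n" : "no\n";

    untie %words;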

      DBM::Deep is definitely the wrong choice
Re: Best way to look-up a string amongst 100 million of its peers
by Corion (Patriarch) on Mar 25, 2008 at 20:34 UTC

    If your text file is constant, I guess that CDB_File is a possible solution if you have more disk space than RAM. If you have a cluster of machines, most likely Cache::Memcached is a good solution, because it lets you keep the whole file (distributed) in RAM and search it relatively quickly.
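
    A hedged sketch of the CDB_File route (file names are illustrative): build the constant database once, then tie it read-only for lookups.

    use strict;
    use warnings;
    use CDB_File;

    # Build step: run once, or whenever the word list changes.
    my $maker = CDB_File->new('words.cdb', 'words.cdb.tmp')
        or die "CDB_File->new: $!";
    open my $in, '<', 'wordlist.txt' or die "open wordlist.txt: $!";
    while (my $word = <$in>) {
        chomp $word;
        $maker->insert($word, 1);
    }
    close $in;
    $maker->finish;

    # Lookup step: each query needs only a couple of reads against the on-disk hash.
    tie my %words, 'CDB_File', 'words.cdb' or die "tie words.cdb: $!";
    print exists $words{'example'} ? "found\n" : "not found\n";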

      BerkeleyDB is actually faster than CDB_File, as seen in previous benchmarks on PerlMonks. I wouldn't recommend using memcached for anything where lost data would be considered a problem.

        I only found SQLite vs CDB_File vs BerkeleyDB, and there CDB_File came out on top of BerkeleyDB where it entered the contest. On the other hand, that benchmark is from 2002, and CDB_File didn't make it into the second round due to build problems on Windows.

Re: Best way to look-up a string amongst 100 million of its peers
by akho (Hermit) on Mar 25, 2008 at 20:58 UTC
    Bloom filters may be of use; you will need to check all positive answers in some other way (or rely on probabilities), but they may speed up the process.

    A CPAN search yields several implementations, yet I am not familiar with any of them.
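
    For the idea itself, here is a minimal hand-rolled sketch (not one of the CPAN modules): a vec() bit string plus hash positions sliced out of an MD5 digest. The sizes are illustrative, and a positive answer still has to be confirmed against the real list.

    use strict;
    use warnings;
    use Digest::MD5 'md5';

    my $BITS   = 2**30;                 # 1 Gbit = 128 MB of RAM; size to your false-positive target
    my $filter = "\0" x ($BITS / 8);

    sub positions {
        my ($word) = @_;
        # Four 32-bit slices of the MD5 digest act as four hash functions.
        return map { $_ % $BITS } unpack 'N4', md5($word);
    }

    sub bloom_add {
        my ($word) = @_;
        vec($filter, $_, 1) = 1 for positions($word);
    }

    sub bloom_might_have {
        my ($word) = @_;
        for my $pos (positions($word)) {
            return 0 unless vec($filter, $pos, 1);
        }
        return 1;    # "probably there": confirm with an exact lookup
    }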

Re: Best way to look-up a string amongst 100 million of its peers
by grinder (Bishop) on Mar 26, 2008 at 12:10 UTC

    When in doubt, use brute force!

    The usual way of going about this is to fan the list out into many smaller files, and then identify the right file when you search.

    The following code will split out a list of 250,000 words between 1 and 24 characters long into a three-level directory structure in about 5 seconds on my machine. Words shorter than 3 characters are collected in a file at the top level of subdirs. For the rest, the words are stored in one file in the leaf directory.

    #! /usr/local/bin/perl -w
    use strict;
    use warnings;
    use File::Path 'mkpath';

    my %fd;
    my $root = 'cache';
    $| = 1;

    while (<>) {
        chomp;
        my @arr = split //, $_, 4;
        my $fd;
        if (@arr < 4) {
            if (!$fd{1}{$arr[0]}) {
                my $dir = "$root/$arr[0]";
                mkpath $dir;
                open $fd{1}{$arr[0]}, '>', "$dir/list.txt"
                    or die "open $dir/list.txt";
            }
            $fd = $fd{1}{$arr[0]};
        }
        else {
            if (!$fd{n}{$arr[0]}{$arr[1]}{$arr[2]}) {
                my $dir = "$root/$arr[0]/$arr[1]/$arr[2]";
                mkpath $dir;
                open $fd{n}{$arr[0]}{$arr[1]}{$arr[2]}, '>', "$dir/list.txt"
                    or die "open $dir/list.txt";
            }
            $fd = $fd{n}{$arr[0]}{$arr[1]}{$arr[2]};
        }
        print $fd "$_\n";
        print "." unless $. % 5000;
    }

    This is a bit wasteful; it would be better to hoist all the 'list' files up into the parent directory, but you get the idea (hey! it's free code). Also, if all your words are longer than the fan-out then matters are simplified.

    It is then a simple matter of programming to take a word, figure out what file you should be looking at, and then see whether the word is found or not.

    If you play around with the right fan-out so that no file is particularly big, you can slurp it into memory and treat it as a string. Then you search for /\b$target\b/ and the regexp engine will run a fast Boyer-Moore search on the string. I have used newlines to separate the words, but there is nothing stopping you from using \x0 (binary zero) or some other sentinel.
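
    A hypothetical lookup counterpart to the splitter above, using the same path layout: pick the file from the first characters, slurp it, and let the regexp engine do the scan (anchored to whole lines rather than \b, in case the words contain non-word characters):

    use strict;
    use warnings;

    my $root = 'cache';

    sub found {
        my ($word) = @_;
        my @c    = split //, $word, 4;
        my $file = @c < 4 ? "$root/$c[0]/list.txt"
                          : "$root/$c[0]/$c[1]/$c[2]/list.txt";
        return 0 unless -e $file;
        open my $fh, '<', $file or die "open $file: $!";
        my $blob = do { local $/; <$fh> };       # slurp the whole (small) bucket
        return $blob =~ /^\Q$word\E$/m ? 1 : 0;
    }

    print found($ARGV[0]) ? "found\n" : "not found\n";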

    If your words can use all the byte patterns from 0 to 255, you'll have to do something a bit smarter.

    You can also play with a cache of slurped files, and keep the last n files opened on hand. If there is some locality of reference in the search space, you'll save on throwing away a slurped file just to reopen it again on the next word.

    • another intruder with the mooring in the heart of the Perl

Re: Best way to look-up a string amongst 100 million of its peers
by zentara (Cardinal) on Mar 26, 2008 at 12:27 UTC
    The simplest thing (assuming you are talking about ASCII words) is to break up the 100 million list into 24 (or whatever number is suitable for size) smaller lists that all begin with the same 1 (or 2) letters. This would not take much more disk space, but searching them would be simpler and faster. All you need to do is regex the word for the first 1 (or 2) letters, then loop through the file containing those words. Of course, a lot of speed improvements can be made by combining things like "Q,W,X,Y,Z" into one file, and breaking things like "C" into "CA..CL..CR" etc., depending on the initial letter frequency of your 100 million list.
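
    A hedged sketch of that scheme (the bucket file names are made up): pick the bucket from the first letter, then loop through it line by line.

    use strict;
    use warnings;

    sub in_list {
        my ($word) = @_;
        my $letter = lc substr $word, 0, 1;
        my $file   = "buckets/$letter.txt";    # one file per first letter (assumed layout)
        open my $fh, '<', $file or return 0;   # no bucket means no such word
        while (my $line = <$fh>) {
            chomp $line;
            return 1 if $line eq $word;
        }
        return 0;
    }

    print in_list($ARGV[0]) ? "yes\n" : "no\n";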

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
      The simplest thing is to reinvent b-trees one step at a time?

      -sam

        How is a b-tree simpler than breaking into smaller sorted files? I don't even know what a b-tree is.... they ran me out of college before we got to that. :-) But I do know how to sort and split files.

        I'm not really a human, but I play one on earth. Cogito ergo sum a bum