anson has asked for the wisdom of the Perl Monks concerning the following question:

Say I have a list of 100 million words. I want to QUICKLY query that list to see if a word exists in it. Anyone have some ideas about the best way to do this? A key thing here is speed. A database seems overkill, since all I want to do is a look-up and not manipulate the data or relate it to other things. I probably can't keep the whole thing in memory as the words can be up to 64 characters long. Need some brainstorming!

Replies are listed 'Best First'.
Re: Best way to look-up a string amongst 100 million of its peers
by moritz (Cardinal) on Mar 25, 2008 at 20:36 UTC
    Of course the solution strongly depends on your exact problem (how long are the words? how often do they change? how much time can you invest in building an index? how often do you run queries?).

    One thing you could do is keep a sorted file of fixed-length records and run a binary search on it.
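
    For illustration, a minimal sketch of that approach, assuming the words have already been padded out to fixed-length records and written in sorted order (the file name and record length below are made up for the example):

    use strict;
    use warnings;

    my $RECLEN = 64;              # up to 64 chars per word, NUL-padded (assumed layout)
    my $file   = 'words.dat';     # sorted, fixed-length records (assumed)

    sub word_exists {
        my ($target) = @_;
        open my $fh, '<', $file or die "open $file: $!";
        my ($lo, $hi) = (0, (-s $fh) / $RECLEN - 1);
        while ($lo <= $hi) {
            my $mid = int( ($lo + $hi) / 2 );
            seek $fh, $mid * $RECLEN, 0 or die "seek: $!";
            read $fh, my $rec, $RECLEN;
            (my $word = $rec) =~ s/\0+\z//;        # strip the padding
            if    ($word lt $target) { $lo = $mid + 1 }
            elsif ($word gt $target) { $hi = $mid - 1 }
            else                     { return 1 }
        }
        return 0;
    }

    Each probe is one seek and one small read, so even 100 million records need only about 27 probes.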

    You can use a hash to cache the most frequently used search terms, and an index of the first 2**16 (or whatever) positions in the file that a binary search would visit.

    Or you just use a database engine in the hope that it implements the index very efficiently.

    (I seem to recall that we had a similar question recently, I'll see if I can find the link). Update: here it is: fast lookups in files. It has some very good answers and is worth a read.

Re: Best way to look-up a string amongst 100 million of its peers
by CountZero (Bishop) on Mar 25, 2008 at 20:28 UTC
    For speed nothing will beat a hash, BUT keeping such a big hash in memory seems difficult, and of course loading the hash from disk will take its time as well.

    If you need to check this list repeatedly, in the long run you will be better off with a database (nicely indexed of course!).

    On the other hand, 100 million words? I don't think that many distinct words exist, so you will have repeated words and far fewer unique entries than 100 million, perhaps only a few hundred thousand, in which case a hash will be a good idea.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Best way to look-up a string amongst 100 million of its peers
by BrowserUk (Patriarch) on Mar 26, 2008 at 06:16 UTC

    Not enough information.

    • What range of characters are they constructed from?
    • What are you looking up?

      Just its existence in your list, or is there some associated piece of data?

    • How will the lookups be organised?

      Will they be completely random? Or will they come in related bunches?

    • What kind of app is doing the looking?

      A single instance of a long-running application doing thousands of lookups per run?

      Many concurrent instances doing one or a dozen lookups per invocation; e.g. a CGI?

    • How fast do the lookups need to be?

      The obvious answer is "as fast as possible", but for example, in a CGI app there are inherent network delays of up to a few seconds, so if the (one or dozen) lookups per run take milliseconds rather than microseconds, your users will not notice.

      On the other hand, if the user has to wait for 50,000 lookups, the difference would be significant.

    • What format is the data currently in?

      Can that format change?

    • Is the list static, or nearly static, or changing hourly, daily, weekly?

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks for all the advice.

      I had originally tried MySQL and loading the list was fast enough (load data infile...) but I gave up on the indexing process after 24 hours.

      On your recommendation I have managed to get the list into CDB and the performance of that seems pretty decent! I also love that I don't need to be running a database for such a simple look-up.

      To answer all your questions about the nature of the data... just go to the URL where I just now got this up and running ;)

      https://domize.com

        I've got to say, that's a pretty nifty little tool :)

        One suggestion is to make it only look up after a change if the field hasn't changed again in the last half second or so. I tried a few domains. Some results weren't exactly right, but it's a cool little script nonetheless.

Re: Best way to look-up a string amongst 100 million of its peers
by stiller (Friar) on Mar 25, 2008 at 20:27 UTC
    This is a kind of problem that lends itself to clever hacks, but finding the right hack would require some more info on the nature of the strings. It's not human-readable words, is it?

    The simplest thing that could possibly work would be to create a hash tied to a file and read it all into that. That is so simple that you can very quickly assess whether that solution is fast enough or not.

Re: Best way to look-up a string amongst 100 million of its peers
by kyle (Abbot) on Mar 25, 2008 at 20:32 UTC

    Something like DBM::Deep or even plain old DB_File should be able to handle this pretty well. Read the list in once, create the hash on disk, and then you can query it fairly quickly without ever having to read the original list again.
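
    A rough sketch of that route with DB_File (the file names here are placeholders, and the B-tree/hash choice and flags are only one reasonable configuration):

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie a hash to an on-disk B-tree; created on first run, reused afterwards.
    tie my %words, 'DB_File', 'words.db', O_RDWR | O_CREAT, 0644, $DB_BTREE
        or die "tie words.db: $!";

    # One-time build: read the original list once.
    open my $in, '<', 'wordlist.txt' or die "open wordlist.txt: $!";
    while (my $word = <$in>) {
        chomp $word;
        $words{$word} = 1;
    }
    close $in;

    # Lookups (these also work in later runs without re-reading the list).
    print exists $words{'example'} ? "yes\n" : "no\n";

    untie %words;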

      DBM::Deep is definitely the wrong choice
Re: Best way to look-up a string amongst 100 million of its peers
by Corion (Patriarch) on Mar 25, 2008 at 20:34 UTC

    If your text file is constant, I guess that CDB_File is a possible solution if you have more disk space than RAM. If you have a cluster of machines, most likely Cache::Memcached is a good solution, because it lets you keep the whole file (distributed) in RAM and search it relatively quickly.
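
    A hedged sketch of the CDB_File route (file names are illustrative): build the constant database once, then tie it read-only for lookups.

    use strict;
    use warnings;
    use CDB_File;

    # Build step: run once, or whenever the word list changes.
    my $maker = CDB_File->new('words.cdb', 'words.cdb.tmp')
        or die "CDB_File->new: $!";
    open my $in, '<', 'wordlist.txt' or die "open wordlist.txt: $!";
    while (my $word = <$in>) {
        chomp $word;
        $maker->insert($word, 1);
    }
    close $in;
    $maker->finish;

    # Lookup step: each query needs only a couple of reads against the on-disk hash.
    tie my %words, 'CDB_File', 'words.cdb' or die "tie words.cdb: $!";
    print exists $words{'example'} ? "found\n" : "not found\n";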

      BerkeleyDB is actually faster than CDB_File, as seen in previous benchmarks on PerlMonks. I wouldn't recommend using memcached for anything where lost data would be considered a problem.

        I only found SQLite vs CDB_File vs BerkeleyDB, and there CDB_File came out on top of BerkeleyDB where it entered the contest. On the other hand, that benchmark is from 2002, and CDB_File didn't make it into the second round due to build problems on Windows.

Re: Best way to look-up a string amongst 100 million of its peers
by akho (Hermit) on Mar 25, 2008 at 20:58 UTC
    Bloom filters may be of use; you will need to check all positive answers in some other way (or rely on probabilities), but they may speed up the process.

    A CPAN search yields several implementations, yet I am not familiar with any of them.
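
    For the idea itself, here is a minimal hand-rolled sketch (not one of the CPAN modules): a vec() bit string plus hash positions sliced out of an MD5 digest. The sizes are illustrative, and a positive answer still has to be confirmed against the real list.

    use strict;
    use warnings;
    use Digest::MD5 'md5';

    my $BITS   = 2**30;                 # 1 Gbit = 128 MB of RAM; size to your false-positive target
    my $filter = "\0" x ($BITS / 8);

    sub positions {
        my ($word) = @_;
        # Four 32-bit slices of the MD5 digest act as four hash functions.
        return map { $_ % $BITS } unpack 'N4', md5($word);
    }

    sub bloom_add {
        my ($word) = @_;
        vec($filter, $_, 1) = 1 for positions($word);
    }

    sub bloom_might_have {
        my ($word) = @_;
        for my $pos (positions($word)) {
            return 0 unless vec($filter, $pos, 1);
        }
        return 1;    # "probably there": confirm with an exact lookup
    }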

Re: Best way to look-up a string amongst 100 million of its peers
by grinder (Bishop) on Mar 26, 2008 at 12:10 UTC

    When in doubt, use brute force!

    The usual way of going about this is to fan the list out into many smaller files, and then identify the right file when you search.

    The following code will split out a list of 250,000 words between 1 and 24 characters long into a three-level directory structure in about 5 seconds on my machine. Words shorter than 3 characters are collected in a file at the top level of subdirs. For the rest, the words are stored in one file in the leaf directory.

    #! /usr/local/bin/perl -w
    use strict;
    use warnings;
    use File::Path 'mkpath';

    my %fd;
    my $root = 'cache';
    $| = 1;

    while (<>) {
        chomp;
        my @arr = split //, $_, 4;
        my $fd;
        if (@arr < 4) {
            if (!$fd{1}{$arr[0]}) {
                my $dir = "$root/$arr[0]";
                mkpath $dir;
                open $fd{1}{$arr[0]}, '>', "$dir/list.txt"
                    or die "open $dir/list.txt";
            }
            $fd = $fd{1}{$arr[0]};
        }
        else {
            if (!$fd{n}{$arr[0]}{$arr[1]}{$arr[2]}) {
                my $dir = "$root/$arr[0]/$arr[1]/$arr[2]";
                mkpath $dir;
                open $fd{n}{$arr[0]}{$arr[1]}{$arr[2]}, '>', "$dir/list.txt"
                    or die "open $dir/list.txt";
            }
            $fd = $fd{n}{$arr[0]}{$arr[1]}{$arr[2]};
        }
        print $fd "$_\n";
        print "." unless $. % 5000;
    }

    This is a bit wasteful; it would be better to hoist all the 'list' files up into the parent directory, but you get the idea (hey! it's free code). Also, if all your words are longer than the fan-out then matters are simplified.

    It is then a simple matter of programming to take a word, figure out what file you should be looking at, and then see whether the word is found or not.

    If you play around with the right fan-out so that no file is particularly big, you can slurp it into memory and treat it as a string. Then you search for /\b$target\b/ and the regexp engine will run a fast Boyer-Moore search on the string. I have used newlines to separate the words, but there is nothing stopping you from using \x0 (binary zero) or some other sentinel.
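
    A hypothetical lookup counterpart to the splitter above, using the same path layout: pick the file from the first characters, slurp it, and let the regexp engine do the scan (anchored to whole lines rather than \b, in case the words contain non-word characters):

    use strict;
    use warnings;

    my $root = 'cache';

    sub found {
        my ($word) = @_;
        my @c    = split //, $word, 4;
        my $file = @c < 4 ? "$root/$c[0]/list.txt"
                          : "$root/$c[0]/$c[1]/$c[2]/list.txt";
        return 0 unless -e $file;
        open my $fh, '<', $file or die "open $file: $!";
        my $blob = do { local $/; <$fh> };       # slurp the whole (small) bucket
        return $blob =~ /^\Q$word\E$/m ? 1 : 0;
    }

    print found($ARGV[0]) ? "found\n" : "not found\n";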

    If your words can use all the byte patterns from 0 to 255, you'll have to do something a bit smarter.

    You can also play with a cache of slurped files, and keep the last n files opened on hand. If there is some locality of reference in the search space, you'll save on throwing away a slurped file just to reopen it again on the next word.

    • another intruder with the mooring in the heart of the Perl

Re: Best way to look-up a string amongst 100 million of its peers
by zentara (Cardinal) on Mar 26, 2008 at 12:27 UTC
    The simplest thing (assuming you are talking about ASCII words) is to break up the 100 million list into 24 (or whatever number is suitable for size) smaller lists that all begin with the same 1 (or 2) letters. This would not take much more disk space, but searching them would be simpler and faster. All you need to do is regex the word for the first 1 (or 2) letters, then loop through the file containing those words. Of course, a lot of speed improvements can be made by combining things like "Q,W,X,Y,Z" into one file, and breaking things like "C" into "CA..CL..CR" etc., depending on the initial letter frequency of your 100 million list.
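
    A hedged sketch of that scheme (the bucket file names are made up): pick the bucket from the first letter, then loop through it line by line.

    use strict;
    use warnings;

    sub in_list {
        my ($word) = @_;
        my $letter = lc substr $word, 0, 1;
        my $file   = "buckets/$letter.txt";    # one file per first letter (assumed layout)
        open my $fh, '<', $file or return 0;   # no bucket means no such word
        while (my $line = <$fh>) {
            chomp $line;
            return 1 if $line eq $word;
        }
        return 0;
    }

    print in_list($ARGV[0]) ? "yes\n" : "no\n";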

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
      The simplest thing is to reinvent b-trees one step at a time?

      -sam

        How is a b-tree simpler than breaking into smaller sorted files? I don't even know what a b-tree is.... they ran me out of college before we got to that. :-) But I do know how to sort and split files.

        I'm not really a human, but I play one on earth. Cogito ergo sum a bum