When in doubt, use brute force!

The usual way of going about this is to fan the list out into many smaller files, and then identify the right file when you search.

The following code will split out a list of 250 000 words between 1 and 24 characters long into a three level directory structure in about 5 seconds on my machine. Words shorter than 3 are collected in a file at the top level of subdirs. For the rest, the words are stored in one file in the leaf directory.

#! /usr/local/bin/perl -w use strict; use warnings; use File::Path 'mkpath'; my %fd; my $root = 'cache'; $|=1; while (<>) { chomp; my @arr = split //, $_, 4; my $fd; if (@arr < 4) { if (!$fd{1}{$arr[0]}) { my $dir = "$root/$arr[0]"; mkpath $dir; open $fd{1}{$arr[0]}, '>', "$dir/list.txt" or die "open $d +ir/list.txt"; } $fd = $fd{1}{$arr[0]}; } else { if (!$fd{n}{$arr[0]}{$arr[1]}{$arr[2]}) { my $dir = "$root/$arr[0]/$arr[1]/$arr[2]"; mkpath $dir; open $fd{n}{$arr[0]}{$arr[1]}{$arr[2]}, '>', "$dir/list.tx +t" or die "open $dir/list.txt"; } $fd = $fd{n}{$arr[0]}{$arr[1]}{$arr[2]}; } print $fd "$_\n"; print "." unless $. % 5000; }

This is a bit wasteful; it would be better to hoist all the 'list' files up into the parent directory, but you get the idea (hey! it's free code). Also, if all your words are longer than the fan-out then matters are simplified.

It is then a simple matter of programming to take a word, figure out what file you should be looking at, and then see if the word found or not.

If you play around with the right fan-out so that no file is particularly big, you can slurp it into memory, and treat it as a string. Then you search for /\b$target\b and the regexp engine will run a fast Boyer-Moore search on the string. I have used newlines to separate the words, but there is nothing stopping you from using \x0 (binary zero) or some other sentinel.

If your words can use the all the byte patterns from 0 to 255 you'll have to do something a bit smarter.

You can also play with a cache of slurped files, and keep the last n files opened on hand. If there is some locality of reference in the search space, you'll save on throwing away a slurped file just to reopen it again on the next word.

• another intruder with the mooring in the heart of the Perl


In reply to Re: Best way to look-up a string amongst 100 million of its peers by grinder
in thread Best way to look-up a string amongst 100 million of its peers by anson

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.