in reply to Searching text files

Did you ever settle upon a solution?

For grins, I just ran a test that looked up 1000 randomly generated 10-digit telephone numbers (nnn-nnn-nnnn) in a flatfile database containing approximately 6.6% (2e6 / 3e7) of the 1e10 numbers:

    c:\test>572961
    9991230061
    9991230061 is not found
    9991230062
    9991230062 is found
    9991230063
    9991230063 is not found
    Terminating on signal SIGINT(2)

    c:\test>perl -wle"printf qq[%03d%03d%04d\n], int( rand 1000 ), int( rand 1000 ), int( rand 10000 ) for 1 .. 1e3" | perl 572961.pl >nul
    File for area code '000' not found at 572961.pl line 12, <STDIN> line 57.
    999 trials of lookup (32.287s total), 32.319ms/trial

Each lookup takes around 32 ms, which ought to be quick enough for most purposes.

The disk files (for all 999 possible area codes) require 10 GB, though that could trivially be reduced to 2.5 GB. Each area code is stored in a separate file containing one line of 10,000 characters for each of the 999 subarea codes, and each byte in a line represents a single telephone number as a simple '0' or '1'.
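For anyone wanting to try this, here is a minimal sketch of how one area code file in that layout could be generated. It is not the script used to build the test data. The name write_area_file and its callback are hypothetical, and the "\r\n" line ending is an assumption chosen so that each line occupies the 10,002 bytes that the lookup script's seek arithmetic expects.

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): write one area code
# file in the layout described above and return its size in bytes.
# $is_listed is a callback taking ( $subarea, $no ) and returning true/false.
sub write_area_file {
    my( $path, $is_listed ) = @_;

    open my $fh, '>', $path or die "Cannot create '$path': $!";
    binmode $fh;    # write "\r\n" literally: 10,000 flags + 2 bytes per line (assumed)

    for my $subarea ( 1 .. 999 ) {
        # Position $no - 1 holds the flag for number $no, matching the
        # 1-based indexing used by the lookup. The last position is never
        # addressed by a 4-digit number under that convention.
        my $line = join '',
            map { $is_listed->( $subarea, $_ ) ? '1' : '0' } 1 .. 10_000;
        print {$fh} $line, "\r\n";
    }

    close $fh or die "Cannot close '$path': $!";
    return -s $path;
}
```

Called as write_area_file( './tele/999', sub { ... } ), this produces a file of 999 x 10,002 = 9,991,998 bytes, or roughly 10 MB. 999 such files come to about 9.98 GB, which is consistent with the figures quoted above.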

The lookup process is:

  1. Split the number into its 3 component parts (nnn-nnn-nnnn).
  2. Open the appropriate area code file.
  3. Seek to the appropriate subarea line and read it.
  4. substr the appropriate byte of the line; its value tells you whether the number is 'found' or 'not found'.

Care to trade 10 MB (2.5 MB) of disk space per area code for a 32 ms lookup time, regardless of how the application grows?

    #! perl -slw
    use strict;
    use Benchmark::Timer;

    my $T = new Benchmark::Timer;

    while( my $number = <STDIN> ) {
        chomp $number;
        $T->start( 'lookup' );
        if( my( $area, $subarea, $no ) = $number =~ m[^(\d{3})(\d{3})(\d{4})$] ) {
            open FILE, '<', "./tele/$area"
                or warn "File for area code '$area' not found" and next;
            seek FILE, ( $subarea - 1 ) * 10002, 0;    # 10,002 bytes per line: 10,000 flags plus the line ending
            my $mask = <FILE>;                         # one line = one subarea
            print "$number is ",
                ( substr $mask, ( $no - 1 ), 1 ) ? 'found' : 'not found';    # '0' is false, '1' is true
        }
        else {
            print "Invalid telephone number: $number";
        }
        $T->stop( 'lookup' );
    }

    $T->report;
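A possible variation, offered only as a sketch and not as what was benchmarked above: since the flag sits at a fixed byte offset, one seek and a 1-byte read would do instead of reading the whole 10,000-character line. The names flag_offset and is_listed are hypothetical. The code assumes the same 10,002-byte line layout and the same 1-based indexing, and it returns undef for subarea 000 and number 0000, which would otherwise give a negative offset.

```perl
use strict;
use warnings;

use constant LINE_BYTES => 10_002;    # 10,000 flags + 2-byte line ending (assumed layout)

# Hypothetical helper: byte offset of a number's flag within its area code
# file, or undef when the 1-based layout has no slot for it.
sub flag_offset {
    my( $subarea, $no ) = @_;
    return undef if $subarea < 1 or $no < 1;
    return ( $subarea - 1 ) * LINE_BYTES + ( $no - 1 );
}

# Hypothetical helper: returns 1 (found), 0 (not found) or undef (cannot tell).
sub is_listed {
    my( $dir, $number ) = @_;

    my( $area, $subarea, $no ) = $number =~ m[^(\d{3})(\d{3})(\d{4})$]
        or return undef;                      # not a 10-digit number

    my $offset = flag_offset( $subarea, $no );
    return undef unless defined $offset;

    open my $fh, '<', "$dir/$area" or return undef;    # no file for this area code
    binmode $fh;                              # raw byte offsets, no newline translation
    seek $fh, $offset, 0 or return undef;

    my $got = read $fh, my $flag, 1;          # read just the one flag byte
    return undef unless $got;                 # offset beyond end of file, or read error

    return $flag eq '1' ? 1 : 0;
}
```

With the test data above, is_listed( './tele', '9991230062' ) would be expected to return 1 and is_listed( './tele', '9991230063' ) to return 0. Whether this is measurably faster than reading the whole line depends on the disk and OS caching, so it would need timing before drawing any conclusion.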
