Did you ever settle upon a solution?

For grins, I just ran a test that looked up 1000 randomly generated 10-digit telephone numbers (nnn-nnn-nnnn) in a flatfile database containing approximately 6.6% (2e6 / 3e7) of the 1e10 numbers:

c:\test>572961
9991230061
9991230061 is not found
9991230062
9991230062 is found
9991230063
9991230063 is not found
Terminating on signal SIGINT(2)

c:\test>perl -wle"printf qq[%03d%03d%04d\n], int( rand 1000 ), int( rand 1000 ), int( rand 10000 ) for 1 .. 1e3" | perl 572961.pl >nul
File for area code '000' not found at 572961.pl line 12, <STDIN> line 57.
999 trials of lookup (32.287s total), 32.319ms/trial

Each lookup takes around 32 ms, which ought to be quick enough for most purposes.

The disk files (for all 999 possible area codes) require 10 GB, though that could trivially be reduced to 2.5 GB. Each area code is stored in a separate file, with one line of 10,000 characters for each of the 999 subarea codes; each byte in a line represents a single telephone number as a simple '0' or '1'.
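For anyone wanting to try this, here is a rough sketch of how one such area code file might be generated. The sub name build_area_file and the hash of 7-digit keys are made up for illustration; the layout (999 records of 10,000 flag bytes plus a 2-byte line ending, with subarea and number both counted from 1) is an assumption inferred from the offsets the lookup script uses, not something taken from the original generator.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): write the mask file
# for one area code. $listed maps 7-digit strings ("sssnnnn") to true values.
# Assumed layout: record ($subarea - 1), byte ($no - 1), 10,002 bytes per
# record (10,000 flag bytes plus CRLF).
sub build_area_file {
    my( $path, $listed ) = @_;

    # Start with every number marked as absent.
    my @masks = ( '0' x 10_000 ) x 999;

    for my $key ( keys %$listed ) {
        my( $subarea, $no ) = $key =~ m[^(\d{3})(\d{4})$] or next;
        next if $subarea == 0 or $no == 0;   # not representable when counting from 1
        substr( $masks[ $subarea - 1 ], $no - 1, 1 ) = '1';
    }

    open my $out, '>', $path or die "Cannot write '$path': $!";
    binmode $out;                            # emit CRLF ourselves on every platform
    print {$out} "$_\r\n" for @masks;
    close $out or die "Cannot close '$path': $!";
    return;
}
```

Called as build_area_file( './tele/999', { '1230062' => 1 } ), this would produce a file in which only 999-123-0062 is flagged.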

The lookup process is:

1. Split the number into its 3 component parts (nnn-nnn-nnnn).
2. Open the appropriate area code file.
3. Seek to the appropriate subarea line and read it.
4. substr the appropriate byte of the line; its value tells you whether the number is 'found' or 'not found'.

Care to trade 10 MB (2.5 MB) of disk space per area code for 32 ms lookup time regardless of how the application grows?

#! perl -slw
use strict;
use Benchmark::Timer;

my $T = Benchmark::Timer->new;


while( my $number = <STDIN> ) {
    chomp $number;
    $T->start( 'lookup' );
    if( my( $area, $subarea, $no ) = $number =~ m[^(\d{3})(\d{3})(\d{4})$] ) {
        open my $fh, '<', "./tele/$area"    ## one file per area code
            or warn "File for area code '$area' not found" and next;
        ## Each record is 10,000 flag bytes plus a 2-byte line ending.
        ## Note: subarea 000 and number 0000 fall outside this 1-based layout.
        seek $fh, ( $subarea - 1 ) * 10002, 0;
        my $mask = <$fh>;
        ## The flag byte is '0' (false) or '1' (true).
        print "$number is ", ( substr $mask, ( $no - 1 ), 1 )
            ? 'found' : 'not found';
    }
    else {
        print "Invalid telephone number: $number";
    }
    $T->stop( 'lookup' );
}

$T->report;
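
If reading a whole 10,000-byte line per lookup bothers you, the seek and the substr can be folded into a single byte offset. The following is only a sketch under the same assumed layout (10,002-byte records, subarea and number counted from 1); lookup_flag and its $dir parameter are hypothetical names, and it returns undef for anything the layout cannot represent.

```perl
use strict;
use warnings;

# Hypothetical variant of the lookup: read just the one flag byte.
# Returns 1 (found), 0 (not found) or undef (bad number / no file).
sub lookup_flag {
    my( $dir, $number ) = @_;

    my( $area, $subarea, $no ) = $number =~ m[^(\d{3})(\d{3})(\d{4})$]
        or return;                              # malformed number
    return if $subarea == 0 or $no == 0;        # outside the 1-based layout

    open my $fh, '<', "$dir/$area" or return;   # no file for this area code
    binmode $fh;                                # offsets are raw bytes

    # Record start plus position of the flag within the record.
    my $offset = ( $subarea - 1 ) * 10_002 + ( $no - 1 );
    seek( $fh, $offset, 0 ) or return;
    read( $fh, my $flag, 1 ) or return;         # 0 bytes read means past EOF

    return $flag eq '1' ? 1 : 0;
}
```

Usage would be along the lines of print lookup_flag( './tele', '9991230062' ) ? 'found' : 'not found'. Whether it is measurably faster than reading the line is something I have not timed.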

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: Searching text files by BrowserUk
in thread Searching text files by SteveS832001
