in reply to Searching text files

SQLite would be an excellent database alternative for your needs. Two million phone numbers isn't a big deal for a lightweight database like SQLite. And it's self-contained; all you need is DBI and DBD::SQLite.

But I wonder if even a lightweight database is needed. What you could use instead is a series of area-code-specific flat files. Maintain each flat file in sorted order, with one phone number per line, so that every line is the same length. That gives you a simple starting point for doing binary searches. If each line is twelve characters long (an 11-digit phone number plus a newline), you can use seek to read from computed offsets within the file.
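As a rough sketch of that seek arithmetic (the file name and sample numbers here are made up for illustration):

```perl
use strict;
use warnings;

# Layout assumed above: 11 digits + newline = 12 bytes per record.
my $recordsize = 12;

# Build a small sorted sample file (stand-in for one area-code file).
my $file = 'phones-555.txt';
open my $out, '>', $file or die "Can't write $file: $!";
print {$out} "$_\n" for qw(15550000001 15550000002 15550000005 15550000009);
close $out;

# Fetch record number $n (0-based) by seeking straight to its offset.
sub nth_record {
    my ($filename, $n) = @_;
    open my $fh, '<', $filename or die "Can't read $filename: $!";
    seek $fh, $n * $recordsize, 0 or die "seek failed: $!";
    my $line = <$fh>;
    chomp $line;
    return $line;
}

print nth_record($file, 2), "\n";    # third record: 15550000005
```

Because every record is the same width, record $n always starts at byte $n * $recordsize, which is what makes the binary search below possible without scanning the file.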

The idea behind a binary search is to start at the halfway point in the file. If that record is too 'high', repeat the process on the first half; if it's too low, repeat on the second half; and so on, halving the search range each time. There are lots of examples on the net of binary search algorithms. The good thing about a binary search is that it runs in O(log n) time, whereas a linear search (what you're doing now) takes O(n) time, and you get that speedup without the complexity of a database.

...just another thought. But honestly, have we forgotten our Computer Science 101 classes? A binary search of a fixed-width-record flat file is "how it's done" programmatically. Using a database is convenient, but it shouldn't be the only option considered. As for slurping flat files, that doesn't scale well, and with 2 million (and growing) records, you need to think about that.


Dave

Re^2: Searching text files
by runrig (Abbot) on Sep 14, 2006 at 21:26 UTC
      Seems almost like overkill when it can be done in just a few lines. The function below searches three million records very quickly.
      # Usage: record_search($search_string, $filename, 11);
      # Returns the record number if the record is found, undef otherwise.
      # Searched file must be sorted.
      use Carp;
      use IO::File;

      sub record_search {
          my ($key, $filename, $recordsize) = @_;
          my $fh = IO::File->new($filename, 'r')
              or croak("Could not open `$filename': $!");
          my $hi = ($fh->stat)[7] / $recordsize;
          my $lo = 0;
          while ($lo < $hi) {
              my $mid = int(($lo + $hi) / 2);
              $fh->seek($mid * $recordsize, 0);
              my $target = $fh->getline();
              chomp($target);
              if ($target < $key) {
                  $lo = $mid + 1;
              }
              elsif ($target == $key) {
                  return $mid;
              }
              else {
                  $hi = $mid;
              }
          }
          return;
      }
      Update: Forgot to mention: even though this function opens the file every time, it can search 3 million records about 2000 times per second on a 3 GHz Pentium 4. I can't see much justification for anything more complicated. Threads? Bit strings? Crazy! :)
        Sometimes being able to do it in "just a few lines" isn't the point (take Data::Page for example). It's having to always rewrite, correctly, those lines every time you want to solve this particular problem. In this case, the module makes for a more flexible solution, and it's a "few lines" (about 12 actually) that I don't have to rewrite. The downside is that it is about 20% slower, but that's a non-issue when response to any single query is imperceptible.
        I can't see much justification for anything more complicated.


        Optionally searching for strings instead of numbers?
        Not having to have a fixed record size, or not having to know what the record size is? (Note: there is a set_line_length function in File::SortedSeek which claims to improve performance, but I could not see any difference in speed from using it.)

        Anyway, here's your function using F:SS instead:

        use Carp;
        use IO::File;
        use File::SortedSeek;

        sub record_search {
            my ($key, $filename, $recordsize) = @_;
            my $fh = IO::File->new($filename, 'r')
                or croak("Could not open $filename: $!");
            # I don't see any difference in speed with the next line:
            # File::SortedSeek::set_line_length($recordsize);
            File::SortedSeek::numeric($fh, $key);
            return File::SortedSeek::was_exact();
        }

        Update: Looking at the answers below, I learned about Search::Dict, which would work just about as well and has been in core Perl since at least 5.6.1, but it still supports my point about saving just a few lines.