Re: Rapid text searches (O(1) space and time)
by BrowserUk (Patriarch) on Jun 06, 2009 at 03:38 UTC
If, as your sample data suggests, the key numbers are strictly ascending, though not sequential, and your records are fixed length (or can easily be made so with a fast one-pass one-liner), then you can find your values very quickly (O(1) time and space!), by calculating the approximate position and then seeking forward or backward from that starting point.
The following crude demo code:
#! perl -sw
use 5.010;
use strict;
use Time::HiRes qw[ time ];

my $start  = time;
my $recLen = 43;                ## fixed record length in bytes

open I, '<', '768941.dat' or die $!;

## Establish the key range from the first and last records
my $first = ( split ' ', scalar <I> )[ 0 ];
seek I, -$recLen, 2;            ## SEEK_END
my $last  = ( split ' ', scalar <I> )[ 0 ];
say "$first - $last";

## Records per unit of key value, used to interpolate a start position
my $recs   = tell( I ) / $recLen;
my $factor = $recs / ( $last - $first );

my $success = 0;
while( my $target = <> ) {
    chomp $target;
    warn "Input outside range"
        if $target > $last or $target < $first;

    ## Interpolate the approximate byte offset of the target record
    my $offset = int( ( $target - $first ) * $factor ) * $recLen;
    my $rec;
    seek I, $offset, 0 or die "Seek failed\n";
    my $found = ( split ' ', $rec = <I> )[ 0 ];

    if( $found != $target ) {
        ## Step record by record towards the target. Reading already
        ## advances the handle one record, so the forward scan needs no
        ## extra seek; the backward scan must step back two records.
        my $direction = $found < $target;
        $offset = $direction ? 0 : ( -$recLen * 2 );

        seek I, $offset, 1 or die "Seek failed"    ## SEEK_CUR
            while ( $found = ( split ' ', $rec = <I> )[ 0 ] )
                and $direction ? ( $found < $target ) : ( $found > $target );
    }
    if( $found == $target ) {
        print "Found: $rec";
        ++$success;
    }
    else {
        warn "Target $target not found\n";
    }
}
close I;
printf STDERR "Found $success records in %.3f seconds\n", time() - $start;
Looks up 1000 (existing) records:
c:\test>wc -l 768941.search
1000 768941.search
c:\test>head 768941.search
1101077781160
1101077781161
1101077781162
1101077781163
1101077781164
1101077781165
1101077781166
1101077781167
1101077781168
1101077781169
in an approximation of your data file, in two or three hundredths of a second:
c:\test>768941 768941.search >nul
Found 1000 records in 0.02 seconds
c:\test>768941 768941.search >nul
Found 1000 records in 0.03 seconds
c:\test>768941 768941.search >nul
Found 1000 records in 0.03 seconds
But that is with a warm cache. It took a whole 1/10th of a second the first time :) That's faster than you could make a connection to most databases.
The trouble with people who have sledgehammers is that everything looks like a tough nut.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
That is hot and I am, as usual, amazed at your comp-sci chops. I cobbled together a DB_File example because I was curious whether it would be slower.
It's not a 1:1 comparison because it's a straight key lookup, not a real search like yours; but that's what I took the OP to want/need. I only built my test file up to .75GB-ish because of a lack of headroom on my drive (I only have 2GB left and I had to build two of these files, so...).
Fake data generator-
my @what = qw( bothChaff bothDegen oneChaff oneDegen );
open my $fh, ">", "data.data" or die;
for ( 10000000 .. 99999999 )
{
    next if rand(10) > 6.7;                     # skip ~1/3 of the IDs to leave gaps
    print $fh sprintf( "11010%d\t11010%d\t%s\n",
                       $_, $_ + 3333, $what[ rand @what ] );
    last if -s $fh > 1_000 * 1_000 * 1_000;     # stop at ~1GB
}
Search + search DB builder-
use DB_File;
use strict;
use Time::HiRes qw[ time ];

my $start = time;
my %A;
## Tie %A to an on-disk hash: lookups read the file, not RAM
tie %A, "DB_File", "db.file", O_CREAT|O_RDWR, 0666, $DB_HASH or die;

if ( $ARGV[0] eq "build" )
{
    ## One-off build pass: key = first column, value = rest of the line
    open my $fh, "<", "data.data" or die;
    while (<$fh>)
    {
        chomp;
        my ( $key, $val ) = split /\s+/, $_, 2;
        $A{$key} = $val;
    }
}
else
{
    print $A{$ARGV[0]} || "nothing found", $/;
    printf "Took %.2f seconds\n", time() - $start;
}
--
moo@cow[331]~/bin>pm-768941 1101078637800
1101078641133 oneChaff
Took 0.02 seconds
I got 0.02 seconds on almost every run, and this is on a nine-year-old computer and, I think, a five-year-old copy of the related Berkeley binaries. I turned the printf to .3f and it was mostly ranging from 0.016 to 0.019 seconds. A caveat being that it took a loooooooong time to build the DB file. I wasn't clocking it but it was edging up on an hour.
<rejoinder type="in good fun!">
The trouble with engineers who are smarter than most everyone else is that given Y they solve for X where X is a more interesting problem than Y.
</rejoinder>
it's a straight key lookup, not a real search like yours; but that's what I took the OP to want/need.
I think the OP just needs to find the appropriate records in a timely fashion; whether that's done by indexing the file or by searching it doesn't really matter. Though the hour to build the index could be critical if the file changes with any frequency; if, say, it's the output from some other process or system.
A rather more important difference (assuming I'm reading your code and console log correctly) is that your 0.016 - 0.019 seconds is the time taken to look up one value.
My timings:
c:\test>768941 768941.search >nul
Found 1000 records in 0.032 seconds
c:\test>768941 768941.search >nul
Found 1000 records in 0.020 seconds
c:\test>768941 768941.search >nul
Found 1000 records in 0.036 seconds
c:\test>768941 768941.search >nul
Found 1000 records in 0.020 seconds
c:\test>768941 768941.search >nul
Found 1000 records in 0.023 seconds
are the time to find 1000 values! That works out to roughly 20 - 36 microseconds per lookup.
given Y they solve for X where X is a more interesting problem than Y.
Given 1000 keys, he wanted to find the corresponding values in a file containing ~38e6 pairings as quickly as possible. That's all my code does. I don't understand why you think I'm doing something different?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
It's been a long time since I learned big-O notation, but IMHO your code can only be O(1) if there are no gaps in the data.
Just consider the worst case of IDs (1, m/2, m/2+1, ..., m), i.e. n = m/2 + 2 entries, where you're looking for the second entry, x = m/2, sitting right after the maximum possible gap. Interpolation starts you near the middle of the file, where the values are around 3m/4, and you then seek linearly back to m/2: that's about n/2 steps, which means O(n). (Concretely, with m = 1000 the file holds 1, 500, 501, ..., 1000, so n = 502; searching for 500 lands you near record 250, whose value is about 750, and you scan back roughly 250 records.)
If you improve your code to recursively re-interpolate the next position within the narrowed range, instead of scanning linearly, you'll get it down to O(log(n)), as with a binary search.
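For illustration, a minimal, untested sketch of that refinement over the same fixed-length file (key_at is a made-up helper; $recLen = 43 is carried over from the demo above):
use strict;
use warnings;

my $recLen = 43;                 # record length from the demo above

sub key_at {                     # hypothetical helper: key of record $i
    my ( $fh, $i ) = @_;
    seek $fh, $i * $recLen, 0 or die "Seek failed: $!";
    return ( split ' ', scalar <$fh> )[ 0 ];
}

## Repeatedly interpolate within the narrowed [$lo, $hi] record range
## instead of scanning linearly away from the first guess.
sub interp_search {
    my ( $fh, $target, $lo, $hi ) = @_;
    while ( $lo <= $hi ) {
        my ( $kLo, $kHi ) = ( key_at( $fh, $lo ), key_at( $fh, $hi ) );
        return undef if $target < $kLo or $target > $kHi;
        my $i = $kHi > $kLo
            ? $lo + int( ( $target - $kLo ) / ( $kHi - $kLo ) * ( $hi - $lo ) )
            : $lo;
        my $k = key_at( $fh, $i );
        return $i if $k == $target;                 # record index found
        $k < $target ? ( $lo = $i + 1 ) : ( $hi = $i - 1 );
    }
    return undef;                                   # key not in file
}

## e.g. my $idx = interp_search( $fh, $target, 0, $recs - 1 );
Strictly speaking, plain re-interpolation like this is O(log log n) on average only for evenly distributed keys; to guarantee O(log n) in the worst case you'd mix in an ordinary bisection step whenever the interpolated guess makes too little progress.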
Furthermore, you're only comparing the entries from the first column, aren't you?
The headline claim of O(1) time is based on comparison with Perl's hashes, which also do a linear search down a bucket chain to resolve hash clashes, but are generally rated as O(1).
For your "worst case" scenario to be so, it would require the file to contain 38e6 - 1 consecutive numbers and a single outlier 38e6 away from one end. My conclusion was that that is an unlikely scenario for real data; my assumption is a reasonably even distribution of the values through the range.
Furthermore, you're only comparing the entries from the first column, aren't you?
I do not understand the question.
I take the OP's "Let say I need to know the pair for 1101077781160" to mean that, given the input '1101077781160', he requires the other two values on that line, e.g. '1101077783656 bothChaff'.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
As promised, I did a brute-force calculation of the odds that this algorithm is fast.
Extrapolating the example from the OP to a 1.5 GB file and randomly skipping 1 out of 4 lines, it seems you'll miss the searched line by more than 1200 entries on average before you start searching linearly.
I stopped after 2 test runs since my laptop with 1 GB RAM had by then been swapping heavily for 30 minutes(*)... I'll post the completed results tomorrow morning. : )
Temporary conclusion: over 1200 seeks doesn't even look close to O(1) to me; could it be your test data was a little optimistic? I hope you understand my criticism now.
Please note: this is the very optimistic best case, for only a 25% gap probability. I didn't even try to insert any worsening conditions yet!
So much for the normal average case.
(*) I could boost the test performance by avoiding building up the @data array and just approximating the accuracy of the interpolation within the loop, but I'm already too bored now!
CODE:
OUTPUT:
Re: Rapid text searches
by Your Mother (Archbishop) on Jun 05, 2009 at 22:47 UTC
If this is a write once / read many situation, I'd recommend Berkeley DBs; BerkeleyDB, DB_File. It would be quite slow to build, but after that look-ups are, as a coworker used to say, greasy-fast.
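The basic shape is just a tied hash; a minimal sketch (the file name and sample pair are made up):
use strict;
use warnings;
use DB_File;            # ships with Perl; exports $DB_HASH
use Fcntl;              # the O_* flags

## Tie %lookup to an on-disk hash: every read/write goes to the file,
## so nothing needs to fit in RAM.
my %lookup;
tie %lookup, 'DB_File', 'pairs.db', O_CREAT | O_RDWR, 0666, $DB_HASH
    or die "Cannot tie pairs.db: $!";

$lookup{1101077781160} = "1101077783656\tbothChaff";  # build once (slow)...
print $lookup{1101077781160}, "\n";                   # ...look up fast, many times
untie %lookup;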
Re: Rapid text searches
by GrandFather (Saint) on Jun 05, 2009 at 23:30 UTC
Your Mother's suggestion of using a database is likely to be the best answer, but if you want to speed up a search for multiple records using the text file itself, you can do something like:
use strict;
use warnings;

my @wanted = qw(1101077781172 1101077781180 1101077781183);
my %match  = map { $_ => 1 } @wanted;    # hash for O(1) membership tests

while (<DATA>) {
    chomp;
    my @fields = split;
    next unless @fields >= 3;            # skip malformed lines
    print "$_\n" if exists $match{ $fields[0] };
}
__DATA__
1101077781160 1101077783656 bothChaff
1101077781161 1101077783657 bothChaff
1101077781162 1101077783658 bothChaff
...
Prints:
1101077781172 1101077783668 bothChaff
1101077781180 1101077783676 oneDegen
True laziness is hard work
Re: Rapid text searches
by LanX (Saint) on Jun 06, 2009 at 00:23 UTC
OK, let me describe how I see your problem: you have consecutive IDs in two columns, and you're looking for the lines where a specific ID occurs in one of the columns?
I assume it's more complicated than in your example and gaps are possible, so you can't just trivially calculate the right lines to look for?
And binary search is not an option either?
With a strict ordering it's like a telephone book: no need to start reading from the first page; first look for the right city and the starting letters.
I would build up a hash of hashes indexing ranges of line numbers for each column.
With 1101077781160 you'll look up $hash1{11010}, where you get a second hash; looking up $hash2{77781} tells you the range to search for "160".
The first-level hash (the city) should have a size that's easily kept in RAM; the second-level hashes should be loaded on demand (and some - maybe the last hundred - kept cached).
Of course you could have more levels of hashes, and I'm not sure about the best way to make hashes persistent and quickly loaded, but those are CPAN details.
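A rough, untested sketch of that idea for the first column (the 5+5 digit split and the file name are assumptions, and keeping the whole index in RAM rather than loading second-level hashes on demand is a simplification):
use strict;
use warnings;

## Build the two-level index: the first 5 digits pick the "city",
## the next 5 digits pick the bucket of byte offsets to search.
my %index;
open my $fh, '<', 'data.data' or die $!;
my $offset = 0;
while ( my $line = <$fh> ) {
    my ( $id ) = split ' ', $line;
    push @{ $index{ substr $id, 0, 5 }{ substr $id, 5, 5 } }, $offset;
    $offset = tell $fh;
}

## Lookup: two hash probes narrow ~38e6 lines to at most 1000 offsets
## (only the last 3 digits vary within a bucket).
my $key  = '1101077781160';
my $hits = $index{ substr $key, 0, 5 }{ substr $key, 5, 5 } || [];
for my $pos ( @$hits ) {
    seek $fh, $pos, 0 or die $!;
    my $rec = <$fh>;
    print $rec if ( split ' ', $rec )[ 0 ] eq $key;
}
close $fh;
Persisting each second-level hash to its own file (with Storable or DB_File, say) would give you the load-on-demand part.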
Re: Rapid text searches
by JavaFan (Canon) on Jun 06, 2009 at 16:32 UTC
It seems that you want to search columns for exact matches. If you know all the things you are looking for beforehand, I would create a hash with the items you are searching for. So, for instance, if you are looking for 1101077781160, 1101077783669, and 1101077781182, I'd start off with
my %queries;
$queries{$_} = 1 for qw(1101077781160 1101077783669 1101077781182);
Then I'd iterate over the file, split each line into fields, and test each field against the hash previously built.
This avoids multiple passes over the file (as you would make with grep) and avoids loading the entire file into a hash.
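A minimal sketch of that loop (the file name is assumed; it tests every field, so the third column is checked too, harmlessly):
use strict;
use warnings;

my %queries;
$queries{$_} = 1 for qw(1101077781160 1101077783669 1101077781182);

## Single pass: print any line with a queried value in any field.
open my $fh, '<', 'data.data' or die $!;
while ( my $line = <$fh> ) {
    my @fields = split ' ', $line;
    print $line if grep { exists $queries{$_} } @fields;
}
close $fh;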
And if I only looking for a single entry, perl script will still load everything into memory before it could look for the string I want?
You tell us instead of asking! You wrote the code. If you wrote your code so that it first slurps in the entire file, then the answer is yes. If you wrote it so that it doesn't, then the answer is no.