joomanji has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have a question about rapid text searches in Perl. I have a text file with 3 columns, as shown below. The file can be very large, exceeding 1.5GB in size. Currently I'm using grep to look for the data I need. Say I need to know the pair for 1101077781160: I would use grep to look for it. However, when the input gets bigger, say 1000 entries to grep for, the process becomes very slow!

The question is: if I use Perl to store the text in a hash, can I dramatically speed up the search?

And if I'm only looking for a single entry, will the Perl script still load everything into memory before it can look for the string I want?
Thank you for any input; I would really like to know how I can speed up the text searches... thank you!
1101077781160 1101077783656 bothChaff
1101077781161 1101077783657 bothChaff
1101077781162 1101077783658 bothChaff
1101077781163 1101077783659 bothChaff
1101077781164 1101077783660 oneChaff
1101077781165 1101077783661 oneChaff
1101077781166 1101077783662 bothChaff
1101077781167 1101077783663 bothChaff
1101077781168 1101077783664 bothChaff
1101077781171 1101077783667 bothChaff
1101077781172 1101077783668 bothChaff
1101077781173 1101077783669 bothChaff
1101077781175 1101077783671 bothChaff
1101077781176 1101077783672 bothChaff
1101077781177 1101077783673 bothChaff
1101077781179 1101077783675 bothChaff
1101077781180 1101077783676 oneDegen
1101077781181 1101077783677 bothChaff
1101077781182 1101077783678 oneChaff
1101077781184 1101077783680 bothChaff

Replies are listed 'Best First'.
Re: Rapid text searches ( O(1) space and time)
by BrowserUk (Patriarch) on Jun 06, 2009 at 03:38 UTC

    If, as your sample data suggests, the key numbers are strictly ascending, though not sequential, and your records are fixed length (or can easily be made so with a fast one-pass one-liner), then you can find your values very quickly (O(1) time and space!), by calculating the approximate position and then seeking forward or backward from that starting point.
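
    One possible shape for that "fast one-pass one-liner" (only a sketch; the field widths and file names below are illustrative, and you would pick widths that fit the real data):

    # Pad every field to a constant width so each record, newline included,
    # has the same length.
    perl -ane 'printf "%-13s %-13s %-13s\n", @F' 768941.raw > 768941.dat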

    The following crude demo code:

    #! perl -sw
    use 5.010;
    use strict;
    use Time::HiRes qw[ time ];

    my $start = time;
    my $recLen = 43;

    open I, '<', '768941.dat' or die $!;
    my $first = ( split ' ', scalar <I> )[ 0 ];
    seek I, -$recLen, 2;                                ## SEEK_END
    my $last = ( split ' ', scalar <I> )[ 0 ];
    say "$first - $last";

    my $recs   = tell( I ) / $recLen;
    my $factor = $recs / ( $last - $first );

    my $success = 0;
    while( my $target = <> ) {
        chomp $target;
        warn "Input outside range" if $target > $last or $target < $first;

        my $offset = int( ( $target - $first ) * $factor ) * $recLen;
        my $rec;
        seek I, $offset, 0 or die "Seek failed\n";

        my $found = ( split ' ', $rec = <I> )[ 0 ];
        if( $found != $target ) {
            my $direction = $found < $target;
            $offset = $direction ? $recLen : ( -$recLen * 2 );
            seek I, $offset, 1 or die "Seek failed"     ## SEEK_CUR
                while ( $found = ( split ' ', $rec = <I> )[ 0 ] )
                    and $direction ? ( $found < $target ) : ( $found > $target );
        }
        if( $found == $target ) {
            print "Found: $rec";
            ++$success;
        }
        else {
            warn "Target $target not found\n";
        }
    }
    close I;

    printf STDERR "Found $success records in %.3f seconds\n", time() - $start;

    Looks up 1000 (existing) records:

    c:\test>wc -l 768941.search
    1000 768941.search

    c:\test>head 768941.search
    1101077781160
    1101077781161
    1101077781162
    1101077781163
    1101077781164
    1101077781165
    1101077781166
    1101077781167
    1101077781168
    1101077781169

    in an approximation of your data file:

    In two or three hundredths of a second:

    c:\test>768941 768941.search >nul
    Found 1000 records in 0.02 seconds

    c:\test>768941 768941.search >nul
    Found 1000 records in 0.03 seconds

    c:\test>768941 768941.search >nul
    Found 1000 records in 0.03 seconds

    But that is with a warm cache. It took a whole 1/10th of a second the first time :) That's faster than you could make a connection to most databases.

    The trouble with people who have sledgehammers, everything looks like a tough nut.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      That is hot and I am, as usual, amazed at your comp-sci chops. I cobbled together a DB_File example because I was curious if it would be slower.

      It's not a 1:1 comparison because it's a straight key lookup, not a real search like yours; but that's what I took the OP to want/need. I only built my test file up to .75GB-ish because of lack of headroom on my drive (I only have 2GB left and I had to build two of these files so...).

      Fake data generator-

      my @what = qw( bothChaff bothDegen oneChaff oneDegen );
      open my $fh, ">", "data.data" or die;
      for ( 10000000 .. 99999999 ) {
          next if rand(10) > 6.7;
          print $fh sprintf("11010%d\t11010%d\t%s\n", $_, $_ + 3333, $what[rand @what]);
          last if -s $fh > 1_000 * 1_000 * 1_000;
      }

      Search + search DB builder-

      use DB_File;
      use strict;
      use Time::HiRes qw[ time ];

      my $start = time;

      my %A;
      tie %A, "DB_File", "db.file", O_CREAT|O_RDWR, 0666, $DB_HASH or die;

      if ( $ARGV[0] eq "build" ) {
          open my $fh, "<", "data.data" or die;
          while (<$fh>) {
              chomp;
              my ( $key, $val ) = split /\s+/, $_, 2;
              $A{$key} = $val;
          }
      }
      else {
          print $A{$ARGV[0]} || "nothing found", $/;
          printf "Took %.2f seconds\n", time() - $start;
      }
      --
      moo@cow[331]~/bin>pm-768941 1101078637800
      1101078641133 oneChaff
      Took 0.02 seconds

      I got 0.02 seconds on almost every run, and this is on a nine-year-old computer and, I think, a five-year-old copy of the related Berkeley binaries. I turned the printf to %.3f and it mostly ranged from 0.016 to 0.019 seconds. A caveat: it took a loooooooong time to build the DB file. I wasn't clocking it, but it was edging up on an hour.

      <rejoinder type="in good fun!"> The trouble with engineers who are smarter than most everyone else is that given Y they solve for X where X is a more interesting problem than Y. </rejoinder>

        it's a straight key lookup, not a real search like yours; but that's what I took the OP to want/need.

        I think the OP just needs to find the appropriate records in a timely fashion; whether that's done by indexing the file or by searching the file doesn't really matter. Though the hour to build the index could be critical if the file changes with any frequency; if, say, it's the output from some other process or system.

        A rather more important difference (assuming I'm reading your code and console log correctly), is that your 0.016 - 0.019 seconds is the time taken to lookup one value.

        My timings:

        c:\test>768941 768941.search >nul
        Found 1000 records in 0.032 seconds
        c:\test>768941 768941.search >nul
        Found 1000 records in 0.020 seconds
        c:\test>768941 768941.search >nul
        Found 1000 records in 0.036 seconds
        c:\test>768941 768941.search >nul
        Found 1000 records in 0.020 seconds
        c:\test>768941 768941.search >nul
        Found 1000 records in 0.023 seconds

        are the time to find 1000 values!

        given Y they solve for X where X is a more interesting problem than Y.

        Given 1000 keys, he wanted to find the corresponding values in a file containing ~38e6 pairings as quickly as possible. That's all my code does. I don't understand why you think I'm doing something different?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      It's been a long time since I learned big-O notation, but IMHO your code can only be O(1) if there are no gaps within the data.

      Just consider the worst case of IDs within (1, (m/2) .. m), i.e. n = m/2 + 2 entries, where you're looking for the second entry x = m/2, right after the maximum possible gap. The interpolation factor is roughly n/m, about 1/2, so the first guess lands about m/4 records in, near the value 3/4 m, and you then seek linearly back down to m/2. That's about n/2 steps, which means O(n).

      If you improve your code to recursively interpolate the next position within the narrowed range, instead of seeking linearly, you'll get O(log(n)), as in a binary search.
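
      A rough sketch of that interpolation-within-the-narrowed-range idea, reusing the fixed-record-length assumption from above (the record length, file name, and helper names here are only illustrative, not the code under discussion):

      use strict;
      use warnings;

      my $recLen = 43;                                  # assumed fixed record length
      open my $fh, '<', '768941.dat' or die $!;
      my $lastRec = int( ( -s '768941.dat' ) / $recLen ) - 1;

      # Read the record at a given record number and return (key, whole line).
      sub key_at {
          my ( $fh, $rec ) = @_;
          seek $fh, $rec * $recLen, 0 or die "Seek failed";
          my $line = <$fh>;
          return ( split ' ', $line )[ 0 ], $line;
      }

      # Interpolation search: guess a position from the key distribution, then
      # narrow the [lo, hi] range around the guess instead of scanning linearly.
      sub interp_search {
          my ( $fh, $target, $lo, $hi ) = @_;
          while ( $lo <= $hi ) {
              my ( $loKey ) = key_at( $fh, $lo );
              my ( $hiKey ) = key_at( $fh, $hi );
              return undef if $target < $loKey or $target > $hiKey;
              my $guess = $lo;
              $guess += int( ( $hi - $lo ) * ( $target - $loKey ) / ( $hiKey - $loKey ) )
                  if $hiKey > $loKey;
              my ( $key, $line ) = key_at( $fh, $guess );
              return $line if $key == $target;
              if   ( $key < $target ) { $lo = $guess + 1 }
              else                    { $hi = $guess - 1 }
          }
          return undef;                                 # not found
      }

      my $hit = interp_search( $fh, 1101077781172, 0, $lastRec );
      print $hit // "not found\n";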

      Furthermore, you're only comparing the entries from the first column, aren't you?

      Cheers Rolf

        The headline claim of O(1) time is based on a comparison with Perl's hashes, which also do a linear search down a bucket chain to resolve hash clashes, but are generally rated as O(1).

        For your "worst case scenario to be so, it would require the file to contain 38e6-1 consequetive numbers and a single outlier 38e6 away from one end. My conclusion was that is an unlikely scenario for real data, My assumption is for a reasonably even distribution of the values through the range.

        Furthermore you're only comparing the entries from the first column aren't you?

        I do not understand the question?

        I take the OP's "Say I need to know the pair for 1101077781160" to mean that, given the input '1101077781160', he requires the other two values on that line, e.g. '1101077783656 bothChaff'.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      As promised, I did a brute-force calculation of the odds that this algorithm is fast.

      Extrapolating the example from the OP to a 1.5 GB file and randomly skipping 1 out of 4 lines, it seems you'll miss the searched-for line by more than 1200 entries on average before you start searching linearly.

      I stopped after 2 test runs since my laptop with 1 GB RAM had by then been heavily swapping for 30 minutes*... I'll post the completed results tomorrow morning. : )

      Temporary conclusion: over 1200 seeks doesn't look even close to O(1) to me. Could it be your test data was a little optimistic? Hope you understand my criticism now.

      Please note: this is the very optimistic best case, with only a 25% gap probability. I didn't even try to add any worsening conditions yet!

      So much for the normal average case.

      Cheers Rolf

      (*) I could boost the test performance by not building up the @data array and just approximating the accuracy of the interpolation within the loop, but I'm too bored for that now!

      CODE:

      OUTPUT:

Re: Rapid text searches
by Your Mother (Archbishop) on Jun 05, 2009 at 22:47 UTC

    If this is a write-once / read-many situation, I'd recommend Berkeley DBs: BerkeleyDB, DB_File. It would be quite slow to build, but after that look-ups are, as a coworker used to say, greasy-fast.
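
    A minimal sketch of that setup using the BerkeleyDB module (the file name and key below are just examples): build the tied hash once from the text file, then look keys up directly.

    use strict;
    use warnings;
    use BerkeleyDB;

    # Tie a hash to an on-disk Berkeley DB hash file (created if it doesn't exist).
    tie my %pairs, 'BerkeleyDB::Hash',
        -Filename => 'pairs.db',
        -Flags    => DB_CREATE
        or die "Cannot open pairs.db: $BerkeleyDB::Error";

    # Build phase (slow, done once): key on the first column.
    # while (<$data>) { my ( $k, $rest ) = split ' ', $_, 2; $pairs{$k} = $rest; }

    # Lookup phase: a straight hash fetch.
    print $pairs{1101077781160} // "not found\n";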

Re: Rapid text searches
by GrandFather (Saint) on Jun 05, 2009 at 23:30 UTC

    Your Mother's suggestion of using a database is likely to be the best answer, but if you want to speed up a search for multiple records using the text file itself, you can do something like:

    use strict;
    use warnings;

    my @wanted = qw(1101077781172 1101077781180 1101077781183);
    my %match = map {$_ => 1} @wanted;

    while (<DATA>) {
        chomp;
        my (@fields) = split;
        next unless @fields >= 3;
        print "$_\n" if exists $match{$fields[0]};
    }

    __DATA__
    1101077781160 1101077783656 bothChaff
    1101077781161 1101077783657 bothChaff
    1101077781162 1101077783658 bothChaff
    ...

    Prints:

    1101077781172 1101077783668 bothChaff
    1101077781180 1101077783676 oneDegen

    True laziness is hard work
Re: Rapid text searches
by LanX (Saint) on Jun 06, 2009 at 00:23 UTC
    OK, let me describe how I see your problem: you have consecutive IDs in two columns, and you're looking for the lines where a particular ID occurs in one of the columns?

    I assume it's more complicated than in your example and gaps are possible, so you can't just trivially calculate the right lines to look for?

    And a binary search isn't an option either?

    With a strict ordering, it's like a telephone book: no need to start reading from the first page; you first look for the right city and then the starting letters.

    I would build up a hash of hashes for indexing ranges of line numbers for each column.

    With 1101077781160 you'd look up $hash1{11010}, which gives you a second hash in which to look up $hash2{77781}, which tells you the range to search for "160".

    The first-level hash (the city) should have a size that's easily kept in RAM; the second-level hashes should be loaded on demand (and some, maybe the last hundred, kept cached).

    Of course you could have more levels of hashes, and I'm not sure about the best way to make hashes persistent and quickly loadable, but these are CPAN details.
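
    A rough sketch of how such a two-level index might look in plain Perl, assuming 13-digit IDs split 5 + 5 + 3 as above and indexing only the first column; the persistence / on-demand loading part is left out, and the file name is just an example:

    use strict;
    use warnings;

    my %index;    # $index{city}{district} = [ first byte offset, last byte offset ]

    # Build the index: one pass over the file, remembering where each
    # "city"/"district" prefix group starts and ends.
    open my $fh, '<', 'data.txt' or die $!;
    while ( 1 ) {
        my $pos  = tell $fh;
        my $line = <$fh>;
        last unless defined $line;
        my ( $id ) = split ' ', $line;
        my ( $city, $district ) = ( substr( $id, 0, 5 ), substr( $id, 5, 5 ) );
        my $range = $index{$city}{$district} //= [ $pos, $pos ];
        $range->[1] = $pos;
    }

    # Lookup: seek straight to the small range and scan only that slice.
    sub lookup {
        my ( $fh, $target ) = @_;
        my ( $city, $district ) = ( substr( $target, 0, 5 ), substr( $target, 5, 5 ) );
        my $range = $index{$city}{$district} or return;
        seek $fh, $range->[0], 0 or die "Seek failed";
        while ( my $line = <$fh> ) {
            my ( $id, $rest ) = split ' ', $line, 2;
            return $rest if $id eq $target;
            last if tell( $fh ) > $range->[1];
        }
        return;
    }

    print lookup( $fh, '1101077781160' ) // "not found\n";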

    Cheers Rolf

Re: Rapid text searches
by JavaFan (Canon) on Jun 06, 2009 at 16:32 UTC
    It seems that you want to search columns for exact matches. If you know all the things you are looking for beforehand, I would create a hash with the items you are searching for. So, for instance, if you are looking for 1101077781160, 1101077783669, and 1101077781182, I'd start off with
    my %queries;
    $queries{$_} = 1 for qw(1101077781160 1101077783669 1101077781182);
    Then I'd iterate over the file, split each line into fields, and test each field against the hash previously built.

    This avoids multiple passes over the file (as you would have with grep), and avoids loading the entire file into a hash.
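
    A minimal sketch of that loop (the file name and query values are just examples):

    use strict;
    use warnings;

    my %queries;
    $queries{$_} = 1 for qw(1101077781160 1101077783669 1101077781182);

    open my $fh, '<', 'data.txt' or die $!;
    while ( my $line = <$fh> ) {
        my @fields = split ' ', $line;
        # Print the line if any field matches one of the queries.
        print $line if grep { $queries{$_} } @fields;
    }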

    And if I'm only looking for a single entry, will the Perl script still load everything into memory before it can look for the string I want?
    You tell us instead of asking! You wrote the code. If you wrote your code so it first slurps in the entire file, then the answer is yes. If you wrote it so it doesn't, then the answer is no.