in reply to Rapid text searches

If, as your sample data suggests, the key numbers are strictly ascending (though not necessarily sequential), and your records are fixed length (or can easily be made so with a fast one-pass one-liner), then you can find your values very quickly (O(1) time and space!) by calculating the approximate position of each key and then seeking forward or backward from that starting point.
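
As a rough worked example (the 50-million key span is invented purely for illustration; the 43-byte record length, the first key and the ~37.8 million record count come from the run shown below): if the last key were 50,000,000 above the first, then

    factor = 37,800,374 records / 50,000,000 key units ~= 0.756
    target = first + 1,000,000
    offset = int( 1,000,000 * 0.756 ) * 43 = 32,508,000 bytes

Seek to that byte offset, read one 43-byte record, and step forward or backward a record at a time until the key matches.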

The following crude demo code:

#! perl -sw
use 5.010;
use strict;
use Time::HiRes qw[ time ];

my $start = time;
my $recLen = 43;

open I, '<', '768941.dat' or die $!;

my $first = ( split ' ', scalar <I> )[ 0 ];

seek I, -$recLen, 2; ## SEEK_END
my $last = ( split ' ', scalar <I> )[ 0 ];
say "$first - $last";

my $recs   = tell( I ) / $recLen;
my $factor = $recs / ( $last - $first );

my $success = 0;
while( my $target = <> ) {
    chomp $target;
    warn "Input outside range" if $target > $last or $target < $first;

    my $offset = int( ( $target - $first ) * $factor ) * $recLen;

    my $rec;
    seek I, $offset, 0 or die "Seek failed\n";
    my $found = ( split ' ', $rec = <I> )[ 0 ];

    if( $found != $target ) {
        my $direction = $found < $target;
        $offset = $direction ? $recLen : ( -$recLen * 2 );

        seek I, $offset, 1 or die "Seek failed" ## SEEK_CUR
            while ( $found = ( split ' ', $rec = <I> )[ 0 ] )
                and $direction ? ( $found < $target ) : ( $found > $target );
    }

    if( $found == $target ) {
        print "Found: $rec";
        ++$success;
    }
    else {
        warn "Target $target not found\n";
    }
}
close I;

printf STDERR "Found $success records in %.3f seconds\n", time() - $start;

Looks up 1000 (existing) records:

c:\test>wc -l 768941.search
1000 768941.search

c:\test>head 768941.search
1101077781160
1101077781161
1101077781162
1101077781163
1101077781164
1101077781165
1101077781166
1101077781167
1101077781168
1101077781169

in an approximation of your data file:

c:\test>dir 768941.dat
06/06/2009  03:55     1,625,416,082 768941.dat

c:\test>head 768941.dat
1101077781160 1101077783656 bothDegen
1101077781161 1101077783657 bothChaff
1101077781162 1101077783658 bothDegen
1101077781163 1101077783659 bothChaff
1101077781164 1101077783660 oneChaff
1101077781165 1101077783661 bothChaff
1101077781166 1101077783662 oneDegen
1101077781167 1101077783663 bothDegen
1101077781168 1101077783664 bothChaff
1101077781169 1101077783665 bothDegen

c:\test>wc -l 768941.dat
37800374 768941.dat

In two or three hundredths of a second:

c:\test>768941 768941.search >nul
Found 1000 records in 0.02 seconds

c:\test>768941 768941.search >nul
Found 1000 records in 0.03 seconds

c:\test>768941 768941.search >nul
Found 1000 records in 0.03 seconds

But that is with a warm cache. It took a whole 1/10th of a second the first time :) That's faster than you could make a connection to most databases.

The trouble with people who have sledgehammers is that everything looks like a tough nut.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: Rapid text searches ( O(1) space and time)
by Your Mother (Archbishop) on Jun 06, 2009 at 06:07 UTC

    That is hot, and I am, as usual, amazed at your comp-sci chops. I cobbled together a DB_File example because I was curious whether it would be slower.

    It's not a 1:1 comparison because it's a straight key lookup, not a real search like yours; but that's what I took the OP to want/need. I only built my test file up to 0.75 GB or so because of a lack of headroom on my drive (I only have 2 GB left, and I had to build two of these files, so...).

    Fake data generator-

    my @what = qw( bothChaff bothDegen oneChaff oneDegen );
    open my $fh, ">", "data.data" or die;
    for ( 10000000 .. 99999999 ) {
        next if rand(10) > 6.7;
        print $fh sprintf("11010%d\t11010%d\t%s\n", $_, $_+3333, $what[rand @what]);
        last if -s $fh > 1_000 * 1_000 * 1_000;
    }

    Search + search DB builder-

    use DB_File;
    use strict;
    use Time::HiRes qw[ time ];

    my $start = time;

    my %A;
    tie %A, "DB_File", "db.file", O_CREAT|O_RDWR, 0666, $DB_HASH or die;

    if ( $ARGV[0] eq "build" ) {
        open my $fh, "<", "data.data" or die;
        while (<$fh>) {
            chomp;
            my ( $key, $val ) = split /\s+/, $_, 2;
            $A{$key} = $val;
        }
    }
    else {
        print $A{$ARGV[0]} || "nothing found", $/;
        printf "Took %.2f seconds\n", time() - $start;
    }

    --
    moo@cow[331]~/bin>pm-768941 1101078637800
    1101078641133 oneChaff
    Took 0.02 seconds

    I got 0.02 seconds on almost every run, and this is on a nine-year-old computer with, I think, a five-year-old copy of the related Berkeley binaries. I turned the printf to %.3f and it mostly ranged from 0.016 to 0.019 seconds. One caveat: it took a loooooooong time to build the DB file. I wasn't clocking it, but it was edging up on an hour.

    <rejoinder type="in good fun!"> The trouble with engineers who are smarter than most everyone else is that given Y they solve for X where X is a more interesting problem than Y. </rejoinder>

      it's a straight key lookup, not a real search like yours; but that's what I took the OP to want/need.

      I think the OP just needs to find the appropriate records in a timely fashion; whether that's done by indexing the file or by searching it doesn't really matter. Though the hour to build the index could be critical if the file changes with any frequency; if, say, it's the output from some other process or system.

      A rather more important difference (assuming I'm reading your code and console log correctly) is that your 0.016 - 0.019 seconds is the time taken to look up one value.

      My timings:

      c:\test>768941 768941.search >nul
      Found 1000 records in 0.032 seconds

      c:\test>768941 768941.search >nul
      Found 1000 records in 0.020 seconds

      c:\test>768941 768941.search >nul
      Found 1000 records in 0.036 seconds

      c:\test>768941 768941.search >nul
      Found 1000 records in 0.020 seconds

      c:\test>768941 768941.search >nul
      Found 1000 records in 0.023 seconds

      are the time to find 1000 values!

      given Y they solve for X where X is a more interesting problem than Y.

      Given 1000 keys, he wanted to find the corresponding values in a file containing ~38e6 pairings as quickly as possible. That's all my code does. I don't understand why you think I'm doing something different?


        I don't understand why you think I'm doing something different?

        Oh, I don't really. The OP said "hash" and "single entry" so I felt like a Berkeley DB was the right answer (if the up-front cost of building it is acceptable). Your solution is more flexible and really cool. I just thought my suggestion wasn't a "sledgehammer" but possibly a perfect fit. :)

        Update: I'll see if I can post a 1,000 searches match too; I have to rebuild the thing so it might be later.

        Update to the update: it was about 9 times slower to do 1,000 look-ups, clocking in at 0.15 seconds for the most part (1.42 GHz G4 Mac).
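
        For what it's worth, a minimal sketch of how such a 1,000-lookup timing run might be bolted onto the DB_File code above (untested, and not the code actually used; the targets file is a placeholder read from @ARGV, one key per line, like BrowserUk's 768941.search):

        use DB_File;
        use strict;
        use Time::HiRes qw[ time ];

        # Open the hash built earlier by the "build" run.
        my %A;
        tie %A, "DB_File", "db.file", O_CREAT|O_RDWR, 0666, $DB_HASH or die;

        my $start = time;
        my $found = 0;

        # One target key per line.
        open my $fh, "<", $ARGV[0] or die $!;
        while ( my $key = <$fh> ) {
            chomp $key;
            ++$found if defined $A{$key};
        }

        printf "Found %d keys in %.3f seconds\n", $found, time() - $start;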

Re^2: Rapid text searches ( O(log n) time)
by LanX (Saint) on Jun 06, 2009 at 20:20 UTC
    It's been a long time since I learned big-O notation, but IMHO your code can only be O(1) if there are no gaps in the data.

    Just consider the worst case of IDs (1, m/2, m/2+1, .., m), i.e. n = m/2 + 2 records, where you're looking for the second entry, x = m/2, right after the maximum possible gap. The interpolation will start at approximately 3/4 m and then seek linearly back to m/2. That's about n/2 steps, which means O(n).

    If you improve your code to recursively interpolate the next position within the narrowed range, instead of seeking linearly, you'll get O(log n), as with binary search.
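
    A minimal sketch of what that narrowing-range interpolation might look like over the same fixed-length file (untested; the read_key helper and lexical filehandle are my own invention, the file name and 43-byte record length come from the demo above; replacing the interpolated probe with the midpoint of the range would give a plain binary search):

    use strict;
    use warnings;

    my $recLen = 43;
    my $file   = '768941.dat';
    open my $fh, '<', $file or die $!;
    my $nRecs = ( -s $file ) / $recLen;

    # Read the key (first column) of the record at index $idx.
    sub read_key {
        my $idx = shift;
        seek $fh, $idx * $recLen, 0 or die "Seek failed";
        return ( split ' ', scalar <$fh> )[ 0 ];
    }

    # Return the record index holding $target, or -1 if absent.
    sub interp_search {
        my $target = shift;
        my ( $lo, $hi ) = ( 0, $nRecs - 1 );

        while ( $lo <= $hi ) {
            my ( $loKey, $hiKey ) = ( read_key( $lo ), read_key( $hi ) );
            last if $target < $loKey or $target > $hiKey;

            # Interpolate a probe position inside the current range.
            my $pos = $lo;
            $pos += int( ( $target - $loKey ) * ( $hi - $lo ) / ( $hiKey - $loKey ) )
                if $hiKey > $loKey;

            my $key = read_key( $pos );
            return $pos if $key == $target;

            if   ( $key < $target ) { $lo = $pos + 1 }
            else                    { $hi = $pos - 1 }
        }
        return -1;
    }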

    Furthermore, you're only comparing the entries from the first column, aren't you?

    Cheers Rolf

      The headline claim for O(1) time is based on comparison with Perl's hashes which also do a linear search down a bucket chain to resolve hash clashes, but are generally rated as O(1).

      For your "worst case scenario to be so, it would require the file to contain 38e6-1 consequetive numbers and a single outlier 38e6 away from one end. My conclusion was that is an unlikely scenario for real data, My assumption is for a reasonably even distribution of the values through the range.

      Furthermore, you're only comparing the entries from the first column, aren't you?

      I do not understand the question?

      I take the OP's "Let say I need to know the pair for 1101077781160" to mean that, given the input '1101077781160', he requires the other two values on that line, e.g. '1101077783656 bothChaff'.


        Sorry, IMHO O() is usually calculated for the worst case; otherwise "traveling salesman" and similar problems wouldn't be considered NP-hard, even though there are well-known algorithms which are "usually" polynomial!

        You may want to calculate the complexity as a function of what you called $factor: with $factor == 1 it's O(1); with $factor == 1/2 it's O(n).

        Anyway, with binary search it's O(log n), which is the theoretical optimum for this class of problems, so O(1) would be considered a sensation in academic circles! IMHO binary search is usually faster than your algorithm.

        If you want to argue with something like average-case analysis you should indicate it clearly, since worst case is the usual approach.

        The average case is much more complicated to define and to calculate. You have no clue what the average case really looks like, without far more input from the OP.

        BTW: you can hardly compare this with collisions in hashes, since those should be the exception, not the rule.

        Cheers Rolf

        The headline claim for O(1) time is based on comparison with Perl's hashes which also do a linear search down a bucket chain to resolve hash clashes, but are generally rated as O(1).
        That is because it's expected that the chains are short, and independent of the number of keys in the hash.

        But your claim basically means that any sorted data structure without duplicates could, after linear processing to line up the elements, be searched in O(1) time. Your algorithm only achieves O(1) search time if the data is pretty uniformly spread between its extremes, and that's an exceptional case.

Re^2: Rapid text searches (complexity test)
by LanX (Saint) on Jun 12, 2009 at 13:10 UTC
    As promised, I did a brute-force calculation of the odds that this algorithm is fast.

    Extrapolating the example from the OP to a 1.5 GB file and randomly skipping 1 out of 4 lines, it seems you'll miss the searched line by more than 1200 entries on average before you start searching linearly.

    I stopped after 2 test runs since my laptop with 1 GB RAM had by then* been swapping heavily for 30 minutes ... I'll post the completed results tomorrow morning. : )

    Temporary conclusion: over 1200 seeks doesn't look even close to O(1) to me. Could it be your test data was a little optimistic? I hope you understand my criticism now.

    Please note: this is the very optimistic best case, with only a 25% gap probability. I didn't even try to add any worsening conditions yet!

    So much for the normal average case.

    Cheers Rolf

    (*) I could boost the test performance by not building up the @data array and just approximating the accuracy of the interpolation within the loop, but I'm already too bored now!

    CODE:

    OUTPUT:
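
    Purely for illustration (this is my own sketch, not the posted code), the kind of simulation described above might look roughly like this. The ~38e6 record count and the starting key come from the thread, the 25% skip rate from this post; the trial count is arbitrary, and building the key list at this scale needs several GB of RAM, so scale $nRecs down to taste.

    use strict;
    use warnings;

    # Build a key list roughly the size of the demo file, skipping ~1 in 4 ids.
    my $nRecs = 38_000_000;             # scale down to taste; this needs a lot of RAM
    my $key   = 1_101_077_781_160;      # first key from the OP's sample
    my @keys;
    while ( @keys < $nRecs ) {
        push @keys, $key unless rand() < 0.25;
        ++$key;
    }

    my ( $first, $last ) = @keys[ 0, -1 ];
    my $factor = @keys / ( $last - $first );

    # For a sample of existing keys, measure how far the interpolated
    # starting record is from the record that actually holds the key.
    my $trials    = 10_000;
    my $totalMiss = 0;
    for ( 1 .. $trials ) {
        my $idx    = int rand @keys;
        my $target = $keys[ $idx ];
        my $guess  = int( ( $target - $first ) * $factor );
        $totalMiss += abs( $guess - $idx );
    }

    printf "average miss: %.1f records over %d lookups\n",
        $totalMiss / $trials, $trials;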