in reply to Re: Rapid text searches ( O(1) space and time)
in thread Rapid text searches

That is hot and I am, as usual, amazed at your comp-sci chops. I cobbled together a DB_File example because I was curious whether it would be slower.

It's not a 1:1 comparison because it's a straight key lookup, not a real search like yours; but that's what I took the OP to want/need. I only built my test file up to ~0.75GB because of a lack of headroom on my drive (I only have 2GB left and I had to build two of these files, so...).

Fake data generator-

    my @what = qw( bothChaff bothDegen oneChaff oneDegen );
    open my $fh, ">", "data.data" or die;
    for ( 10000000 .. 99999999 ) {
        next if rand(10) > 6.7;
        print $fh sprintf("11010%d\t11010%d\t%s\n", $_, $_+3333, $what[rand@what]);
        last if -s $fh > 1_000 * 1_000 * 1_000;
    }

Search + search DB builder-

    use strict;
    use DB_File;
    use Fcntl;    # for O_CREAT and O_RDWR
    use Time::HiRes qw[ time ];

    my $start = time;
    my %A;
    tie %A, "DB_File", "db.file", O_CREAT|O_RDWR, 0666, $DB_HASH or die;

    if ( $ARGV[0] eq "build" ) {
        open my $fh, "<", "data.data" or die;
        while (<$fh>) {
            chomp;
            my ( $key, $val ) = split /\s+/, $_, 2;
            $A{$key} = $val;
        }
    }
    else {
        print $A{$ARGV[0]} || "nothing found", $/;
        printf "Took %.2f seconds\n", time() - $start;
    }

--
    moo@cow[331]~/bin> pm-768941 1101078637800
    1101078641133	oneChaff
    Took 0.02 seconds

I got 0.02 seconds on almost every run, and this is on a nine-year-old computer with, I think, a five-year-old copy of the related Berkeley binaries. I changed the printf to %.3f and it mostly ranged from 0.016 to 0.019 seconds. One caveat: it took a loooooooong time to build the DB file. I wasn't clocking it, but it was edging up on an hour.
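If the build time matters, one knob worth trying for bulk loads (a sketch only, not something I benchmarked; the cache value is an arbitrary example) is enlarging the Berkeley DB hash cache before tying, via the HASHINFO hash that DB_File exposes as $DB_HASH:

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw( O_CREAT O_RDWR );

# Sketch only: raise the Berkeley DB cache before a bulk load so fewer
# pages get flushed to disk mid-build. 10_000_000 bytes is an arbitrary
# example value; tune it to the RAM you can spare.
$DB_HASH->{'cachesize'} = 10_000_000;

my %A;
tie %A, "DB_File", "db.file", O_CREAT | O_RDWR, 0666, $DB_HASH
    or die "tie failed: $!";

# ...bulk insert loop would go here, same as the builder above...
$A{"1101012345678"} = "1101012349011\tbothChaff";
```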

<rejoinder type="in good fun!"> The trouble with engineers who are smarter than most everyone else is that given Y they solve for X where X is a more interesting problem than Y. </rejoinder>

Re^3: Rapid text searches ( O(1) space and time)
by BrowserUk (Patriarch) on Jun 06, 2009 at 07:09 UTC
    it's a straight key lookup, not a real search like yours; but that's what I took the OP to want/need.

    I think the OP just needs to find the appropriate records in a timely fashion; whether that's done by indexing the file or by searching it doesn't really matter. Though the hour to build the index could be critical if the file changes with any frequency. If, say, it's the output from some other process or system.

    A rather more important difference (assuming I'm reading your code and console log correctly) is that your 0.016 - 0.019 seconds is the time taken to look up one value.

    My timings:

    c:\test>768941 768941.search >nul
    Found 1000 records in 0.032 seconds

    c:\test>768941 768941.search >nul
    Found 1000 records in 0.020 seconds

    c:\test>768941 768941.search >nul
    Found 1000 records in 0.036 seconds

    c:\test>768941 768941.search >nul
    Found 1000 records in 0.020 seconds

    c:\test>768941 768941.search >nul
    Found 1000 records in 0.023 seconds

    are the time to find 1000 values!

    given Y they solve for X where X is a more interesting problem than Y.

    Given 1000 keys, he wanted to find the corresponding values in a file containing ~38e6 pairings as quickly as possible. That's all my code does. I don't understand why you think I'm doing something different?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I don't understand why you think I'm doing something different?

      Oh, I don't really. The OP said "hash" and "single entry," so I felt like a Berkeley DB was the right answer (if the up-front cost of building it is acceptable). Your solution is more flexible and really cool. I just thought my suggestion wasn't a "sledgehammer" but possibly a perfect fit. :)

      Update: I'll see if I can post a 1,000-search run too; I have to rebuild the thing, so it might be later.

      Update to the update: it was about 9 times slower to do 1,000 look-ups, clocking in at 0.15 seconds for the most part (1.42 GHz G4 Mac).
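      For anyone wanting to reproduce a batch test like that, here is a minimal sketch of what a timed multi-key lookup against a DB_File hash could look like. This is illustrative only, not the poster's actual test: the sample keys, values, and temp file are invented, and a real run would read its 1,000 keys from a file.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw( O_CREAT O_RDWR );
use Time::HiRes qw( time );
use File::Temp qw( tempfile );

# Illustrative only: build a tiny DB_File hash, then time a batch of
# straight key lookups against it. Keys and values are made up.
my ( $tfh, $dbfile ) = tempfile();
my %A;
tie %A, "DB_File", $dbfile, O_CREAT | O_RDWR, 0666, $DB_HASH
    or die "tie failed: $!";

$A{"1101012345678"} = "1101012349011\tbothChaff";
$A{"1101087654321"} = "1101087657654\toneDegen";

# Two keys that exist, one that doesn't.
my @keys  = ( "1101012345678", "1101087654321", "9999999999999" );
my $start = time;
my $found = grep { defined $A{$_} } @keys;    # scalar grep counts hits
printf "Found %d of %d keys in %.3f seconds\n",
    $found, scalar @keys, time() - $start;
```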