in reply to Large Constant Database in Text File

With these volumes of data, 45,000 record/45 MB and 7 from 13 fields indexed using a Storable index, I achieved sub-1-second times to

  1. load both the data and index;
  2. Perform 1000 7-key searchs
  3. Retrieve 3 records for each of the 1000 matches.

Benchmark output:

c:\test>570563.pl 1 trial of load index (593.750ms total) 1000 trials of Perform 7-key search and retrieve 3 records (113.182ms +total), 113us/trial c:\test>570563.pl 1 trial of load index (625ms total) 1 trial of load datafile (94.282ms total) 1000 trials of Perform 7-key search and retrieve 3 records (117.603ms +total), 117us/trial c:\test>570563.pl Check memory 1 trial of load index (640.625ms total) 1 trial of load datafile (112.418ms total) 1000 trials of Perform 7-key search and retrieve 3 records (116.873ms +total), 116us/trial

Betcha can't achieve that performance with an RDBMS :)

POC code:

Flat file indexer (runs once) :

#! perl -slw use strict; use Storable; open DB, '<', '570563.dat' or die $!; my @index; my $offset = 0; while( <DB> ) { chomp; my @fields = split '\|'; push @{ $index[ $_ ]{ $fields[ $_ ] } }, $offset for 0,2,4,6,8,10,12; ## 7 x 100 byte fields indexed. $offset = tell DB; } close DB; store \@index, '570563.idx'; ## Index stored to file

Search & retrieval benchmark (999 sets of 7 search keys in __DATA section omitted for posting):

#! perl -slw use strict; use Benchmark::Timer; use Storable; my $T = new Benchmark::Timer; $T->start( 'load index' ); my $index = retrieve '570563.idx'; $T->stop( 'load index' ); $T->start( 'load datafile' ); my $ramfile; open my $dataFH, '<', '570563.dat' or die $!; sysread $dataFH, $ramfile, -s( '570563.dat' ); close $dataFH; open my $ramFH, '<', \$ramfile; $T->stop( 'load datafile' ); ## printf 'Check memory'; <STDIN>; while( <DATA> ) { $T->start( 'Perform 7-key search and retrieve 3 records' ); chomp; my @fields = split '\|'; my %matches; $matches{ $_ }++ for map{ @{ $index->[ $_ ]{ $fields[ $_ ] } } } 0, 2, 4, 6, 8 +, 10, 12; for ( grep{ $matches{ $_ }== 7 } keys %matches ) { seek( $ramFH, $_, 0 ); # print "$_: ", ## Disable printing for benc +hmark substr <$ramFH>, 0 , 100; } $T->stop( 'Perform 7-key search and retrieve 3 records' ); } $T->report; __DATA__ rlJUCO1XGHEA1lULImByZiPS7rIkJldqyQrc9gTvppxtOe3Ae6jIHbTnFKLCHZVco8T2lK +wz1HSnZ1bunO8gzwq9ftDtVWvjpuJU||Cg73CVHXVUiORGbVcXVVi0OueBikEIP7KvQKk +egTX6Co6wDqow5hRvYHEaWCrtkn1LZ6xR6Lnz1hmKSqOEy0WYBKOJ7tnRgJ7gYV||BgvL +wMdDqdQSyexHXtJWIhs5xPoePlH8I9I8AbZpOqpmFOD6N0PAzcgTxrr6BHBvSdFGD4t8H +m2o1mrhASjKXiaYz2HgrQBUBn3A||3gNrTZcXj9hyoVuBWfLxmwKqkaEqmErDJ7zDaJvI +QHrSoGgZm3Gzz1oTbg2lGB8m459jpzdEUIFanQHMAAfXuMPEjA9EdwgWAC7q||TYKbYAv +6m5vTvj7Hs7yIG86HeylN1nTkZgbc4P4MIULrYXHilMZQIJ0QtuNlt8IP45kCPhY52Ecr +f0LxdEYeCWw5vhzSoXOtCKAL||6GFr5Kwz1SvVaCChBoBQg6cJBGtvBcEIi8wbymtL8Jk +ISoz9lVA5S1hZepCJ3s3j5H9ui2LNfVaiKlSG6cpzeCajS710UxkNB3ym||jq27ex5Zwt +uVXyGQ2zvU4xchkvI8CdwudMX64tm196yMExgY7MmmYo8tzAo0hc5dcyrQDR5AfFdEQr7 +EnNPecuzlETBFD3Kn7YcP [999 similar records ommited for posting]

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Large Constant Database in Text File
by Anonymous Monk on Sep 04, 2006 at 12:12 UTC
    Thank you - 1 second searches sounds rather good!

    I have to say that its noticeable that the 2 responses that talk performance come from people comfortable with a non-relational approach...

    I'd be interested to hear from any of the relational database proposers of an equivalent relational database based solution to what's here and the performance delivered.

    I'll have to study the code because I'm not as good at Perl as I'd like to be but I much appreciate the working sketch and it looks like a solution is possibly easier to achieve than I imagined. Thanks once again.

      The bottom line is that the data and indexes reside on a disk somewhere. Your process can either:

      • Load them itself, follow the pointers and extract the required data.
      • Or:
        1. Compete for attention from a DB process.
        2. Wait for connection.
        3. Negotiate authorisation.
        4. Wait while the SQL parser parses your query.
        5. Wait while the DBM checks your query against the schema.
        6. Wait while the DBM finds the relevant data and indexes.
        7. Wait while the DBM optimises your query.
        8. Wait while it checks/frees cache space for your indexes.
        9. Wait while it actually runs the query and builds a results table.
        10. Wait while the communications negotiate the transfer of the results.
        11. Wait while your process allocates space to hold the results.
        12. Wait while it structures the results into a perlish data structure.
        13. Finally, do something with the results if you haven't run out of memory in the meantime.

      There are many good reasons for using an RDBMS, not least of which is that someone else does a lot of the work and maintenance, but performance is not one of them!


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.