stephentyler has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I'm in a classic position of having data in a flat file that has grown large over time and is starting to take too long to search. The technology needs to move on. I've searched the database here and haven't found much about large constant databases, hence this...

The existing setup
15,000 records/lines, each with about 12 pipe-delimited fields.
File size - 15 MB
Each week we add about 40 new records to the database - it's a growing problem!
It's essentially a constant database: in real time we don't add records or sort, etc.

Of the 12 fields we allow searching on 7 defined fields. Sometimes we just search on one field but much of the time we search on 3 or more fields.

For info, the 7 fields we search on occupy about 10 MB of the 15 MB total.

Searching is done with an old and somewhat modified Salina Sol CGI Perl script which iterates over the text file one line at a time.

The script runs under plain CGI - not mod_perl or any special environment. The web server is Apache.

Searching on one field takes about 10 seconds to pull up 40 results, and searching on 3 fields takes about 12 seconds.

I like the existing setup because we can see the data easily and correct any problems etc.


A new setup

I'd like a system that could handle 45,000 records of the existing size and search them on the same 7 fields in reasonable time - say 5 seconds on my setup.

I'm not too keen to go down a full-blown MySQL-type route because I believe that searching on multiple fields will be slower than other possible solutions, and I don't particularly need the benefits of a relational DB. I know some people will be appalled at this view and to them I apologise in advance.

I'm open to other suggestions on the best way forward (and I suppose relational solutions too, if they can meet the spec). It occurred to me that I could put all the information in a cdb database, each line having a unique key, and have a smaller database, with just the 7 fields and the key, held in memory. The figures would show that as being about 15 MB in size. If it were done with mod_perl then the searchable bit of the database would be loaded once from a disk file and thereafter searches should be pretty quick - if with the on-cost of getting the full entries out of the cdb database. But I have never used mod_perl before and my Perl experience is pretty low, I have to say.

Thanks for your help and ideas...

Replies are listed 'Best First'.
Re: Large Constant Database in Text File
by shmem (Chancellor) on Aug 31, 2006 at 10:28 UTC
    I second the use of SQLite (DBD::SQLite) together with DBI. It's basically a library which operates on databases contained in single files. Easy to set up and operate.
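
    For a sense of what this looks like, here is a minimal sketch, assuming the flat file has already been imported into a table called "records" whose searchable columns are named f1..f7 (the table and column names are invented for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Connect to the single-file database; DBD::SQLite needs no server.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=records.db', '', '',
                            { RaiseError => 1 } );

    # Partial match on one field, exact match on another.
    my $sth = $dbh->prepare(
        'SELECT * FROM records WHERE f1 LIKE ? AND f3 = ?'
    );
    $sth->execute( '%Tim%', 'some value' );

    while ( my $row = $sth->fetchrow_hashref ) {
        print join( '|', map { $row->{$_} } sort keys %$row ), "\n";
    }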

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Large Constant Database in Text File
by monkey_boy (Priest) on Aug 31, 2006 at 10:07 UTC
    sqlite!


    This is not a Signature...
Re: Large Constant Database in Text File
by erix (Prior) on Aug 31, 2006 at 10:35 UTC

    Obviously, you need *some* sort of indexing.

    I use Berkeley DB (formerly http://www.sleepycat.com/ - now bought by Oracle) with the BerkeleyDB Perl module to index large flat files (on substrings, regexen, whatever).

    This will let you keep a setup with the original text files (which you seem to prefer over parsing them to bits into a regular RDBMS). Of course, after editing the text file you have to rebuild the index.

    Update: BerkeleyDB bought by Oracle? Out with it! :) (that data is now in PostgreSQL)
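
    To make the idea concrete, a rough sketch of building such an index, assuming field 0 is one of the searchable fields and that each value maps to the byte offsets of the lines containing it (file names and field choice are placeholders; rerun this after every edit to the text file):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use BerkeleyDB;

    tie my %idx, 'BerkeleyDB::Hash',
        -Filename => 'field0.db',
        -Flags    => DB_CREATE
        or die "Cannot open index: $! $BerkeleyDB::Error\n";

    open my $fh, '<', 'data.txt' or die $!;
    my $offset = 0;
    while ( my $line = <$fh> ) {
        chomp $line;
        my @fields = split /\|/, $line;
        # Append this line's offset to the list stored for the field value.
        $idx{ $fields[0] } = defined $idx{ $fields[0] }
                           ? "$idx{ $fields[0] } $offset"
                           : $offset;
        $offset = tell $fh;
    }
    close $fh;

    # A search can now look up a value in %idx and seek() straight to
    # the matching lines in data.txt.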

Re: Large Constant Database in Text File
by BrowserUk (Patriarch) on Aug 31, 2006 at 18:15 UTC

    With these volumes of data, 45,000 records/45 MB and 7 of 13 fields indexed using a Storable index, I achieved sub-1-second times to:

    1. Load both the data and index;
    2. Perform 1000 7-key searches;
    3. Retrieve 3 records for each of the 1000 matches.

    Benchmark output:

    c:\test>570563.pl
    1 trial of load index (593.750ms total)
    1000 trials of Perform 7-key search and retrieve 3 records (113.182ms total), 113us/trial

    c:\test>570563.pl
    1 trial of load index (625ms total)
    1 trial of load datafile (94.282ms total)
    1000 trials of Perform 7-key search and retrieve 3 records (117.603ms total), 117us/trial

    c:\test>570563.pl
    Check memory
    1 trial of load index (640.625ms total)
    1 trial of load datafile (112.418ms total)
    1000 trials of Perform 7-key search and retrieve 3 records (116.873ms total), 116us/trial

    Betcha can't achieve that performance with an RDBMS :)

    POC code:

    Flat file indexer (runs once):

    #! perl -slw
    use strict;
    use Storable;

    open DB, '<', '570563.dat' or die $!;

    my @index;
    my $offset = 0;
    while( <DB> ) {
        chomp;
        my @fields = split '\|';
        push @{ $index[ $_ ]{ $fields[ $_ ] } }, $offset
            for 0, 2, 4, 6, 8, 10, 12;    ## 7 x 100 byte fields indexed.
        $offset = tell DB;
    }
    close DB;

    store \@index, '570563.idx';    ## Index stored to file

    Search & retrieval benchmark (999 sets of 7 search keys in the __DATA__ section omitted for posting):

    #! perl -slw
    use strict;
    use Benchmark::Timer;
    use Storable;

    my $T = new Benchmark::Timer;

    $T->start( 'load index' );
    my $index = retrieve '570563.idx';
    $T->stop( 'load index' );

    $T->start( 'load datafile' );
    my $ramfile;
    open my $dataFH, '<', '570563.dat' or die $!;
    sysread $dataFH, $ramfile, -s( '570563.dat' );
    close $dataFH;
    open my $ramFH, '<', \$ramfile;
    $T->stop( 'load datafile' );

    ## printf 'Check memory'; <STDIN>;

    while( <DATA> ) {
        $T->start( 'Perform 7-key search and retrieve 3 records' );
        chomp;
        my @fields = split '\|';

        my %matches;
        $matches{ $_ }++ for
            map{ @{ $index->[ $_ ]{ $fields[ $_ ] } } } 0, 2, 4, 6, 8, 10, 12;

        for ( grep{ $matches{ $_ } == 7 } keys %matches ) {
            seek( $ramFH, $_, 0 );
            # print "$_: ",               ## Disable printing for benchmark
            substr <$ramFH>, 0, 100;
        }
        $T->stop( 'Perform 7-key search and retrieve 3 records' );
    }

    $T->report;

    __DATA__
    rlJUCO1XGHEA1lULImByZiPS7rIkJldqyQrc9gTvppxtOe3Ae6jIHbTnFKLCHZVco8T2lKwz1HSnZ1bunO8gzwq9ftDtVWvjpuJU||Cg73CVHXVUiORGbVcXVVi0OueBikEIP7KvQKkegTX6Co6wDqow5hRvYHEaWCrtkn1LZ6xR6Lnz1hmKSqOEy0WYBKOJ7tnRgJ7gYV||BgvLwMdDqdQSyexHXtJWIhs5xPoePlH8I9I8AbZpOqpmFOD6N0PAzcgTxrr6BHBvSdFGD4t8Hm2o1mrhASjKXiaYz2HgrQBUBn3A||3gNrTZcXj9hyoVuBWfLxmwKqkaEqmErDJ7zDaJvIQHrSoGgZm3Gzz1oTbg2lGB8m459jpzdEUIFanQHMAAfXuMPEjA9EdwgWAC7q||TYKbYAv6m5vTvj7Hs7yIG86HeylN1nTkZgbc4P4MIULrYXHilMZQIJ0QtuNlt8IP45kCPhY52Ecrf0LxdEYeCWw5vhzSoXOtCKAL||6GFr5Kwz1SvVaCChBoBQg6cJBGtvBcEIi8wbymtL8JkISoz9lVA5S1hZepCJ3s3j5H9ui2LNfVaiKlSG6cpzeCajS710UxkNB3ym||jq27ex5ZwtuVXyGQ2zvU4xchkvI8CdwudMX64tm196yMExgY7MmmYo8tzAo0hc5dcyrQDR5AfFdEQr7EnNPecuzlETBFD3Kn7YcP
    [999 similar records omitted for posting]

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thank you - 1-second searches sound rather good!

      I have to say that it's noticeable that the 2 responses that talk performance come from people comfortable with a non-relational approach...

      I'd be interested to hear from any of the relational-database proposers about an equivalent relational solution to what's here, and the performance delivered.

      I'll have to study the code because I'm not as good at Perl as I'd like to be, but I much appreciate the working sketch, and it looks like a solution is possibly easier to achieve than I imagined. Thanks once again.

        The bottom line is that the data and indexes reside on a disk somewhere. Your process can either:

        • Load them itself, follow the pointers and extract the required data.
        • Or:
          1. Compete for attention from a DB process.
          2. Wait for connection.
          3. Negotiate authorisation.
          4. Wait while the SQL parser parses your query.
          5. Wait while the DBM checks your query against the schema.
          6. Wait while the DBM finds the relevant data and indexes.
          7. Wait while the DBM optimises your query.
          8. Wait while it checks/frees cache space for your indexes.
          9. Wait while it actually runs the query and builds a results table.
          10. Wait while the communications negotiate the transfer of the results.
          11. Wait while your process allocates space to hold the results.
          12. Wait while it structures the results into a perlish data structure.
          13. Finally, do something with the results if you haven't run out of memory in the meantime.

        There are many good reasons for using an RDBMS, not least of which is that someone else does a lot of the work and maintenance, but performance is not one of them!


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Large Constant Database in Text File
by aquarium (Curate) on Aug 31, 2006 at 10:13 UTC
    One of the prime reasons for using a relational database engine is to index fields so you can search quickly. The database engine takes care of caching blocks of the indexes etc., so it doesn't do a sequential scan. You haven't said what kind of searches you're doing on the fields, so I can't give a definite answer on whether a database engine would really benefit. Complex searches that try to partially match fields usually can't use an index anyway. If the searches lend themselves to an initial screening, then you could use the grep utility and output to a temp file that you then scan.
    the hardest line to type correctly is: stty erase ^H
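
    A rough sketch of that initial-screening idea, letting the external grep narrow the file down on one fixed string before Perl applies the finer, field-by-field tests (the file name and field position are placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $term = 'Timbrals';

    # -F: fixed-string match, -i: case-insensitive; grep discards most
    # lines cheaply so Perl only has to examine the survivors.
    open my $grep, '-|', 'grep', '-F', '-i', $term, 'data.txt'
        or die "Can't run grep: $!";

    while ( my $line = <$grep> ) {
        chomp $line;
        my @fields = split /\|/, $line;
        # Confirm the term occurs in the field we actually care about.
        print "$line\n" if $fields[3] =~ /\Q$term\E/i;
    }
    close $grep;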
      Hi

      Thank you for asking for more info, rather than saying it has to be relational, relational, relational!!

      To use your terminology, it looks like I am doing a complex search - the vast majority of searches we do (probably over 95%) are to partially match fields.

      So if I have a field with this text in "Timbrals Theatre"

      then a search on "Tim" would be a match. The script also has a 'match whole words' option, which if activated would mean the search on "Tim" would fail, but a search on "Timbrals" or "Theatre" would match.

      From the context it sounds like the relational solution would struggle here - is that right?
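
      For reference, this kind of partial match maps onto SQL's LIKE operator; a minimal sketch with invented table and column names (an unanchored '%Tim%' pattern generally cannot use an ordinary B-tree index and forces a scan, while an anchored 'Tim%' prefix can):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect( 'dbi:SQLite:dbname=records.db', '', '',
                              { RaiseError => 1 } );

      my $sth = $dbh->prepare('SELECT * FROM records WHERE venue LIKE ?');

      # '%Tim%' matches "Timbrals Theatre" anywhere in the field (scan);
      # 'Tim%' matches only at the start of the field (index-friendly).
      for my $pattern ( '%Tim%', 'Tim%' ) {
          $sth->execute($pattern);
          while ( my @row = $sth->fetchrow_array ) {
              print join( '|', @row ), "\n";
          }
      }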

Re: Large Constant Database in Text File
by perrin (Chancellor) on Aug 31, 2006 at 13:24 UTC
    If search performance is your concern, you won't beat an RDBMS. This is what databases do. The db will look at the fields being searched, choose the best index that eliminates the most rows, and scan the rest. If you use a recent one like MySQL 5, it may even merge multiple indexes if possible to find the records.

    If you have some reason why you just can't use an RDBMS, you'll have to build your own lame one. You can do it by indexing the most commonly searched fields with something like SDBM or BerkeleyDB, picking an index, and scanning the resulting rows. This will probably be more work for less speed though.

Re: Large Constant Database in Text File
by Tanktalus (Canon) on Aug 31, 2006 at 17:47 UTC

    I'm not going to second (third? fifth? whatever... ;-}) the SQLite route. Well, not right away.

    It seems that you have some sort of aversion to RDBMS's that I don't fully understand, but we can work with it. I suggest you convert your application to use DBI with DBD::CSV instead. DBD::CSV, contrary to what the name seems to imply, can use pipes as delimiters just fine, if you tell it to.

    Once you've done that, you can fairly easily swap out the underlying datastore from flat, pipe-delimited files to SQLite or mySQL or DB2 or Oracle or ... whatever you want. Largely, this will mean just changing the DBI->connect line after moving your data over.
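
    A minimal sketch of that setup, assuming the flat file has no header line so the column names (f1..f12 here) are declared by hand; the file and table names are placeholders:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:CSV:', undef, undef, {
        f_dir        => '.',    # directory holding the flat file
        csv_sep_char => '|',    # pipe-delimited rather than comma-delimited
        RaiseError   => 1,
    } );

    # Map the "records" table onto the flat file and name its columns.
    $dbh->{csv_tables}{records} = {
        file      => 'data.txt',
        col_names => [ map { "f$_" } 1 .. 12 ],
    };

    my $sth = $dbh->prepare('SELECT f1, f2 FROM records WHERE f3 LIKE ?');
    $sth->execute('%Tim%');
    while ( my $row = $sth->fetchrow_arrayref ) {
        print join( '|', @$row ), "\n";
    }

    Swapping to SQLite later would then largely be a matter of changing the connect call to 'dbi:SQLite:dbname=...' and dropping the csv_tables mapping, as described above.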

    To be honest, DBD::CSV has been sufficient for me for a number of projects. But none of them have 15,000 records ;-) I went down this road thinking "this is a bunch of data - I should use some sort of query language to retrieve it, but I don't want to set up a database for only 200 records." So I used DBI to ensure I can switch out the underlying data structure at will. Or at least something close to "at will" - if someone writes a DBD::XML, I could use XML to store the data, but in the meantime, I have a lot of flexibility.

    I would predict that once you convert this way, you'll be tempted to look at the real DBMS's, and, once you do, you will wonder why you took so long to switch. Your lookup times will drastically drop, and your manager will really like it. ;-)

Re: Large Constant Database in Text File
by jdtoronto (Prior) on Aug 31, 2006 at 20:52 UTC
    I am going to stick my oar in and say, why not just go to an RDBMS, which you may already have installed on your machine anyway? And if not, it won't be difficult.

    Over a period of years I, my contractors, and staff had written many applications, and I found we had used BerkeleyDB, SQLite, DBD::CSV and MySQL. In the end this introduced a hopeless mess of things on the servers which became a nightmare to maintain and back up.

    About two years ago I decided enough was enough - if it is in MySQL then we can maintain a single back-up strategy. So I have applications that have from 50 to 60,000,000 records, yep, that's right! All using MySQL and we don't have to worry. Even a 60 million record table doesn't take ALL that long to search.

    In some applications, we even, horror of horrors according to some monks, store HTML templates in the database so we can use them from anywhere. This may not be music to the purist, but in an organisation where I usually only have one full-time staffer and maybe two or three piece-work contractors I need to keep things so I can maintain them and back them up with minimal fuss.

    My MySQL gets backed up automatically every night, so I can be reasonably confident of my data security.

    jdtoronto

Re: Large Constant Database in Text File
by davido (Cardinal) on Aug 31, 2006 at 16:46 UTC

    ...and I don't need the benefits of a relational db particularly.

    But what you're describing is specifically a set of needs met by the benefits of a relational DB.

    I third or fourth the suggestion for DBD::SQLite. It is an excellent solution for a lightweight database which provides the features (and benefits) that will meet the needs you've outlined in your post. Plus, it's self-contained; install the module, install DBI, and write a little script to convert your existing flat file to a SQLite database. Then modify your main script to use DBI instead of flat file lookups, and you're done. The actual database will consist of a single file. And you don't have to install all sorts of heavyweight database software. Performance is very good. Indexed lookups are fast. It sounds to me like an ideal solution for the criteria you've defined.
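
    One possible shape for that little conversion script, with invented file, table and column names, and the seven searchable columns assumed to be f1..f7:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my @cols = map { "f$_" } 1 .. 12;    # 12 pipe-delimited fields

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=records.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );

    $dbh->do( 'CREATE TABLE records ('
            . join( ', ', map { "$_ TEXT" } @cols ) . ')' );

    my $ins = $dbh->prepare(
        'INSERT INTO records VALUES (' . join( ', ', ('?') x @cols ) . ')'
    );

    open my $fh, '<', 'data.txt' or die $!;
    while ( my $line = <$fh> ) {
        chomp $line;
        my @fields = split /\|/, $line, -1;       # keep trailing empty fields
        push @fields, '' while @fields < @cols;   # pad short lines
        $ins->execute( @fields[ 0 .. $#cols ] );
    }
    close $fh;

    # Index the seven searchable columns so lookups on them are fast.
    $dbh->do("CREATE INDEX idx_$_ ON records ($_)") for @cols[ 0 .. 6 ];

    $dbh->commit;
    $dbh->disconnect;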


    Dave

Re: Large Constant Database in Text File
by bart (Canon) on Aug 31, 2006 at 18:25 UTC
    Constant database? Take a look at CDB_File (and cdb):
    cdb is a fast, reliable, lightweight package for creating and reading constant databases.
    You'll have to recreate it every time you need to change or add a few records.

    I also have to point out there are still a few as yet unresolved problems compiling it on Windows.
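
    A small sketch of that approach, keying each full record by its line number (the file names and choice of key are just for illustration; the .cdb is rebuilt from scratch after every change to the flat file):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CDB_File;

    # Build step - rerun whenever the flat file changes.
    my %records;
    open my $fh, '<', 'data.txt' or die $!;
    while ( my $line = <$fh> ) {
        chomp $line;
        $records{$.} = $line;    # line number as the unique key
    }
    close $fh;
    CDB_File::create( %records, 'data.cdb', "data.cdb.$$" )
        or die "Couldn't create data.cdb: $!";

    # Lookup step - e.g. from the CGI script, fetch a full record by key.
    tie my %cdb, 'CDB_File', 'data.cdb' or die "Couldn't tie data.cdb: $!";
    print "$cdb{42}\n" if exists $cdb{42};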

Re: Large Constant Database in Text File
by Anonymous Monk on Sep 01, 2006 at 23:54 UTC
    For 45,000 records, 15 MB, I'm pretty sure you could get an answer from a flat file in less than 5 seconds - at least that's what I am getting with my File::Tabular module. Give it a try if you want to stick to a flat file - and let me know about your results.
Re: Large Constant Database in Text File
by mattr (Curate) on Sep 05, 2006 at 07:47 UTC
    15 MB is not a lot of data; a <1 sec. response seems possible (as another poster notes) if it uses pre-prepared indices. Possibly even without an index, the pure C searching within a database may be in that range. You are losing time with I/O; I'd be surprised if a regex-based search on data that is already in memory even took as long as you say.

    Anyway, it is true that DBs have limited full-text search functionality; what you are asking for is a LIKE (or wildcard) search, plus maybe Boolean operators. It will be a lot easier to use a DB, really.

    On the other hand, I've searched 1GB of data without a relational database in 0.1 seconds (using C++ based htdig behind a mod_perl wrapper). I've searched 10 megabytes of data with a single index and regex in about 1-2 seconds too and that was on a 133MHz P2 IIRC.

    Typically these speeds are achieved without an RDBMS by precompiling inverted indices (hashes) on the columns (keys) in which you are interested. For wildcard searches I have seen a technique that builds a hash including all substrings of every word. In reality, though, the maintenance of these inverted indices is a pain (they have to be rebuilt periodically, and often you end up trying to tweak mysterious parameters to improve performance... also sometimes there's no wildcard support).
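
    A sketch of that all-substrings technique for one searchable field, with Storable holding the result (field choice and file names are placeholders; note the index can grow large, and duplicate offsets are left in for simplicity):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Storable qw(store);

    my %sub_index;
    open my $fh, '<', 'data.txt' or die $!;
    my $offset = 0;
    while ( my $line = <$fh> ) {
        chomp $line;
        my @fields = split /\|/, $line;
        for my $word ( split ' ', lc $fields[3] ) {   # words of one field
            for my $start ( 0 .. length($word) - 1 ) {
                for my $len ( 1 .. length($word) - $start ) {
                    push @{ $sub_index{ substr $word, $start, $len } }, $offset;
                }
            }
        }
        $offset = tell $fh;
    }
    close $fh;

    # Rebuilt periodically (e.g. nightly); the searcher loads it once with
    # retrieve() and a '%foo%'-style lookup becomes a single hash fetch.
    store \%sub_index, 'substr.idx';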

    So I'd also recommend a database if you can get one. But if not, then yes, for the scale you are talking about you ought to be able to get far better performance than now with precompiled (and periodically updated) indices, maybe just using standard Perl data-storage modules. But don't step through your current text files a line at a time; that is the job your index generator will do every night.

Re: Large Constant Database in Text File
by Anonymous Monk on Sep 05, 2006 at 16:04 UTC
    If your partway searches are from the beginning of the words, an index should work fine, because essentially it is a sorted list of words on which a binary search can be performed. If a database, on seeing "partway search", stupidly refuses to use an index, whether it's from the beginning or not, you can always make your own (index, not DB :-). But let's talk about the other possibility: random starting points. Then your best way is to store your text bzip2-compressed, because a Burrows-Wheeler-transformed text is its own index for full-text search. The precise algorithm is in an old DDJ, but I don't know if it's online by the time you read this. Google for the names of the developers.
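
    A tiny illustration of that first point - with a sorted word list, a search anchored at the beginning of the word is a binary search (the word list and prefix are made up):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @words  = sort qw(ballet cabaret theatre timbrals tivoli);
    my $prefix = 'ti';

    # Binary-search for the first word that is ge the prefix...
    my ( $lo, $hi ) = ( 0, scalar @words );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        $words[$mid] lt $prefix ? ( $lo = $mid + 1 ) : ( $hi = $mid );
    }

    # ...then walk forward while the prefix still matches.
    while ( $lo < @words && index( $words[$lo], $prefix ) == 0 ) {
        print "$words[$lo]\n";
        $lo++;
    }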