kahn_ray@yahoo.com has asked for the wisdom of the Perl Monks concerning the following question:

Folks, I have a very large text file (100 million records) that I will need to store as a db file. We will do lookup operations on the file only, but they have to be very fast. I understand BerkeleyDB is pretty good at this kind of operation. I have little experience with BerkeleyDB and wanted to find out whether you could offer some words of advice and wisdom. I greatly appreciate your feedback. -r

Replies are listed 'Best First'.
Re: BerkeleyDB and a very large file
by graff (Chancellor) on Aug 22, 2008 at 01:42 UTC
    The retrieval rate for data in Berkeley DB files is very quick, and does not depend in any noticeable way on the amount of data in a given DB file. If the basic hash-lookup strategy is adequate to your needs, it's a good way to go.

    But doing the initial creation of a really large Berkeley DB file may take a very long time for a data set in your range (100 million records), so have patience (and budget time) for that step.
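
    A minimal sketch of the tie interface, assuming one tab-separated key/value pair per input line (the filename and record layout are made up for illustration):

        use BerkeleyDB;

        # Create (or open) the hash-format DB file.
        my $db = tie my %h, 'BerkeleyDB::Hash',
            -Filename => 'lookup.db',
            -Flags    => DB_CREATE
            or die "Cannot open lookup.db: $BerkeleyDB::Error\n";

        # One-time build -- this is the slow part for 100 million records.
        while (my $line = <STDIN>) {
            chomp $line;
            my ($key, $value) = split /\t/, $line, 2;
            $h{$key} = $value;
        }

        # After that, a lookup is just a hash access.
        print $h{'some-key'}, "\n" if exists $h{'some-key'};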

Re: BerkeleyDB and a very large file
by Illuminatus (Curate) on Aug 21, 2008 at 22:59 UTC
    BerkeleyDB does put a few limitations on you:
    1) It's not SQL. Should you decide you need to change databases for some reason, migrating will take more effort.
    2) I am not sure if BerkeleyDB supports network access. As previously mentioned, network access is significantly slower, but it may be advantageous should you find a need to access the data from more than one system.
    3) If you have lots of memory, and your record sizes are not too large, you can store the entire table in RAM, which makes MySQL much faster than using an on-disk table (see the sketch below).
    If you need to access the database very heavily, and the above are not important to you, BerkeleyDB is probably the way to go.
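
    For point 3, a rough sketch of an in-RAM MySQL table via DBI (database, table, credentials and the 2 GB cap are all invented; MEMORY tables cannot hold TEXT/BLOB columns, and whether 100 million rows actually fit depends on row size and available RAM):

        use DBI;

        my $dbh = DBI->connect('DBI:mysql:database=lookups;host=localhost',
                               'someuser', 'somepass', { RaiseError => 1 });

        # The session max_heap_table_size caps how large a MEMORY table may grow.
        $dbh->do('SET SESSION max_heap_table_size = 2147483648');   # 2 GB, an assumption

        $dbh->do(q{
            CREATE TABLE lookup_mem (
                k VARCHAR(64)  NOT NULL,
                v VARCHAR(255) NOT NULL,
                PRIMARY KEY (k)
            ) ENGINE=MEMORY
        });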

      Another one, quite important: BerkeleyDB files are platform dependent, i.e. you cannot use a little-endian file on a big-endian platform and vice versa.

      On the other hand, BerkeleyDB doesn't provide network support; it's file-based only.

Re: BerkeleyDB and a very large file
by perrin (Chancellor) on Aug 21, 2008 at 21:20 UTC
    Well, there really is nothing faster than BerkeleyDB for doing key lookups on data that's too big to fit in RAM. Depending on your experience, you might be satisfied with just using a local MySQL. If you try MySQL, make sure you connect to it as localhost. It makes a difference in performance.
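
    For what it's worth, with DBD::mysql a host of localhost normally means the client talks to the server over the local Unix socket rather than TCP. A minimal sketch (database name, table and credentials are made up):

        use DBI;

        # 'localhost' (as opposed to 127.0.0.1) goes through the local socket.
        my $dbh = DBI->connect(
            'DBI:mysql:database=lookups;host=localhost',
            'someuser', 'somepass',
            { RaiseError => 1 },
        ) or die $DBI::errstr;

        my $sth = $dbh->prepare('SELECT v FROM lookup WHERE k = ?');
        $sth->execute('some-key');
        my ($value) = $sth->fetchrow_array;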

      And use MyISAM instead of InnoDB or another option (I think there are more now?). It doesn't support foreign keys or transactional features, but unless something has changed in the last couple of years it's still the fastest option for plain lookups, and it can be tuned to your specific data if you know how.
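
      If you do go the MyISAM route, the kind of tuning meant here looks roughly like this (all names and numbers are placeholders; MAX_ROWS and AVG_ROW_LENGTH just pre-size the table for ~100 million short rows, and the server's key_buffer_size is worth raising so the primary-key index stays in memory):

          use DBI;

          my $dbh = DBI->connect('DBI:mysql:database=lookups;host=localhost',
                                 'someuser', 'somepass', { RaiseError => 1 });

          $dbh->do(q{
              CREATE TABLE lookup (
                  k VARCHAR(64)  NOT NULL,
                  v VARCHAR(255) NOT NULL,
                  PRIMARY KEY (k)
              ) ENGINE=MyISAM
                MAX_ROWS=100000000
                AVG_ROW_LENGTH=100
          });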

        InnoDB is usually faster for me, but for single key unique lookups like this I expect MyISAM will be better.
Re: BerkeleyDB and a very large file
by BrowserUk (Patriarch) on Aug 23, 2008 at 10:28 UTC

    Since no one else has asked, I will: what does your data look like?

    That is,

    • how long are the keys?
    • how long are the associated records?

    A couple of typical examples would be good. Configuring the page size and cache size parameters to suit your data can have a strong influence upon performance.
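
    For example (the numbers below are placeholders, not recommendations -- they only make sense once you know your key and record sizes):

        use BerkeleyDB;

        my $db = tie my %h, 'BerkeleyDB::Hash',
            -Filename  => 'lookup.db',
            -Flags     => DB_CREATE,
            -Pagesize  => 8 * 1024,           # must be set when the file is first created
            -Cachesize => 512 * 1024 * 1024   # in-memory cache; use whatever RAM you can spare
            or die "tie failed: $BerkeleyDB::Error\n";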

    Also, how are you going to access (search) the records? That is, will your accesses be:

    • Totally random;
    • Initially random followed by sequential;
    • Always sequential;
    • Other

    Selecting the 'right' access method--btree, hash, queue, recno--to achieve best performance depends a lot upon how you access the data. Without knowledge of how your application will operate, advice is, at best, generic.
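
    For instance, if the pattern is "random seek, then read a run of neighbouring records", a btree plus a cursor fits well (the file name and starting key are hypothetical):

        use BerkeleyDB;

        my $db = tie my %h, 'BerkeleyDB::Btree',
            -Filename => 'lookup.db',
            -Flags    => DB_RDONLY
            or die "tie failed: $BerkeleyDB::Error\n";

        my ($k, $v) = ('some-starting-key', '');
        my $cursor = $db->db_cursor();

        # Jump to the first key >= $k, then walk forward in key order.
        if ($cursor->c_get($k, $v, DB_SET_RANGE) == 0) {
            do {
                print "$k => $v\n";
            } while ($cursor->c_get($k, $v, DB_NEXT) == 0);
        }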


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: BerkeleyDB and a very large file
by tilly (Archbishop) on Aug 23, 2008 at 00:07 UTC
    Advice on your initial load. Loading that file naively means doing 100 million writes. Each write means seeking to read, reading, then seeking to write and writing. So that's 200 million seeks on disk. If your disk spins at 6000 RPM, it spins 100 times per second, and seek time averages out to the time for it to spin halfway. So you do about 200 seeks per second. That takes a million seconds, which is about a week and a half.

    Suppose each record takes 100 bytes. Then you have about 10 GB of data. Sorting with a merge-sort should take about 30 passes. Each pass involves reading and writing all of your data. So that's about 600 GB of data. Supposing that your disk has a sustained throughput of 60 MB/s, that should take 10,000 seconds, or about 3 hours. Plus CPU time. Minus your savings if your sort implementation is smart enough to do some passes in RAM to minimize writing to disk. But any way you cut it, far faster.
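
    The same arithmetic as a back-of-an-envelope script (every number here is just the assumption stated above):

        my $records       = 100_000_000;
        my $seeks_per_sec = 200;                        # 6000 RPM, half a rotation per seek
        my $naive         = 2 * $records / $seeks_per_sec;   # two seeks per record
        printf "naive load:      ~%.1f days\n",  $naive / 86_400;    # ~11.6 days

        my $bytes  = $records * 100;                    # 100 bytes per record, ~10 GB
        my $io     = 30 * 2 * $bytes;                   # 30 passes, read + write, ~600 GB
        my $sorted = $io / 60_000_000;                  # 60 MB/s sustained throughput
        printf "sort-based load: ~%.1f hours\n", $sorted / 3_600;    # ~2.8 hours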

    Therefore my standard suggestion for this kind of data volume is to use the Unix sort utility to sort your dataset, then make a BerkeleyDB btree and load that. (Btrees are stored in sorted order, so loading already-sorted data turns the random seeks into mostly sequential I/O.) This will make your initial data load take just a few hours rather than a week and a half.
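
    A sketch of that workflow, assuming tab-separated "key<TAB>value" lines (the filenames are made up):

        # First, outside Perl:
        #   LC_ALL=C sort -t "$(printf '\t')" -k1,1 records.txt > records.sorted

        use BerkeleyDB;

        my $db = BerkeleyDB::Btree->new(
            -Filename => 'lookup.db',
            -Flags    => DB_CREATE,
        ) or die "Cannot create lookup.db: $BerkeleyDB::Error\n";

        open my $in, '<', 'records.sorted' or die "records.sorted: $!";
        while (my $line = <$in>) {
            chomp $line;
            my ($key, $value) = split /\t/, $line, 2;
            $db->db_put($key, $value);   # keys arrive pre-sorted, so pages fill sequentially
        }
        close $in;
        $db->db_close();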

    As a bonus, for large datasets if there is any locality of reference in your requests (usually there is), a btree makes better use of cached data in RAM than a hash can.