Lowry76 has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I'm searching for a fast and convenient way to store and retrieve biological data from large data files (>2GB), but
I can't use any RDBMS etc., just flat files.
First I stored the complete file in XML format. To fetch a particular dataset I had to read in the file completely (in the worst case).

Then I read about DB_File/BerkeleyDB and figured that it might be useful. I tried it with a smaller description file (~50MB)
and everything seemed to work quite nicely; search time was significantly improved.
But then I tried one of the bigger files (1.2GB) and had to stop my script after 15 minutes because the resulting database file
was already >6GB!

My questions are:
1. Is it my code that's going wrong for large files (see code below)?
OR
2. Is this typical behavior of DB_File with huge data files, i.e. is it normal that the output file is more than 6 times bigger than the data file?
AND IF SO
3. Is there a better solution? (I read something about Tie::File instead of a DB_File tie...)


I appreciate any help.

Cheers,
Lowry
use strict;
use warnings;
use DB_File;
use Data::Dumper;

# Takes the output directory name $annotationDirName
# and a hash-ref of all attributes of a sequence and
# adds it to database '$annotationDirName/annotation.dat'.
# Given attribute hash is stored as hash-dump.
sub _printAnnotation {
    my ($annotationDirName, $h) = @_;

    # create new Berkeley DB
    my %database;
    tie %database, 'DB_File', "$annotationDirName/annotation.dat"
        or die "Can't initialize database: $!\n";

    my $dump = Data::Dumper->new([$h], [qw($annotation)])->Purity(1)->Dump();
    $database{$h->{'id'}} = $dump;

    untie %database;
}
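For reference, the matching retrieval side of this scheme could look roughly like the following. This is only a sketch: the sub name _getAnnotation is a placeholder, and it assumes the same modules loaded above.

# Fetches the dump stored under $id from '$annotationDirName/annotation.dat'
# and turns the Data::Dumper string back into a hash-ref.
sub _getAnnotation {
    my ($annotationDirName, $id) = @_;

    my %database;
    tie %database, 'DB_File', "$annotationDirName/annotation.dat"
        or die "Can't open database: $!\n";

    my $dump = $database{$id};
    untie %database;
    return undef unless defined $dump;

    # The dump assigns to $annotation, the name given to Data::Dumper above
    my $annotation;
    eval $dump;
    die "Can't eval stored annotation: $@\n" if $@;
    return $annotation;
}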

Re: DB_File/BerkeleyDB with large datafiles
by Tux (Canon) on Sep 20, 2010 at 11:26 UTC

    If DBD::SQLite is an option, even though it is not a real plain file, then Tie::Hash::DBD might be what you want. I wrote that module because I had a very similar problem.
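    A minimal sketch of what that might look like with DBD::SQLite as the backend (untested here; the file name annotation.db and the example key/value are made up -- check the Tie::Hash::DBD docs for the exact DSN and options):

        use strict;
        use warnings;
        use Tie::Hash::DBD;    # needs DBI and DBD::SQLite installed

        # Tie a plain hash to an SQLite file; keys and values land in a table
        tie my %database, 'Tie::Hash::DBD', 'dbi:SQLite:dbname=annotation.db'
            or die "Cannot tie hash to SQLite\n";

        my ($id, $dump) = ('seq0001', '...');   # example values only
        $database{$id} = $dump;                 # store, just like with DB_File
        my $fetched = $database{$id};           # fetch later by key

        untie %database;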


    Enjoy, Have FUN! H.Merijn

      I really like that suggestion, and would encourage you to pursue it if you can. An SQLite database has many of the advantages of a Berkeley DB file, but it can easily store more than one field. There might be size limits, however.

      Another thought: what if you stored the data as individual files in a pre-determined directory, then used (say) an SQLite database file simply to index them? You query the index to find the names of the files that contain the information you want, then open those individual files to get it. (A rough sketch of this layout follows at the end of this node.)

      File systems are, in effect, “databases” that are well-optimized to store gigabytes of data under millions of keys...
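      As a rough illustration of that layout (all names here -- the index.db file, the annotation_index table and its columns, and the data/ directory -- are invented for the example):

          use strict;
          use warnings;
          use DBI;

          my $id = 'seq0001';    # example id

          # SQLite file used purely as an index: id -> path of the file holding the data
          my $dbh = DBI->connect("dbi:SQLite:dbname=index.db", "", "",
                                 { RaiseError => 1, AutoCommit => 1 });
          $dbh->do("CREATE TABLE IF NOT EXISTS annotation_index
                    (id TEXT PRIMARY KEY, path TEXT NOT NULL)");

          # While writing each record out to its own small file:
          $dbh->do("INSERT OR REPLACE INTO annotation_index (id, path) VALUES (?, ?)",
                   undef, $id, "data/$id.dat");

          # Later, to retrieve a record: look up the path, then read just that file
          my ($path) = $dbh->selectrow_array(
              "SELECT path FROM annotation_index WHERE id = ?", undef, $id);
          open my $fh, '<', $path or die "Can't open $path: $!";
          my $record = do { local $/; <$fh> };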

Re: DB_File/BerkeleyDB with large datafiles
by BrowserUk (Patriarch) on Sep 20, 2010 at 12:58 UTC
    First I stored the complete file in XML format.

    Are you saying you converted the original data into XML? Or does the data come to you in XML?

      I converted it into XML. I parsed a huge file and wrote the information of interest to me to an XML file.
      But to retrieve information from that file I might have to read it in completely (if the dataset is the last one
      in the file). So I'm searching for something a little more convenient/faster but still just using flat files.

        Storing data used for random retrieval in XML is ... um ... never going to work very well.

        (Some would say that storing data in XML is a mistake full stop--me amongst them--but don't let that put you off :)

        Could you show us one or two records of your data? Also, indicate how you need to retrieve them, i.e. by record number, by some embedded identifier, etc.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: DB_File/BerkeleyDB with large datafiles
by graff (Chancellor) on Sep 21, 2010 at 00:51 UTC
    Given this bit in your snippet of code:
        ...
        tie %database, ...
        ...
        $database{$h->{id}} = $dump;
        ...
    It would seem that your intention/requirement is simply to fetch a chunk of data by means of a look-up key/id string. (That is, you don't need to search on the basis of the data content associated with the key/id.)

    If that's true -- and if it's also true that the data content linked to each key/id remains static -- then it might suffice to have the full (2+GB) data file in whatever form is most convenient generally (assuming this is read-only), and build a separate index (e.g. with Berkeley DB) in which each key/id string is associated with the byte offsets for the associated data chunk in the big file.

    It would be quick and easy to do one pass over the big file to create a separate listing of key/id strings with the byte-offsets of their associated records (e.g. start_byte, n_bytes). Then build a DB_File (or equivalent) index of the keys and byte-offsets (instead of keys and data blocks).

    Depending on how many records you have, a full hash of key+byte_offset pairs might even fit in ram...
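    In code, the two passes might look roughly like this (untested sketch: the file names are invented, the id extraction is only stubbed out since the record format isn't shown, and the 'Q' pack template assumes a 64-bit Perl, which you'll want anyway for >2GB offsets):

        use strict;
        use warnings;
        use DB_File;

        # Pass 1: scan the big file once, remembering where each record starts
        # and how long it is, keyed by the record's id.
        tie my %index, 'DB_File', 'annotation.idx'
            or die "Can't create index: $!\n";

        open my $in, '<', 'bigfile.dat' or die "Can't open data file: $!";
        my $offset = tell $in;
        while (my $record = <$in>) {             # assumes one record per line
            my ($id) = $record =~ /^(\S+)/;      # stub: extract the id however your format requires
            $index{$id} = pack 'QQ', $offset, length $record;
            $offset = tell $in;
        }
        close $in;
        untie %index;

        # Pass 2 (any time later): jump straight to a record by id, no full scan.
        tie my %idx, 'DB_File', 'annotation.idx' or die "Can't open index: $!\n";
        my $wanted_id = 'seq0001';               # example id
        my ($start, $len) = unpack 'QQ', $idx{$wanted_id};
        open my $data, '<', 'bigfile.dat' or die "Can't open data file: $!";
        seek $data, $start, 0;
        read $data, my $record, $len;
        untie %idx;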

      Actually, I think I will give it a try. I was aware of this kind of solution but have never tried something like it before.
      I'll have to look up how to do it, but I guess it will be the best solution.
      Thanks