fortunezhang has asked for the wisdom of the Perl Monks concerning the following question:

Dear Colleagues, I am parsing a large undirected graph, which is stored in a plain text file larger than 4GB. In this file, each line contains two nodes and their associated information. For example:
    node1  node2  weight  status
    23     34     -897    1
    24     46     -10     0
It is too large to fit in RAM, so I used the following code to tie it to a hash:
    use BerkeleyDB;
    use MLDBM qw(BerkeleyDB::Hash Storable);

    my %hash;
    my @seqRelation;
    $hash{'key'} = \@seqRelation;
    my $dbFile = '/tmp/relation.db';
    tie %hash, 'MLDBM', -Filename => $dbFile, -Flags => DB_CREATE
        or die $!;

    # read into the file content
    my $inFile = shift;
    open(IN, "zcat $inFile | ") or die "Can not open $inFile:$!";
    while (<IN>) {
        chomp;
        my @fields = split "\t";
        my ($seq1, $seq2, $w, $s) = @fields;
        push @{$seqRelation[$seq1]}, join(',', $seq2, $w, $s);
        # also store it in the other direction
        push @{$seqRelation[$seq2]}, join(',', $seq1, $w, $s);
    }
    ......
Here I used @seqRelation to store the data and then assigned it to the hash under the single key 'key', because BerkeleyDB::Hash only accepts a hash. The above code runs without warnings. However, I am confused by the size of the tied file /tmp/relation.db. When I checked it after the program finished, it was only 48K, although the original file is >4GB. That seems unbelievable (but maybe I am wrong, because I am not familiar with how BerkeleyDB works). Is this correct or normal? I expected /tmp/relation.db to be much larger, and I have no idea why it is so small. I am worried that some data was lost when tying. By the way, I also need to change the status values in my program. Any help or ideas are appreciated.

Thank you in advance! Best regards! Zhenguo

Re: Large files tied by BerkeleyDB with MLDBM
by Eliya (Vicar) on May 06, 2011 at 22:16 UTC

    First, the MLDBM tie mechanism serializes the data structure at the time you make the assignment. I.e., your

    my @seqRelation; $hash{'key'} = \@seqRelation;

    just stores an empty array.  Later changes to @seqRelation will not automatically be updated in the tied storage.  In other words, you'd need to assign it after you've pushed all your data onto it.
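
The snapshot behaviour described above can be reproduced without a database. The following is a minimal sketch, assuming the Storable serializer named in the question; the plain hash %store is only a stand-in for the tied hash, and freeze/thaw are called by hand where MLDBM would call its serializer. (In the posted code the assignment also appears before the tie, so that particular assignment probably never reaches the database file at all.)

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# %store stands in for the tied hash: it holds serialized strings only.
my %store;
my @seqRelation;

# Serializing now captures the array as it is at this moment (empty).
$store{key} = freeze(\@seqRelation);

# Later changes touch only the in-memory array, not the frozen copy.
push @seqRelation, '34,-897,1';

my $early = thaw($store{key});
print "elements in early snapshot: ", scalar(@$early), "\n";

# Re-assigning after the data is complete stores the filled array.
$store{key} = freeze(\@seqRelation);
my $late = thaw($store{key});
print "elements after re-assignment: ", scalar(@$late), "\n";
```

With a real MLDBM tie the usual pattern is the same: fetch the value into a temporary variable, modify it, and assign it back to the key.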

    Secondly, note that what you store under a single key will be stored as one string.  So I'm not really sure what you're hoping to achieve with this approach, memory-wise.  If your data doesn't fit into a single data structure in memory, it won't fit into @seqRelation either, and serializing this huge data structure would additionally require quite a lot of memory...

    What might work - if you want to stick with BerkeleyDB - is to store each record separately, e.g. using the line number as the key, which you could then use as a pseudo array index for retrieval.
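
A rough sketch of the one-record-per-key idea follows. It uses SDBM_File only because that module ships with Perl, so the example runs without extra installs; the helper name load_records, the temporary file location, and the two sample lines are assumptions for illustration. The same loop should work with a hash tied to BerkeleyDB::Hash, and SDBM's small per-record size limit makes it a poor fit for the real 4GB file.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);

# Hypothetical helper: store each input line under its line number.
sub load_records {
    my ($db, $fh) = @_;
    my $n = 0;
    while (my $line = <$fh>) {
        chomp $line;
        $db->{ $n++ } = $line;  # one small string per key
    }
    return $n;
}

my $dir = tempdir(CLEANUP => 1);
tie my %db, 'SDBM_File', "$dir/relation", O_RDWR | O_CREAT, 0666
    or die "Cannot tie: $!";

# Two sample records from the question, tab-separated.
my $sample = "23\t34\t-897\t1\n24\t46\t-10\t0\n";
open my $in, '<', \$sample or die "Cannot open sample: $!";
my $count = load_records(\%db, $in);
close $in;

# Retrieve by pseudo array index, change the status, and write it back.
my ($seq1, $seq2, $w, $s) = split /\t/, $db{1};
$db{1} = join "\t", $seq1, $seq2, $w, 1;

my $updated = $db{1};
print "$count records; record 1 is now: $updated\n";
untie %db;
```

Because every value is a short string, only one record is held in memory at a time, and changing a status is a fetch, an edit, and a store on a single key.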

      Thank you for your reply. I understand the problem now. Anyway, I found that Perl is not good at manipulating large datasets. Now I convert my data into a string before storing it and convert it back when retrieving it.
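
The exact string format used is not shown in the thread. One possible shape for such a conversion, purely as an illustration, keeps each node's neighbours in a single string keyed by node id. The helper names (append_neighbour, add_edge, neighbours) and the "neighbour,weight,status;..." layout are assumptions, and the plain hash %db stands in for any tied hash that stores strings.

```perl
use strict;
use warnings;

# Hypothetical layout: "neighbour,weight,status;neighbour,weight,status;..."
sub append_neighbour {
    my ($db, $node, $entry) = @_;
    my $old = $db->{$node};
    $db->{$node} = defined $old ? "$old;$entry" : $entry;
    return;
}

# Undirected graph: record the edge under both endpoints.
sub add_edge {
    my ($db, $seq1, $seq2, $w, $s) = @_;
    append_neighbour($db, $seq1, "$seq2,$w,$s");
    append_neighbour($db, $seq2, "$seq1,$w,$s");
    return;
}

# Convert the stored string back into a list of [neighbour, weight, status].
sub neighbours {
    my ($db, $node) = @_;
    my $packed = $db->{$node};
    return () unless defined $packed;
    return map { [ split /,/, $_ ] } split /;/, $packed;
}

my %db;    # stand-in for a hash tied to a DBM file
add_edge(\%db, 23, 34, -897, 1);
add_edge(\%db, 24, 46, -10,  0);
add_edge(\%db, 23, 46, 5,    0);

my @n = neighbours(\%db, 23);
print join(' ', @$_), "\n" for @n;
```

Keying by node keeps lookups direct, at the cost of rewriting a node's whole string each time an edge is added.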