Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have two questions and I'd appreciate any sample code to solve it. I've used to store a large hash in the memory and since it is large I had to have more than 20-30 GB of ram to do it. Since I could access it I was running my code in this way. Now I want to convert my hash into database in the physical disc and then access it using sql/mysql queries whenever I want. I have this code to read my hash:
sub read_dict{ my $file=$_[0]; my $p=$_[1]; my $direction=$_[2]; my %dict1; open( FILE, "gunzip -c $file |") or die $!; # for each line of input file while (<FILE>) { chomp; local ($l1, $l2, $pr) = split /\|\|\|/; chomp ($l1, $l2,$pr); my @pr = split (/\s/,$pr); chomp $pr[1]; if ($direction) { if ($pr[3]>=$p) { $l1 =~ s/[[:punct:]]//g; $l1 =~ s/\s+/ /g; $l2 =~ s/[[:punct:]]//g; $l2 =~ s/\s+/ /g; $l2 =~ s/^\s+//; $l2 =~ s/\s+$//; $l1 =~ s/^\s+//; $l1 =~ s/\s+$//; chomp ($l1, $l2); if ($l2 && $l1) { push( @{$dict1{lc($l1)}}, lc($l2) );} }} else { if ($pr[1]>=$p) { $l1 =~ s/[[:punct:]]//g; $l1 =~ s/\s+/ /g; $l2 =~ s/[[:punct:]]//g; $l2 =~ s/\s+/ /g; $l2 =~ s/^\s+//; $l2 =~ s/\s+$//; $l1 =~ s/^\s+//; $l1 =~ s/\s+$//; chomp ($l1, $l2); if ($l2 && $l1) { push( @{$dict1{lc($l2)}}, lc($l1) );} }} } # return the hash of arrays!!! return \%dict1; }
This was the first part of the problem that I would like to store the hash line by line possibly in a database. Since, I'm not familiar with databases, you sample code would really help me.
Regarding the access to the hash I have such codes :
if (grep {$_ eq $string} @uniq_strings) { $match++ ; } #using hash elsif (exists $dict{$string}) { # loop over all elements in array of hash LOOP: foreach $element (@{$dict{$string}}) { if (grep {$_ eq $element} @uniq_strings) { $match++ ; last LOOP; } } }
Si I want to convert the dict search from hash to SQL format since it is the way to access database. I'd appreciate your helps.

Replies are listed 'Best First'.
Re: storing a large hash in a database
by hbm (Hermit) on Dec 02, 2011 at 16:26 UTC

    I had a process with a hash that grew too big for RAM, and tied it to disk with a few simple lines:

    use DB_File; # too big for RAM now my $db = '/path/to/some.db'; my %hash; unlink $db; # to start fresh tie (%hash, "DB_File", $db, $DB_BTREE) or die("Unable to tie $db: $!") +; # do things as before untie %hash;

    Performance is fine.

      Interesting, may I know if the database would be deleted after the program ends or it will be kept in the physical memory. In the case that it is deleted can I force it to be in physical memory without deleting it?

        It is not in memory. And if you don't unlink the file, it remains on disk; and if you re-tie, you have access to it.

        In the snippet I gave, I unlink before tie'ing; because for my application, I want to start anew every time.

      With your great advise, I could do it with DB_File. Now I have a DB_file (*.db) in my hard disk which is about 5 GB, However, I can not re-access it.
      $DB_BTREE->{'flags'} = R_DUP ; my $x = tie %dictf, "DB_File", "f.db", O_RDWR|O_CREAT, 0666, $DB_BTREE + or die "Cannot open file f.db: $!\n"; my $number = keys %dictf; print "$number\n"; my $number = keys %dictf; print "$number\n"; $key = $value = 0 ; for ($status = $x->seq($key, $value, R_FIRST) ; $status == 0 ; $status = $x->seq($key, $value, R_NEXT) ) { print "$key -> $value\n"; }
      and the result is:
      0
      while I have:
      -rw-r--r-- 1 xxxxx xxx 4.2G Dec 3 14:03 f.db
      as my database. What do I do wrong?

        Try tie'ing without O_CREAT:

        tie %dictf, "DB_File", "f.db", O_RDWR, 0666, $DB_BTREE or die "Cannot +open file f.db: $!\n";

        So, in hindsight, my unlink was unnecessary. And note that the default option is O_CREAT|O_RDWR, so you need to be explicit if you don't want to wipe it out.

      Found this through Google and it helped me filter dupes out of massive files (the %seen hash got too big for memory). Thanks!
      Great but one question is that if I want to do this when you tie hash then it will be written in the local memory? Because reading my file into hash is really RAM consuming.

        Did you read the previous answer or the DB_File documentation?

        DB_File stores the data on disk using the Berkeley DB library, and it uses tie so that it looks like your data was stored in an ordinary hash.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        Short answer is no. Try it and see!

Re: storing a large hash in a database
by TJPride (Pilgrim) on Dec 02, 2011 at 16:31 UTC
    What does your data structure do, exactly? I'm not understanding the logic, and a good grasp of the logic is needed before a database structure can be suggested. Can you give a few examples of your dictionary data and what you're searching against it with?
      the dictionary file structure is like this:
      text ||| text ||| number,number,number,number,number
      while I care only about the texts and the third/first number. The dictionary has millions of lines. If I can read it once and write it to the physical disc as a database, I don't need to load it each time in the memory and will do my search with SQL queries.

        It really depends on your data. It appears that you take the first two fields (I'll call them text1 and text2) and use one as the key to a hash, pushing the other onto an array which is the value of that key, and which is the key and which goes onto its array depends on whether a field later in the line is greater than some parameter you supply.

        So you end up with a hash of arrays. Since your hash keys are unique, they could be one field in your database. But we don't know what ends up in your arrays. Do they all have the same length? Are they consistent in what they hold? To map your hash to a SQL table in a one-to-one fashion, the hash values all need to be structured the same.

        As an example, let's say your file looks like this:

        A ||| Adam ||| 1,2,3,4,5 B ||| Bob ||| 1,2,3,4,5 B ||| Brian ||| 1,2,3,4,5 Carl ||| C ||| 3,2,1,4,5 Bruce ||| B ||| 3,2,1,4,5 C ||| Cheryl ||| 1,2,3,4,5

        And you call your sub with a $p value that causes the single-letter values to be used as the keys, with the longer texts pushed onto them as arrays. You'll end up with this hash:

        %hash = ( A => [ 'Adam' ], B => [ 'Bob', 'Brian', 'Bruce' ], C => [ 'Carl', 'Cheryl' ]);

        Here, your hash data doesn't "line up" with columns in a database table, since you have different numbers of fields for each one. So in this case, you'd either need to concatenate your arrays into a single "name" field with a delimiter, or have multiple records for each key. One of the following layouts:

        letter names ------------------------ A Adam B Bob;;Brian;;Bruce C Carl;;Cheryl letter name ------------------------ A Adam B Bob B Brian B Bruce C Carl C Cheryl

        Generally, the second method is best, because it simplifies searches and saves the cost of splitting on the delimiter every time you need a name. But it depends a lot on what your actual data looks like. Your first step should be to think about your data and design a table for it, and then figure out how to insert the values. It doesn't matter that you're putting it in a hash now; think about how it should be structured in terms of set columns and go from there.

        Aaron B.
        My Woefully Neglected Blog, where I occasionally mention Perl.

        Need to see examples of the data. What text? What number formats? What does it represent? What are you searching against it with? If we're left in the dark it's a waste of time to go crawling around randomly.
Re: storing a large hash in a database
by Anonymous Monk on Dec 02, 2011 at 12:00 UTC
    Start with learning the basics, code samples are available:

    http://oreilly.com/catalog/perldbi/

    http://dbi.perl.org/

    DBIx::Class::Manual

    I don't know why you think asking the monks to do your work is a good idea.

      I did not mean to ask them to work for me. I have a sample code and I have my code running. it was a part of my learning procedure to know about it :)

        What does that mean?

        I have a sample code and I have my code running.

        So you're saying you've already finished this task?

        it was a part of my learning procedure to know about it :)

        About what, what is "it"?