in reply to Optimization of DB_File sorting and processing code

I'm wondering if anyone can optimize these loops for me: each cycle takes about a second, ...

You need to separate out the time it's taking to open the file and do file I/O from the time it's taking to iterate over the keys and build the matrix.

Do this: Comment out everything between the tie and the untie (the untie is missing from your code, but worth adding for general cleanliness). If this doesn't have a significant impact on runtime, then your answer is "buy a faster disk" (or defrag the one you've got).

Next step is to comment out just the code that's building the matrix, leaving the each %hash loop in place. This retains the disk I/O for reading keys, while separating out the overhead of your data structures. If this doesn't have a significant effect on overall runtime, answers range from "buy a faster disk" to "precache the keys in the db".
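The original loop isn't quoted in this reply, so here's a hypothetical skeleton of what the two experiments look like (the file list, hash, and matrix names are all assumptions, not the poster's actual code):

```perl
use strict;
use warnings;
use Fcntl;
use DB_File;

my @db_files = glob('*.db');   # hypothetical list of input files
my @matrix;

my $j = 0;
for my $file (@db_files) {
    my %hash;
    tie %hash, 'DB_File', $file, O_RDONLY, 0644, $DB_HASH
        or die "can't open $file: $!";

    # Experiment 1: comment out this entire while loop. If runtime
    # barely changes, the cost is in opening the files and raw I/O.
    my $i = 0;
    while ( my ($key, $value) = each %hash ) {
        # Experiment 2: comment out only the assignment below, keeping
        # the each() loop. That retains the disk I/O for reading keys
        # while removing the data-structure overhead.
        $matrix[$j]->[$i] = $value;
        $i++;
    }

    untie %hash;   # the untie the reply suggests adding
    $j++;
}
```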

I really doubt this is due to your matrix building code.

Replies are listed 'Best First'.
Re: Re: optimize this code?
by Evanovich (Scribe) on Feb 09, 2002 at 07:18 UTC
    Hi dws-- I commented out individual snippets of this code. The slowest thing by far is the matrix building line, $matrix[$j]->[$i] = $hash{$key};. Also somewhat slow is the looping through the hash. The loading of the data is quick. So: does anyone know a faster way to load values into a large matrix?
      The slowest thing by far is the matrix building line

      That surprises me. Since that's the case, you might consider an alternate matrix representation. Is it important that the column values be in order by key? Will there always be the same number of {key, value} pairs in each of the .db's?

      If each .db has the same number of {key, value} pairs, you can store the matrix as a single large nFiles x nKeys vector, which might let you save some creation time at the expense of later access time.
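      A sketch of that flat representation (variable names and counts other than the 6265 from the thread are hypothetical): cell ($j, $i) lives at offset $j * $nKeys + $i in one preallocated vector, so there is one array to grow instead of one arrayref per row.

```perl
use strict;
use warnings;

my $nKeys  = 6265;   # entries per .db, per the thread
my $nFiles = 500;    # hypothetical number of files

my @flat;
$#flat = $nFiles * $nKeys - 1;   # preallocate the whole vector once

# Store: cell ($j, $i) goes at a computed offset.
sub set_cell {
    my ($j, $i, $value) = @_;
    $flat[ $j * $nKeys + $i ] = $value;
}

# Fetch pays the same index arithmetic on every access.
sub get_cell {
    my ($j, $i) = @_;
    return $flat[ $j * $nKeys + $i ];
}

set_cell(3, 10, 47);
print get_cell(3, 10), "\n";   # prints 47
```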

        Yes, each hash has exactly 6265 entries. The matrix will always have that many rows: the number of columns will change though depending on the situation. Can you elaborate on this alternate matrix representation? I don't know what that means. Thanks so much for your help!
      After sleeping on this, I realized the obvious, which is that each $matrix[$j]->[$i] represents potential file I/O, since the matrix is tied to a DBM. Duh.

      I can't believe that building the data structures is taking that large a percentage of the total time, even for that many cells. This simple test

      my @array;
      my $t = time();
      for my $i ( 0 .. 499 ) {
          for my $j ( 0 .. 6264 ) {
              $array[$i]->[$j] = 47;
          }
      }
      print "Elapsed: ", time() - $t, " seconds\n";
      ran in 17 seconds on my 400MHz laptop, so it has to be the disk I/O that's killing you.

      If you're unable to use a different DBM representation (such as DB_BTREE, which keeps keys in sorted order), then you might try pulling keys and values out of the DBM in whatever order it prefers to give them to you, and then sorting them by key in memory. Pulling out the keys and sorting them before going after the values may be causing you to thrash around in the file.

        Hi dws-- I did your test on my computer, and you're right, it only took eight seconds. Would you mind elaborating a bit on how to pull keys and values out of the DBM? Maybe a little bit of example code would make it easier to understand.... Thanks so much, Evan
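        A minimal sketch of what dws is suggesting (the file name is a placeholder, and this assumes a DB_File-tied hash like the one in the original code): walk the file with each(), which streams through the DBM in its native order rather than seeking per key, then do the sort on the in-memory copy where comparisons are cheap.

```perl
use strict;
use warnings;
use Fcntl;
use DB_File;

my %hash;
tie %hash, 'DB_File', 'example.db', O_RDONLY, 0644, $DB_HASH   # hypothetical file
    or die "can't open example.db: $!";

# Pull everything out in whatever order the DBM prefers.
my %in_memory;
while ( my ($key, $value) = each %hash ) {
    $in_memory{$key} = $value;
}
untie %hash;

# Sort by key in memory, then fill the matrix column in key order.
my $i = 0;
for my $key ( sort keys %in_memory ) {
    # e.g. $matrix[$j]->[$i] = $in_memory{$key};
    $i++;
}
```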