in reply to Re: optimize this code?
in thread Optimization of DB_File sorting and processing code

Hi dws-- I commented out individual snippets of this code to time them. The slowest thing by far is the matrix-building line, $matrix[$j]->[$i] = $hash{$key}; Looping through the hash is also somewhat slow. Loading the data is very quick. So: does anyone know a faster way to load values into a large matrix?

Re: Re: Re: optimize this code?
by dws (Chancellor) on Feb 09, 2002 at 08:41 UTC
    The slowest thing by far is the matrix building line

    That surprises me. Since that's the case, you might consider an alternate matrix representation. Is it important that the column values be in order by key? Will there always be the same number of {key, value} pairs in each of the .db's?

    If each .db has the same number of {key, value} pairs, you can build the array as a single large vector of nFile x nKeys elements, which might save you some creation time at the expense of later access time.
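
A minimal sketch of the flat-vector idea described above, assuming every file contributes the same number of keys. The sizes, the `slot` helper, and the placeholder cell values are all made up for illustration; they are not from the original code. Instead of one array reference per row, every cell lives in one array and the position is computed from the file index and the key index.

```perl
use strict;
use warnings;

# Hypothetical sizes for illustration; the thread mentions 6265 keys per file.
my $n_files = 3;
my $n_keys  = 4;

# One flat array instead of an array of array references.
my @flat;
$#flat = $n_files * $n_keys - 1;    # pre-extend the array once

# Map a (file index, key index) pair to a position in the flat array.
sub slot {
    my ($file, $key) = @_;
    return $file * $n_keys + $key;
}

# Fill every cell with a placeholder value.
for my $file ( 0 .. $n_files - 1 ) {
    for my $key ( 0 .. $n_keys - 1 ) {
        $flat[ slot($file, $key) ] = "f${file}k${key}";
    }
}

print $flat[ slot(2, 1) ], "\n";    # prints "f2k1"
```

The trade-off is the one mentioned above: creation avoids allocating one inner array per row, but every later access pays for the index arithmetic.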

      Yes, each hash has exactly 6265 entries. The matrix will always have that many rows: the number of columns will change though depending on the situation. Can you elaborate on this alternate matrix representation? I don't know what that means. Thanks so much for your help!
Re: Re: Re: optimize this code?
by dws (Chancellor) on Feb 09, 2002 at 19:43 UTC
    After sleeping on this, I realized the obvious, which is that each $matrix[$j]->[$i] represents potential file I/O, since %matrix is tied to a DBM. Duh.

    I can't believe that building the data structures is taking that large a percentage of the total time, even for that many cells. This simple test

    my @array;
    my $t = time();
    for my $i ( 0 .. 499 ) {
        for my $j ( 0 .. 6264 ) {
            $array[$i]->[$j] = 47;
        }
    }
    print "Elapsed: ", time() - $t, " seconds\n";
    ran in 17 seconds on my 400MHz laptop, so it has to be the disk I/O that's killing you.

    If you're unable to use a different DBM representation (such as DB_BTREE, which keeps keys in sorted order), then you might try pulling keys and values out of the DBM in whatever order it prefers to give them to you, and then sorting them by key in memory. By pulling out the keys and sorting them before going after the values, you might be thrashing around a bit in the file.

      Hi dws--I did your test on my computer, and you're right, it only took eight seconds. Would you mind elaborating a bit about how to pull keys and values out of the DBM? Maybe a little bit of example code would make it easier to understand.... Thanks so much, Evan
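
A rough sketch of what "pull everything out in native order, then sort in memory" could look like. To keep it runnable anywhere, a plain hash stands in for the DBM here; the sample keys and values are invented, and the names `%copy` and `@column` are illustrative only. With DB_File, `%hash` would be tied to the file first, as hinted in the comment.

```perl
use strict;
use warnings;

# Stand-in data. With DB_File this would instead be a tied hash, e.g.
#   tie my %hash, 'DB_File', $filename, O_RDONLY, 0644, $DB_HASH
#       or die "Cannot open $filename: $!";
my %hash = ( gamma => 3, alpha => 1, beta => 2 );

# Pull key/value pairs in whatever order the hash hands them out.
# On a tied DBM this is one sequential pass through the file.
my %copy;
while ( my ( $key, $value ) = each %hash ) {
    $copy{$key} = $value;
}

# Sort in memory; from here on nothing touches the DBM again.
my @column = map { $copy{$_} } sort keys %copy;

print "@column\n";    # prints "1 2 3"
```

The point of `each` is that keys and values arrive together, so there is no second round of random lookups into the file after sorting.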