Optimization of DB_File sorting and processing code

Evanovich has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: optimize this code?
by dws (Chancellor) on Feb 08, 2002 at 22:45 UTC

I'm wondering if anyone can optimize these loops for me: each cycle takes about a second, ...

You need to separate out the time it's taking to open the file and do file I/O from the time it's taking to interate over the keys and build the matrix.

Do this: Comment out everything between the tie and the untie (missing, but worth adding for general cleanliness). If this doesn't have a significant impact on runtime, then your answer is "buy a faster disk" (or defrag the one you've got).

Next step is to comment out the code that's building the matrix (after doing each %hash). This retains the disk IO for reading keys, while separating out the overhead for your data structures. If this doesn't have a significant effect of overall runtime, answers range from "buy a faster disk" to "precache the keys in the db".

I really doubt this is due to your matrix building code.

[reply]
[d/l]
[select]

Re: Re: optimize this code?

by Evanovich (Scribe) on Feb 09, 2002 at 07:18 UTC

[reply]

Re: Re: Re: optimize this code?

by dws (Chancellor) on Feb 09, 2002 at 08:41 UTC

The slowest thing by far is the matrix building line

That surprises me. Since that's the case, you might consider an alternate matrix representation. Is it important that the column values be in order by key? Will there always be the same number of {key, value} pairs in each of the .db's?

If each .db has the same number of {key, value} pairs, you can build the array in a single large nFile x nKeys vector, which might let you can some creation time at the expense of later access time.

[reply]

Re: Re: Re: Re: optimize this code?

by Evanovich (Scribe) on Feb 09, 2002 at 14:34 UTC

Re: Re: Re: optimize this code?

by dws (Chancellor) on Feb 09, 2002 at 19:43 UTC

$matrix[$j]->[$i]

%matrix

I can't believe that building the data structures is taking that large a percentage of the total time, even for that many cells. This simple test

  my @array;

  my $t = time();
  for my $i ( 0 .. 499 ) {
    for my $j ( 0 .. 6264 ) {
        $array[$i]->[$j] = 47;
    }
  }
  print "Elapsed: ", time() - $t, " seconds\n";
[download]

If you're unable to use a different DBM representation (such as DB_TREE), then you might try pulling keys and values out of the DBM in whatever order it prefers to give them to you, and then sort them by key in memory. By pulling out the keys and sorting before going after the values, you might be trashing around a bit in the file.

[reply]
[d/l]
[select]

Re: Re: Re: Re: optimize this code?

by Evanovich (Scribe) on Feb 11, 2002 at 17:40 UTC

Re: optimize this code?
by japhy (Canon) on Feb 08, 2002 at 22:43 UTC

my @matrix;

while (my $file = glob "*.db") {
  tie my %hash, "DB_File", $file, O_RDONLY, 0644
    or warn("Could not read $file: $!") and next;

  # presize the array the first time
  @matrix = map [], 1 .. keys %hash
    unless @matrix;

  my $i = 0;

  for (sort keys %hash) {
    push @{ $matrix[$i++] }, $hash{$_};
  }
}
[download]

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

[reply]
[d/l]

Re: optimize this code?
by shotgunefx (Parson) on Feb 08, 2002 at 22:34 UTC

[reply]

Re: Re: optimize this code?

by crazyinsomniac (Prior) on Feb 09, 2002 at 14:18 UTC

And now, my favorite activity besides programming, quoting from the pod:

DB_BTREE

The btree format allows arbitrary key/value pairs to be stored in a sorted, balanced binary tree. As with the DB_HASH format, it is possible to provide a user defined Perl routine to perform the comparison of keys. By default, though, the keys are stored in lexical order.

update
as for building the matrix, see what japhy said (it's got to do with how perl grows the array).

The BTree will definetly help, but I'm not sure how noticably ... you might wanna look at the DB_Fille pod for an alternate method of accessing the key/value pairs (like the seq method), or use the BerkleyDB module.

Question: Evanovich, how are you benchmarking this?

Suggestion: If you, with the help of the monks, don't figure out how to optimize this sufficiently, you might wanna consider a C/C++ solution. BerkleyDB has a C/C++ interface, and your code isn't too complex/long ... please don't hurt me ;)(/duck)

______crazyinsomniac_____________________________
Of all the things I've lost, I miss my mind the most.
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

[reply]
[d/l]

Re: Re: Re: optimize this code?

by Evanovich (Scribe) on Feb 09, 2002 at 16:52 UTC

Well, what I don't understand Crazyinsomniac, is why this should take so long, to build the matrix I mean. Accessing the data is lightning fast, it's just that looping through the hash that is rough. Question: if I made my data into single entry vector tables, and then looped through the vector and loaded it into the matrix, maybe that would help? I don't want to go to C, because this code is just the beginning of much more complicated stuff, and I need perl's structures I think.

[reply]

Re: Re: Re: optimize this code?

by Evanovich (Scribe) on Feb 09, 2002 at 14:46 UTC

I commented out the sort, and it doesn't really affect the loop time. Will this help the time of building the matrix, which seems to be taking up the bulk of each loop?

[reply]

Re: optimize this code?
by hossman (Prior) on Feb 08, 2002 at 22:46 UTC

Optimizing ends to require that the optimizer knows what the goal is, otherwise you runthe risk of optimizing away desired side effects.

[reply]