in reply to Benchmarking A DB-Intensive Script

70k items is not that large for an array. Depending on how many properties are attached, you could probably hold the whole thing in memory at once. It would be worth trying, since that will likely speed up your processing by orders of magnitude. Filter out any element with no properties as you load the array.
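
A minimal sketch of that load step, assuming DBI; the DSN, credentials, table and column names here (item_properties, item_id, property) are placeholders for whatever the real schema is:

use strict;
use warnings;
use DBI;

# Placeholder DSN and credentials; swap in the real connection details.
my $dbh = DBI->connect( 'dbi:mysql:database=mydb', 'user', 'password',
    { RaiseError => 1 } );

my %props;    # item id => array ref of that item's properties
my $sth = $dbh->prepare('SELECT item_id, property FROM item_properties');
$sth->execute;

while ( my ( $id, $prop ) = $sth->fetchrow_array ) {
    push @{ $props{$id} }, $prop;
}

# Items with no property rows never get a key, so the "no properties"
# filtering happens for free; everything else is now in memory.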

Don't just select pairs at random and then check to see if they've already been done. That gets very inefficient once you've checked half of the array. Iterate over it instead and examine each pair only once, selecting the elements with a pair of nested loops.
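
For the exhaustive version, something along these lines; @items and process_pair() are stand-ins for whatever the real data and comparison look like:

use strict;
use warnings;

my @items = ( 1 .. 70_000 );    # stand-in for the 70k surviving elements

sub process_pair { }            # placeholder for the real comparison

# Visit each unordered pair exactly once; no repeat checking needed.
for my $i ( 0 .. $#items - 1 ) {
    for my $j ( $i + 1 .. $#items ) {
        process_pair( $items[$i], $items[$j] );
    }
}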

Re^2: Benchmarking A DB-Intensive Script
by bernanke01 (Beadle) on Mar 14, 2006 at 20:06 UTC

    Hi thundergnat,

    Right, but I am looking for pairs from the 70k array, so that leaves me 70k x 70k / 2 = ~2.45 billion pairs, of which I probably only need about 100 million (about 4%), so I really do need to pick them randomly.

    But you're definitely right that holding the property-lists in memory is the obvious place to start tuning. Just wanted to figure out first if that's truly where the script is spending all its time!

      I don't think thundergnat is suggesting you shouldn't select items at random, just that you shouldn't keep drawing from the same unchanging pool each time.

      If you select random keys from a hash and then delete the keys you pick, you progressively shrink your options and never hit a repeat. It doesn't seem important to your use case that previously examined items stay eligible for re-selection, so remove them from the pool of future options. Otherwise finding good pairs will get harder and slower as you go, instead of quicker.
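
      A minimal illustration of that no-repeat idea, shown here for single items (pairs would need a key per pair); List::Util's shuffle gives the same effect as pick-then-delete without rebuilding the key list on every pick:

      use strict;
      use warnings;
      use List::Util 'shuffle';

      my %pool;
      $pool{$_} = 1 for 1 .. 70_000;    # toy stand-in for the item ids

      # Walk the ids in random order; each one can only ever come up once.
      for my $id ( shuffle keys %pool ) {
          delete $pool{$id};    # shrink the pool, mirroring the delete-as-you-go idea
          # ... examine $id here ...
      }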

        And I think that thundergnat is wrong.

        With the random strategy, the odds of repeating a selection grow from 0% to about 4% over the course of the run. So you're going to need to pull about 102 million pairs by the time you're done (slightly more than that, but under 104 million). That's roughly 102 million hash lookups, each following a fast random pick.

        The alternate strategy involves doing 2.45 billion hash insertions just to set up. That preprocessing step alone costs more than the roughly 2 million redos you're hoping to save with a better algorithm.

        Random-then-redo is just fine for his needs. It would only be worth revisiting that plan if he were going to cover a large fraction of his search space.
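
        A rough sketch of that random-then-redo loop, with a %seen hash catching the occasional repeat; process_pair is a placeholder, and the counts are just the ones from this thread:

        use strict;
        use warnings;

        my $n_items = 70_000;
        my $target  = 100_000_000;    # distinct pairs wanted

        my %seen;                     # "i,j" => 1 for pairs already handled
        my $done = 0;

        sub process_pair { }          # placeholder for the real work

        while ( $done < $target ) {
            my $i = int rand $n_items;
            my $j = int rand $n_items;
            next if $i == $j;

            ( $i, $j ) = ( $j, $i ) if $i > $j;    # treat (a,b) and (b,a) as the same pair
            next if $seen{"$i,$j"}++;              # redo on a repeat; rare at ~4% density

            process_pair( $i, $j );
            $done++;
        }

        Note that %seen ends up holding about 100 million keys by the end, so it is itself a sizeable chunk of memory; that's inherent to checking for repeats this way.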

        Ahh, I think I see it now. So the idea would be to store all possible pairs (about 2.45 billion of them) in a hash, then delete each pair after it's been selected? That would be pretty memory-intensive at the beginning, but would peter out by the end.

        Given the testing density (100 million out of 2,450 million, or ~4%), perhaps the slowdown at the end won't be too considerable: only about one duplicate hit for every 24 new pairs.