in reply to Re^2: Benchmarking A DB-Intensive Script
in thread Benchmarking A DB-Intensive Script

I don't think thundergnat is suggesting you shouldn't select random items, just that you shouldn't keep drawing from the same, undiminished pool every time.

If you select random keys from a hash and then delete the keys you pick, you progressively reduce your options and can never hit a repeat. It doesn't seem important for your use case to allow re-selection of previously examined items, so remove them from the pool of future options. Otherwise it will become harder and take longer to find good pairs, rather than quicker.
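
A minimal sketch of that delete-as-you-go idea in Perl; the candidate list and the printed "work" are stand-ins for illustration, not anything from the script under discussion:

    use strict;
    use warnings;

    # Dummy candidates; in the real script these would come from the DB.
    my @candidates = ('a' .. 'j');

    # Pool of keys still eligible for selection.
    my %pool = map { $_ => 1 } @candidates;

    while (my $left = keys %pool) {
        my @keys = keys %pool;
        my $pick = $keys[ int rand $left ];   # uniform pick from what remains

        print "examining $pick\n";            # stand-in for the real work

        delete $pool{$pick};                  # gone for good, so no repeats
    }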


Re^4: Benchmarking A DB-Intensive Script
by tilly (Archbishop) on Mar 14, 2006 at 22:29 UTC
    And I think that thundergnat is wrong.

    With the random strategy, the odds of repeating a selection grow from 0% to about 4% as you go. So you're going to need to pull about 102 million pairs by the time you're done (slightly more than that, but below 104 million). That's roughly 102 million hash lookups, each accompanied by a handful of fast operations.

    The alternate strategy involves doing 2.45 billion hash operations just to build the data structure. That preprocessing step alone is far more work than the roughly 2 million redos you're hoping to save with a better algorithm.

    Random-then-redo is just fine for his needs. It would only be worth revisiting that plan if he were going to cover a large fraction of his search space.
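
    For concreteness, here is a minimal sketch of that random-then-redo strategy; the item list, the target count, and the pair-key scheme are made up for illustration, not taken from the script under discussion:

        use strict;
        use warnings;

        my @items  = ('a' .. 'z');   # stand-in for the real item set
        my $wanted = 50;             # stand-in for the ~100 million pairs above
        my %seen;
        my $done = 0;

        while ($done < $wanted) {
            my $x = $items[ int rand @items ];
            my $y = $items[ int rand @items ];
            next if $x eq $y;                   # same item twice: just redraw

            my $key = join "\0", sort $x, $y;   # order-independent pair key
            next if $seen{$key}++;              # already done: the cheap redo

            # ... do the real per-pair work here ...
            $done++;
        }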

      I misread "check that this pair has not yet been processed," and agree that pre-processing all pairs is not a good approach- I had thought you could remove individual pairs.

      But if he did take that approach, he would be able to say "my script will finish," which you can't do if known-invalid pairs stay in the pool. A truly random draw could, in principle, sit there pulling the same candidates over and over... but then, Eris needs her toys.

      And just because one of my friends recently said he'd never seen it posted on a forum, I feel obliged to throw this out here too: I was wrong. :)
Re^4: Benchmarking A DB-Intensive Script
by bernanke01 (Beadle) on Mar 14, 2006 at 21:45 UTC

    Ahh, I think I see it now. So the idea would be to store all possible pairs (e.g. ~2.5 billion) in a hash, then delete each pair after it's been selected? That would be pretty memory-intensive at the beginning, but peter out by the end.

    Given the testing density (100 million out of 2450 million = ~4%), perhaps the slowdown at the end won't be too considerable -- only about one duplicate hit for every 24 new ones.
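
    A rough sketch of that store-everything-then-delete approach, with tiny made-up data, just to make the up-front enumeration cost concrete:

        use strict;
        use warnings;

        my @items  = ('a' .. 'e');   # tiny stand-in; the real set yields ~2.5 billion pairs
        my $wanted = 5;              # stand-in for the ~100 million draws

        # Enumerate every unordered pair up front -- the expensive,
        # memory-hungry preprocessing step objected to above.
        my %pairs;
        for my $i (0 .. $#items - 1) {
            for my $j ($i + 1 .. $#items) {
                $pairs{ join "\0", $items[$i], $items[$j] } = 1;
            }
        }

        # Draw pairs at random, deleting each as it is used so it can never
        # repeat. (Copying the full key list on every pass is itself far too
        # slow at real scale; this is illustration only.)
        my $picked = 0;
        while ($picked < $wanted and keys %pairs) {
            my @keys = keys %pairs;
            my $pick = $keys[ int rand @keys ];
            # ... process the pair here ...
            delete $pairs{$pick};
            $picked++;
        }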

      No. Just store the individuals, tossing ones that will never become part of a pair as you store...

      Ah. I misread "check that this pair has not yet been processed" to mean each item was only a candidate for an initial pairing. I now see that, as I put it, you would have to store pairs... and that's way more work than necessary unless you need to test many more pairs than this.

      *goes to sleep*