While it would definitely speed things up to cut out the DB lookups and cache those properties, overall it sounds like the real problem is the algorithm, looked at from a high level. (Which isn't what you're asking about at all, but maybe you aren't seeing the forest for the trees.)

If the only reason you select the elements at random is so that you can get a subset of the whole, why can't you sort the 70k item array by these property things somehow and check pairs in a sequence that is more likely to create pairs above your threshold?

It's difficult to be specific, because your description of the problem doesn't say what these properties are, how many of them there are, etc. But a real cheapo, potentially effective approach might be something along these lines:

Iterate over your property-pair score hash and build a second hash, 'property score', keyed by single property. If a property-pair score is above your threshold, add it to the score of each of the two properties in that pair. If it's below, subtract it from them.

This is very arbitrary and only vaguely sensible, but the idea is that doing almost -anything- is likely to be better than random. You might want to apply tunable coefficients that affect the weight of the add and subtract operations independently.
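
Here is a rough sketch of that step, written in Python for brevity (the dicts map straight onto Perl hashes). Everything about the data layout is an assumption on my part: `pair_scores` keyed by a two-property tuple, the `threshold` value, and the `add_weight` / `sub_weight` names for the tunable coefficients are all made up for illustration. I'm also reading "add that" as adding the pair's score itself; adding a flat 1 per pair would be just as defensible.

```python
from collections import defaultdict


def property_scores(pair_scores, threshold, add_weight=1.0, sub_weight=1.0):
    """Collapse property-pair scores into one score per property.

    pair_scores: dict mapping (prop_a, prop_b) -> score (assumed layout).
    add_weight / sub_weight: hypothetical tunable coefficients.
    """
    scores = defaultdict(float)
    for (prop_a, prop_b), score in pair_scores.items():
        # Pairs above the threshold reward both properties;
        # everything else (including exact ties) penalizes them.
        if score > threshold:
            delta = add_weight * score
        else:
            delta = -sub_weight * score
        scores[prop_a] += delta
        scores[prop_b] += delta
    return dict(scores)


# Made-up data, purely for illustration.
pair_scores = {
    ("red", "round"): 0.9,
    ("red", "square"): 0.2,
    ("blue", "round"): 0.7,
}
print(property_scores(pair_scores, threshold=0.5))
```

Raising `sub_weight` relative to `add_weight` punishes properties that turn up in weak pairs more heavily, and vice versa. Which way to lean depends entirely on the data.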

Then iterate over your 70k elements, assigning each one a score that is the sum of its properties' scores. Sort by that score, and iterate highest scores first.
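
The scoring and sorting pass could look something like the following, again in Python and again under assumptions: `element_props` (element id to list of property names) and `prop_scores` (property name to score) are hypothetical structures standing in for whatever you actually have.

```python
def rank_elements(element_props, prop_scores):
    """Return element ids ordered by summed property score, highest first.

    element_props: dict mapping element id -> iterable of property names (assumed).
    prop_scores:   dict mapping property name -> score (assumed).
    """
    def total(element):
        # Properties with no recorded score contribute nothing.
        return sum(prop_scores.get(prop, 0.0) for prop in element_props[element])

    return sorted(element_props, key=total, reverse=True)


# Made-up data, purely for illustration.
prop_scores = {"red": 0.7, "round": 1.6, "square": -0.2}
element_props = {
    "e1": ["red", "square"],
    "e2": ["red", "round"],
    "e3": ["square"],
}
print(rank_elements(element_props, prop_scores))
```

From there you would check pairs drawn from the front of the ranked list instead of sampling the whole 70k at random.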

Since the whole point isn't to provide the absolute best scores, you can tweak and weight the various scores in whatever way you choose depending on what this data is exactly.

This is all based on a potentially wrong assumption: that properties which show up more often in pairs above the threshold, and less often in pairs below it, make the elements that have them more likely to form pairs that pass your criteria. For all I know, the property-pairs balance each other perfectly, your property scores end up being 0 across the board, and this approach has zero effect. This is the sort of thing where knowledge of the data is key.

In reply to Out on a limb.. by vhold
in thread Benchmarking A DB-Intensive Script by bernanke01
