Re^2: Fuzzy Searching: Optimizing Algorithm Selection

At 3.5 seconds per offset it will complete in roughly 3 years ;-) Nonetheless XOR works quickly in Perl. A small optimisation is to take the substr outside the inner keys loop as you are doing it 499,999 more times than required per offset.

Real sample data would be nice. The benefits from indexing depend on the alphabet size and the number of find strings.

cheers

tachyon

Comment on Re^2: Fuzzy Searching: Optimizing Algorithm Selection

Replies are listed 'Best First'.
Re^3: Fuzzy Searching: Optimizing Algorithm Selection by BrowserUk (Patriarch) on Nov 11, 2004 at 03:57 UTC
Update: Indeed. The two optimisations reduce the 500,000 comparisons time from ~3.5 seconds to + .5 of a second. That reduces the projected overall runtime of my test scenario from 3+ years to under half a year. Worth doing :) Yes. I performed the same calculation based upon my decision to use 500,000 25-ers. Hence my decision to ask for clarification. There are several ways to speed up the processing. Using `( substr( $seq, $offset, 25 ) ^ $25er ) =~ tr[\0][\0];` [download] to do the counting, rather than `grep $_, unpack 'C*, ...` [download] is (probably) another. I'm just waiting for a limited benchmark I have running to complete before trying several things. I had thought that by avoiding actually copying each 25 char string out of the sequence I might save some time/memory, but now you've pointed it out, I realise that I can create an LVALUE ref to the substring and avoid 500,000 calls to substr for each inner loop. Thanks. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re^3: Fuzzy Searching: Optimizing Algorithm Selection
by BrowserUk (Patriarch) on Nov 11, 2004 at 03:57 UTC

Update: Indeed. The two optimisations reduce the 500,000 comparisons time from ~3.5 seconds to + .5 of a second. That reduces the projected overall runtime of my test scenario from 3+ years to under half a year. Worth doing :)

Yes. I performed the same calculation based upon my decision to use 500,000 25-ers. Hence my decision to ask for clarification.

There are several ways to speed up the processing. Using

( substr( $seq, $offset, 25 ) ^ $25er ) =~ tr[\0][\0];
[download]

to do the counting, rather than

grep $_, unpack 'C*, ...
[download]

is (probably) another.

I'm just waiting for a limited benchmark I have running to complete before trying several things.

I had thought that by avoiding actually copying each 25 char string out of the sequence I might save some time/memory, but now you've pointed it out, I realise that I can create an LVALUE ref to the substring and avoid 500,000 calls to substr for each inner loop. Thanks.

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

[reply]
[d/l]
[select]