in reply to Fuzzy Searching: Optimizing Algorithm Selection
If you XOR two strings together,
print join'',unpack 'C*', 'AGGCGGAAGCACCCAACAGCAACAG' ^ 'ACACGCAAAAACCGAAGAGAAAGCG'; 0460040062000400400200420
The resultant string will be the null character (chr( 0 )) in each position where the two strings match and some other char where they did not.
If you count the non-null bytes, it gives you a score for the fuzziness of the match.
print $_ = grep $_,unpack 'C*', 'AGGCGGAAGCACCCAACAGCAACAG' ^ 'ACACGCAAAAACCGAAGAGAAAGCG'; 10
To avoid having to process the large file of sequences multiple times, it is better to process all of the 25-ers against each sequence in turn. By using substr to step down the sequence, you can perform the XOR against each of the 25-ers for each position of the sequence.
Putting that all together, I got:
#! perl -slw use strict; use bytes; $| = 1; our $FUZZY ||= 10; open FUZ, '< :raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!"; my %fuz; $fuz{ $_ } = '' while chomp( $_ = <FUZ> ); close FUZ; print "Loaded ${ \scalar keys %fuz } 25-ers"; open SEQ, '< :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!"; while( my $seq = <SEQ> ) { chomp $seq; for my $offset ( 0 .. length( $seq ) - 25 ) { printf "\rProcessing sequence %5d offset %05d", $., $offset; for my $fuz ( keys %fuz ) { my $m = grep $_, unpack 'C*', $fuz ^ substr( $seq, $offse +t, 25 ); if( $m <= $FUZZY ) { ## This stores the lineno/offset/fuzziness where each +25-er ## matched in a compact form for further process; sor +ting etc. $fuz{ $fuz } .= pack 'nnn', $., $offset, $m; ## Or just print out the data to a file. # print "\nMatched '$fuz' -v- '", # substr( $seq, $offset, 25 ), # "' in line: $. @ $offset with fuzziness: ", $m; } } } } close SEQ;
This process is currently running 500,000 25-ers against a file 30,000 x ave. 1000 characters (500 - 1500) sequences (all randomly generated), at the approximate rate of 3.5 seconds per offset, whilst recording the lineno/offset/fuzziness for all matches where the fuzziness is 10 or less. The advantage may be that this process will give you an accurate fuzziness factor for every 25-er matched against every position in a single pass.
I don't have any feel for how this compares to using the regex engine to perform the matching, but I think it may run quicker.
If you could indicate how many 100s of 1000s of 25's you are playing with, plus the average length of the 30,000 sequences, it would be possible to calculate an approximate time for the process that you could then compare with other approaches?
|
|---|