If you XOR two strings together,

print join'',unpack 'C*', 'AGGCGGAAGCACCCAACAGCAACAG' ^ 'ACACGCAAAAACCGAAGAGAAAGCG'; 0460040062000400400200420

The resultant string will be the null character (chr( 0 )) in each position where the two strings match and some other char where they did not.

If you count the non-null bytes, it gives you a score for the fuzziness of the match.

print $_ = grep $_,unpack 'C*', 'AGGCGGAAGCACCCAACAGCAACAG' ^ 'ACACGCAAAAACCGAAGAGAAAGCG'; 10

To avoid having to process the large file of sequences multiple times, it is better to process all of the 25-ers against each sequence in turn. By using substr to step down the sequence, you can perform the XOR against each of the 25-ers for each position of the sequence.

Putting that all together, I got:

#! perl -slw use strict; use bytes; $| = 1; our $FUZZY ||= 10; open FUZ, '< :raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!"; my %fuz; $fuz{ $_ } = '' while chomp( $_ = <FUZ> ); close FUZ; print "Loaded ${ \scalar keys %fuz } 25-ers"; open SEQ, '< :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!"; while( my $seq = <SEQ> ) { chomp $seq; for my $offset ( 0 .. length( $seq ) - 25 ) { printf "\rProcessing sequence %5d offset %05d", $., $offset; for my $fuz ( keys %fuz ) { my $m = grep $_, unpack 'C*', $fuz ^ substr( $seq, $offse +t, 25 ); if( $m <= $FUZZY ) { ## This stores the lineno/offset/fuzziness where each +25-er ## matched in a compact form for further process; sor +ting etc. $fuz{ $fuz } .= pack 'nnn', $., $offset, $m; ## Or just print out the data to a file. # print "\nMatched '$fuz' -v- '", # substr( $seq, $offset, 25 ), # "' in line: $. @ $offset with fuzziness: ", $m; } } } } close SEQ;

This process is currently running 500,000 25-ers against a file 30,000 x ave. 1000 characters (500 - 1500) sequences (all randomly generated), at the approximate rate of 3.5 seconds per offset, whilst recording the lineno/offset/fuzziness for all matches where the fuzziness is 10 or less. The advantage may be that this process will give you an accurate fuzziness factor for every 25-er matched against every position in a single pass.

I don't have any feel for how this compares to using the regex engine to perform the matching, but I think it may run quicker.

If you could indicate how many 100s of 1000s of 25's you are playing with, plus the average length of the 30,000 sequences, it would be possible to calculate an approximate time for the process that you could then compare with other approaches?


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

In reply to Re: Fuzzy Searching: Optimizing Algorithm Selection by BrowserUk
in thread Fuzzy Searching: Optimizing Algorithm Selection by Itatsumaki

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.