Dear BrowserUk,
Thank you very much for your concern. To answer your questions:
1- The faster XOR code can return the matches for one 18-letter key against 30274277 letters with 4 mismatches in 10 seconds, but the C code can do two 18-letter keys against the same data in 5 seconds. The faster XOR code was this:
#! perl -slw
use strict;
use bytes;
our $FUZZY ||= 4;
open KEYS, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
my @keys = <KEYS>;
close KEYS;
chomp @keys;
warn "Loaded ${ \scalar @keys } keys";
my $seq ;
my $seqnam;
readseqfile();
my( $masked, $pos );
my $totalLen = 0;
my $count = 0;
my $seqLen = length $seq;
$totalLen += $seqLen;
for my $key ( @keys ) {
    my $keyLen = length $key;
    my $mask = $key x ( int( $seqLen / $keyLen ) + 1 );
    my $maskLen = length $mask;
    my $minZeros = chr( 0 ) x int( $keyLen / ( $FUZZY + 1 ) );
    my $minZlen = length $minZeros;
    for my $offset1 ( 0 .. $keyLen-1 ) {
        $masked = $mask ^ substr( $seq, $offset1, $maskLen );
        $pos = 0;
        while(
            $pos = 1+index $masked, $minZeros, $pos
        ) {
            $pos--;
            my $offset2 = $pos - ($pos % $keyLen );
            last unless $offset1 + $offset2 + $keyLen <= $seqLen;
            my $fuz = $keyLen
                - ( substr( $masked, $offset2, $keyLen ) =~ tr[\0][\0] );
            if( $fuz <= $FUZZY ) {
                #printf "\tFuzzy matched key:'$key' -v- '%s' in line:"
                #    . "%2d @ %6d (%6d+%6d) with fuzziness: %d\n",
                #    substr( $seq, $offset1 + $offset2, $keyLen ),
                #    $., $offset1 + $offset2, $offset1, $offset2, $fuz;
            }
            $pos = $offset2 + $keyLen;
        }
    }
}
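# close() resets $., so use $count for the one sequence that readseqfile() kept.
$count = 1;
warn "\n\nProcessed $count sequences";
warn "Average length: ", $totalLen / $count;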
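# Read the FASTA file named in $ARGV[1]; only the last sequence in it is kept.
sub readseqfile {
    open( SEQ, '<', $ARGV[ 1 ] ) or die "$ARGV[ 1 ] : $!";
    while (<SEQ>) {
        chomp();
        if (/>(\S+)/) {
            # Header line: remember the name and start a new sequence.
            $seqnam = $1;
            $seq = "";
        }
        else {
            $seq .= $_;
        }
    }
    close SEQ;
}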
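To make clear what I understand the code above to be doing, here is a minimal sketch of its two ideas: XOR-ing two equal-length strings gives a "\0" byte wherever they agree, and a hit with at most $FUZZY mismatches must contain a run of at least int( $keyLen / ( $FUZZY + 1 ) ) matching letters. The names mismatches and min_run and the sample strings are made up for illustration only; they are not part of the original code.

```perl
#! perl -w
use strict;

# Count the positions at which two equal-length strings differ.
# XOR of two identical bytes is "\0", so counting "\0" bytes counts the matches.
sub mismatches {
    my( $s1, $s2 ) = @_;
    die "length mismatch" unless length( $s1 ) == length( $s2 );
    my $x = $s1 ^ $s2;
    return length( $x ) - ( $x =~ tr[\0][\0] );
}

# Length of the run of consecutive matching letters that the code searches for:
# $fuzzy mismatches cut the key into at most $fuzzy + 1 matching pieces,
# so at least one piece has this many letters.
sub min_run {
    my( $keyLen, $fuzzy ) = @_;
    return int( $keyLen / ( $fuzzy + 1 ) );
}

my $key    = 'ACGTACGTACGTACGTAC';    # 18 letters, made-up example data
my $window = 'ACGTACCTACGTAAGTAC';    # the same key with 2 letters changed

print mismatches( $key, $window ), "\n";    # 2
print min_run( length( $key ), 4 ), "\n";   # 3
```

So for an 18-letter key with 4 mismatches, the code first looks for 3 consecutive "\0" bytes with index and only then counts the mismatches of the whole 18-letter window.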
2- How big: the largest data set is the wheat genome, 12 gigabytes; it is spread over multiple files of about 30 megabytes each.
3- How short: the keys are from 10 to 25 letters long, and the number of mismatches is about 25% of the key length.
4- As mentioned, the faster XOR code above took 10 seconds just to report the matches with 4 mismatched letters for one 18-letter key against 30274277 letters.
5- "What about it couldn't you handle?":
A- About XOR: yes, I could handle it, but I noticed that the step where the code slices the genome ("target") for comparison takes too long, about 1 second:
$masked = $mask ^ substr( $seq, $offset1, $maskLen );
B- About the C code: I only want to know exactly which method he used, whether there are modules like it in Perl,
and whether I can take only the fuzzy-search part (library/code) of that code and integrate it into my own code as an external part.
Why is his code faster?
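For point 5-A, this is roughly how that one statement can be timed on its own. It is only a sketch: the sub name time_xor_passes and the random 1 MB sequence are made up as a stand-in for the real data, so the numbers it prints will not be the ones I quoted above.

```perl
use strict;
use warnings;
use Time::HiRes qw( time );

# Time only the slice-and-XOR statement from the loop over $offset1.
# Returns the number of passes and the elapsed wall-clock seconds.
sub time_xor_passes {
    my( $seq, $key ) = @_;
    my $keyLen  = length $key;
    my $mask    = $key x ( int( length( $seq ) / $keyLen ) + 1 );
    my $maskLen = length $mask;

    my $start = time;
    for my $offset1 ( 0 .. $keyLen - 1 ) {
        # Same statement as in the real code; the result is discarded here.
        my $masked = $mask ^ substr( $seq, $offset1, $maskLen );
    }
    return ( $keyLen, time - $start );
}

# Made-up stand-in for one genome file: 1 MB of random letters.
my $seq = join '', map { ( qw( A C G T ) )[ rand 4 ] } 1 .. 1_000_000;
my $key = 'ACGTACGTACGTACGTAC';

my( $passes, $elapsed ) = time_xor_passes( $seq, $key );
printf "%d XOR passes over %d bytes took %.3f s\n",
    $passes, length( $seq ), $elapsed;
```

Each pass copies a slice of the target and then builds a second string of the same size for the XOR result, which I assume is where most of that second goes.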
# Yes, I looked at everyone who asked the same question as me, but is this really all that Perl can offer, half the speed of C, or is there more?