Dear BrowserUk, thank you very much for your concern.

1- The faster XOR version could return the matches for one 18-letter key against 30274277 letters with 4 mismatches in 10 seconds, but the C code could do two 18-letter keys against the same data in 5 seconds. The faster XOR version was this:

    #! perl -slw
    use strict;
    use bytes;

    our $FUZZY ||= 4;    # maximum number of mismatches allowed

    open KEYS, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
    my @keys = <KEYS>;
    close KEYS;
    chomp @keys;

    warn "Loaded ${ \scalar @keys } keys";

    my $seq;
    my $seqnam;
    my $numSeqs = 0;    # counted in readseqfile(), because close resets $. to 0
    readseqfile();

    my( $masked, $pos );
    my $totalLen = 0;
    my $count = 0;
    my $seqLen = length $seq;
    $totalLen += $seqLen;

    for my $key ( @keys ) {
        my $keyLen   = length $key;
        # Repeat the key so it covers the whole sequence
        my $mask     = $key x ( int( $seqLen / $keyLen ) + 1 );
        my $maskLen  = length $mask;
        # A match with at most $FUZZY mismatches must contain a run of
        # at least this many zero bytes after the XOR
        my $minZeros = chr( 0 ) x int( $keyLen / ( $FUZZY + 1 ) );
        my $minZlen  = length $minZeros;

        for my $offset1 ( 0 .. $keyLen-1 ) {
            $masked = $mask ^ substr( $seq, $offset1, $maskLen );
            $pos = 0;
            while( $pos = 1+index $masked, $minZeros, $pos ) {
                $pos--;
                my $offset2 = $pos - ($pos % $keyLen );
                last unless $offset1 + $offset2 + $keyLen <= $seqLen;
                # Mismatches = key length minus the number of zero bytes
                my $fuz = $keyLen - ( substr( $masked, $offset2, $keyLen ) =~ tr[\0][\0] );
                if( $fuz <= $FUZZY ) {
                    #printf "\tFuzzy matched key:'$key' -v- '%s' in line:"
                    #    . "%2d @ %6d (%6d+%6d) with fuzziness: %d\n",
                    #    substr( $seq, $offset1 + $offset2, $keyLen ),
                    #    $., $offset1 + $offset2, $offset1, $offset2, $fuz;
                }
                $pos = $offset2 + $keyLen;
            }
        }
    }

    warn "\n\nProcessed $numSeqs sequences";
    warn "Average length: ", $totalLen / $numSeqs if $numSeqs;

    sub readseqfile {
        open( SEQ, '<', $ARGV[ 1 ] ) or die "$ARGV[ 1 ] : $!";
        while (<SEQ>) {
            chomp();
            if (/>(\S+)/) {
                # New header line: start a fresh sequence
                $seqnam = $1;
                $seq = "";
                $numSeqs++;
            }
            else {
                $seq .= $_;
            }
        }
        close SEQ;
    }

2- How big: the maximum data is the wheat genome, 12 gigabytes; each file contains about 30 megabytes of data.

3- How short: from 10 to 25 letters, and the number of mismatches is about 25% of the length.

4- As mentioned, the faster XOR version above took 10 seconds just to report the matches with 4 mismatched letters for one 18-letter key against 30274277 letters.

5- "What about it couldn't you handle?":

A- About XOR: yes, I could handle it, but what I noticed is that when the code slices the genome ("target") for comparison it takes too long, about 1 second, at this line:

    $masked = $mask ^ substr( $seq, $offset1, $maskLen );

B- About the C code: I only want to know exactly what method he used, whether there are modules like it in Perl, and whether I can take just the fuzzy-search part (library/code) of that code and integrate it into my code as an external part. Why is his code faster? Yes, I looked at everyone who asked the same question as I did, but is this really all Perl can offer, half the speed of C, or is there more?
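
For reference, the core XOR idea used by the script in this post can be shown on a tiny made-up example. This is only a sketch: the helper name mismatches() and the sample data are hypothetical and are not part of the script or of the C code. XORing two equal-length strings gives a zero byte wherever the letters agree, so counting the non-zero bytes gives the number of mismatches.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): count the mismatches
# between $key and the window of the same length starting at $offset in $seq.
sub mismatches {
    my( $key, $seq, $offset ) = @_;
    my $window = substr( $seq, $offset, length $key );
    # String XOR yields "\0" wherever the two bytes are equal
    my $x = $key ^ $window;
    # tr with /c counts every byte that is NOT "\0", i.e. the mismatches
    return ( $x =~ tr/\0//c );
}

# Made-up data, for illustration only
my $seq   = 'ACGTACGTTTGACCA';
my $key   = 'ACGTTAGA';
my $fuzzy = 2;    # maximum mismatches allowed

my @hits;
for my $offset ( 0 .. length( $seq ) - length( $key ) ) {
    my $n = mismatches( $key, $seq, $offset );
    push @hits, [ $offset, $n ] if $n <= $fuzzy;
}

printf "offset %d, mismatches %d\n", @$_ for @hits;
```

This checks one window per offset, which is simple but slow on a large genome; the script above instead XORs the whole sequence against a repeated key in one operation and only inspects the positions where enough zero bytes occur in a row.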
In reply to Re^2: Again Fuzzy regex !!!!
by samman
in thread Again Fuzzy regex !!!!
by samman