in reply to Re: Again Fuzzy regex !!!!
in thread Again Fuzzy regex !!!!

Dear BrowserUk, thank you very much for your concern.

1- The faster XOR code could return the matches for one 18-letter key against 30274277 with 4 mismatches in 10 seconds, but the C code could handle two 18-letter keys against the same data in 5 seconds. The faster XOR code was this:

#! perl -slw
use strict;
use bytes;

our $FUZZY ||= 4;

open KEYS, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
my @keys = <KEYS>;
close KEYS;
chomp @keys;

warn "Loaded ${ \scalar @keys } keys";

my $seq;
my $seqnam;
readseqfile();

my( $masked, $pos );
my $totalLen = 0;
my $count = 0;

my $seqLen = length $seq;
$totalLen += $seqLen;

for my $key ( @keys ) {
    my $keyLen   = length $key;
    my $mask     = $key x ( int( $seqLen / $keyLen ) + 1 );
    my $maskLen  = length $mask;
    my $minZeros = chr( 0 ) x int( $keyLen / ( $FUZZY + 1 ) );
    my $minZlen  = length $minZeros;

    for my $offset1 ( 0 .. $keyLen-1 ) {
        $masked = $mask ^ substr( $seq, $offset1, $maskLen );
        $pos = 0;
        while( $pos = 1+index $masked, $minZeros, $pos ) {
            $pos--;
            my $offset2 = $pos - ($pos % $keyLen );
            last unless $offset1 + $offset2 + $keyLen <= $seqLen;
            my $fuz = $keyLen - ( substr( $masked, $offset2, $keyLen ) =~ tr[\0][\0] );
            if( $fuz <= $FUZZY ) {
                #printf "\tFuzzy matched key:'$key' -v- '%s' in line:"
                #    . "%2d @ %6d (%6d+%6d) with fuzziness: %d\n",
                #    substr( $seq, $offset1 + $offset2, $keyLen ),
                #    $., $offset1 + $offset2, $offset1, $offset2, $fuz;
            }
            $pos = $offset2 + $keyLen;
        }
    }
}

warn "\n\nProcessed $. sequences";
warn "Average length: ", $totalLen / $.;

sub readseqfile {
    open( SEQ, "<$ARGV[1]" );
    while (<SEQ>) {
        chomp();
        if (/>(\S+)/) {
            $seqnam = $1;
            $seq = "";
        }
        else {
            $seq .= $_;
        }
    }
    close SEQ;
}

2- How big: the maximum data is the wheat genome, 12 gigabytes, split across multiple files of about 30 megabytes each.

3- How short: from 10 to 25 letters, and the mismatches are about 25% of the length.

4- As mentioned, the faster XOR code above took 10 seconds just to report the matches with 4 mismatched letters for an 18-letter key against 30274277.

5- "What about it couldn't you handle":

A- About the XOR code: yes, I could handle it, but I noticed that when the code slices the genome (the "target") for comparison, this line takes too long, about 1 second:
$masked = $mask ^ substr( $seq, $offset1, $maskLen );
B- About the C code: I only want to know exactly what method it uses, whether there are modules like it in Perl, and whether I can take just the fuzzy-search part (library/code) of that code and integrate it into my own code as an external part. Why is his code faster? Yes, I looked at everyone who asked the same question as me, but is this really all Perl can offer, half the speed of C, or is there more?

Re^3: Again Fuzzy regex !!!!
by BrowserUk (Patriarch) on May 23, 2015 at 19:38 UTC
    The faster XOR code could return the matches for one 18-letter key against 30274277 with 4 mismatches in 10 seconds, but the C code could handle two 18-letter keys against the same data in 5 seconds

    Yes. That is about as good as you will get from pure Perl code. C will usually be faster.

    There is no point working out what algorithm the C code is using, because if you reimplemented it in Perl it would be much slower.

    Perl is very poor at handling strings on a byte-by-byte basis -- you need to call a function (substr) to get at each and every byte; whereas C need only increment an address register.
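
    To make the per-byte cost concrete, here is a minimal sketch of the naive approach in Perl. The helper name mismatches_bytewise and the sample strings are made up for illustration; this is not code from the thread. It fetches every character with its own substr call, which is exactly the overhead described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count the mismatches between
# two equal-length strings by fetching one character at a time.
sub mismatches_bytewise {
    my( $x, $y ) = @_;
    die "strings must be the same length" unless length( $x ) == length( $y );

    my $diff = 0;
    for my $i ( 0 .. length( $x ) - 1 ) {
        # Two substr calls per position: a function call for every byte.
        $diff++ if substr( $x, $i, 1 ) ne substr( $y, $i, 1 );
    }
    return $diff;
}

print mismatches_bytewise( 'ACGTACGT', 'ACGAACGA' ), "\n";
```

    In C the same loop would be a pointer walk over both buffers with no call per byte, which is where much of the speed difference is likely to come from.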

    My XOR code that you've reposted above plays to Perl's strengths by using single op-codes on the long strings to perform the majority of the processing; but in the end you need to call substr many times to extract the matches and substr is very slow.
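
    The core of the XOR idea can be sketched in a few lines. This is a simplified illustration, not the posted program: the helper name aligned_mismatches and the sample data are assumptions, and it only checks windows that start at multiples of the key length. One string XOR turns every matching byte into "\0", and tr then counts those zeros in each window.

```perl
use strict;
use warnings;

# Hypothetical helper (simplified from the technique in the posted code):
# return the mismatch count for each key-length window of $seq that starts
# at a multiple of the key length.
sub aligned_mismatches {
    my( $key, $seq ) = @_;
    my $keyLen = length $key;
    my $seqLen = length $seq;

    # Repeat the key until it covers the sequence, then trim to the same length.
    my $mask = substr( $key x ( int( $seqLen / $keyLen ) + 1 ), 0, $seqLen );

    # A single string XOR over the whole sequence; equal bytes become "\0".
    my $masked = $mask ^ $seq;

    my @fuz;
    for( my $o = 0; $o + $keyLen <= $seqLen; $o += $keyLen ) {
        # tr/\0// counts the NUL bytes without modifying the string.
        push @fuz, $keyLen - ( substr( $masked, $o, $keyLen ) =~ tr/\0// );
    }
    return @fuz;
}

print join( ' ', aligned_mismatches( 'ACGT', 'ACGTACGAAAGT' ) ), "\n";
```

    The posted code repeats this for every starting offset from 0 to the key length minus 1 so that all alignments are covered, and uses index to skip ahead to runs of zeros. The substr calls that remain inside the loop are the part that stays slow.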

    If you need to stick to Perl, what you have is about as good as it is likely to get. If you need faster, then you'll have to bite the bullet and learn C.

