Re: Again Fuzzy regex !!!!

If you really want help with this, then you are going to have to do a lot more than simply throw a gob-load of C code -- that itself has a gob-load of unreferenced & unsupplied dependencies -- at us and expect a conversion service.

The first thing I'd want to know is:

"but it was slow"

How slow?

Please give comprehensive and accurate numbers. Including:

How big is/are the "long string"(s)?
And how many of them are there to be processed? (Ie. Does a given run consist of all the shorts against a single long or multiple longs?)
How many " short string ( up to 25 letter )"(s) are you comparing against each long string?
What's the minimum length?
And how fuzzy? Ie How many out of a 25 base string have to match?
If the short is shorter than 25, say 15, how many have to match?
How long did your XOR version take to run?
Be precise. How many shorts? How long was the long?
How long did it take? If it was single-tasking, how many elapsed seconds, and on what processor?
If it was multi-tasking, how many concurrent threads; what processor? How many total cycles/cpu seconds?
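
One way to collect the numbers asked for above is a small timing harness. This is only a sketch, not code from the thread: `time_run` is a hypothetical helper name, and the workload is a placeholder (an exact-match regex over a synthetic string) to be replaced by the real search. It relies only on the core `Time::HiRes` module and the built-in `times`.

```perl
use strict;
use warnings;
use Time::HiRes qw( time );   # sub-second wall-clock time

# Hypothetical helper: run a code ref once and return the elapsed
# wall-clock seconds and the CPU seconds (user + system) it used.
sub time_run {
    my( $code ) = @_;
    my( $user0, $sys0 ) = times;
    my $wall0 = time;
    $code->();
    my $wall = time - $wall0;
    my( $user1, $sys1 ) = times;
    return ( $wall, ( $user1 - $user0 ) + ( $sys1 - $sys0 ) );
}

# Placeholder data: substitute the real long string and short keys.
my $long   = 'ACGT' x 250_000;            # 1,000,000-letter stand-in
my @shorts = ( 'ACGTACGTACGTACGTAC' );    # one 18-letter key

my( $wall, $cpu ) = time_run( sub {
    my $hits = 0;
    for my $key ( @shorts ) {
        # Placeholder workload: exact matches only, not a fuzzy search.
        $hits++ while $long =~ /\Q$key\E/g;
    }
} );

printf "%d shorts vs one %d-letter long: %.3f s elapsed, %.3f s CPU\n",
    scalar @shorts, length $long, $wall, $cpu;
```

Reporting the key count, the long-string length, and both elapsed and CPU seconds in one line makes runs on different machines easier to compare.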

"I found this code which has been writen was in c but I couldn't handle it"

What about it couldn't you handle?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this

In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Re^2: Again Fuzzy regex !!!! by samman (Initiate) on May 23, 2015 at 10:32 UTC
Dear BrowserUk, thank you very much for your concern.

1. The faster XOR version returned the matches for one 18-letter key against 30274277 letters, with 4 mismatches, in 10 seconds, but the C code handled two 18-letter keys against the same data in 5 seconds. The faster XOR version was this:

    #! perl -slw
    use strict;
    use bytes;

    our $FUZZY ||= 4;

    open KEYS, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
    my @keys = <KEYS>;
    close KEYS;
    chomp @keys;

    warn "Loaded ${ \scalar @keys } keys";

    my $seq;
    my $seqnam;
    readseqfile();

    my( $masked, $pos );
    my $totalLen = 0;
    my $count = 0;
    my $seqLen = length $seq;
    $totalLen += $seqLen;

    for my $key ( @keys ) {
        my $keyLen = length $key;
        my $mask = $key x ( int( $seqLen / $keyLen ) + 1 );
        my $maskLen = length $mask;
        my $minZeros = chr( 0 ) x int( $keyLen / ( $FUZZY + 1 ) );
        my $minZlen = length $minZeros;

        for my $offset1 ( 0 .. $keyLen-1 ) {
            $masked = $mask ^ substr( $seq, $offset1, $maskLen );
            $pos = 0;
            while( $pos = 1+index $masked, $minZeros, $pos ) {
                $pos--;
                my $offset2 = $pos - ($pos % $keyLen );
                last unless $offset1 + $offset2 + $keyLen <= $seqLen;
                my $fuz = $keyLen - ( substr( $masked, $offset2, $keyLen ) =~ tr[\0][\0] );
                if( $fuz <= $FUZZY ) {
                    #printf "\tFuzzy matched key:'$key' -v- '%s' in line:"
                    #    . "%2d @ %6d (%6d+%6d) with fuzziness: %d\n",
                    #    substr( $seq, $offset1 + $offset2, $keyLen ),
                    #    $., $offset1 + $offset2, $offset1, $offset2, $fuz;
                }
                $pos = $offset2 + $keyLen;
            }
        }
    }

    warn "\n\nProcessed $. sequences";
    warn "Average length: ", $totalLen / $.;

    sub readseqfile {
        open( SEQ, "<$ARGV[1]" );
        while (<SEQ>) {
            chomp();
            if (/>(\S+)/) {
                $seqnam = $1;
                $seq = "";
            }
            else {
                $seq .= $_;
            }
        }
        close SEQ;
    }

2. How big: the largest data set is the wheat genome, 12 gigabytes. It is split over multiple files, each containing about 30 megabytes of data.

3. How short: from 10 to 25 letters, and the allowed mismatches are about 25% of the length.

4. As mentioned, the faster XOR version above took 10 seconds just to report the matches with 4 mismatched letters for one 18-letter key against 30274277 letters.

5. "What about it couldn't you handle?"

A. About the XOR version: yes, I could handle it, but I noticed that when the code slices the genome (the "target") for comparison, it takes too long, about 1 second: `$masked = $mask ^ substr( $seq, $offset1, $maskLen );`

B. About the C code: I only want to know exactly what method it uses, whether there are modules like it in Perl, and whether I can take only the fuzzy-search part (library/code) of that code and integrate it into my own code as an external part. Why is that code faster?

Yes, I looked at all the others who asked the same question as me, but is this really all Perl can offer, half the speed of C, or is there more?
Re^3: Again Fuzzy regex !!!! by BrowserUk (Patriarch) on May 23, 2015 at 19:38 UTC
"the faster XOr could handle to return matches for one 18 letter against 30274277 with 4 missmatches in 10 seconds but the c code could do for two 18 letters against the same data in 5 seconds"

Yes. That is about as good as you will get from pure Perl code. C will usually be faster.

There is no point working out what algorithm the C code is using, because if you reimplemented it in Perl it would be much slower. Perl is very poor at handling strings on a byte-by-byte basis -- you need to call a function (substr) to get at each and every byte, whereas C need only increment an address register.

My XOR code that you've reposted above plays to Perl's strengths by using single op-codes on the long strings to perform the majority of the processing; but in the end you need to call substr many times to extract the matches, and substr is very slow.

If you need to stick to Perl, what you have is about as good as it is likely to get. If you need faster, then you'll have to bite the bullet and learn C.
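
The contrast between per-byte substr calls and whole-string op-codes can be illustrated with a small sketch. This is not the code from the thread: `mismatches_substr` and `mismatches_xor` are hypothetical names, and both assume two plain byte strings of equal length with Perl's default string semantics for `^` (no `use feature 'bitwise'`).

```perl
use strict;
use warnings;

# Count mismatches one character at a time: two substr calls per
# position, which is the byte-by-byte pattern described as slow.
sub mismatches_substr {
    my( $s, $t ) = @_;
    my $n = 0;
    for my $i ( 0 .. length( $s ) - 1 ) {
        $n++ if substr( $s, $i, 1 ) ne substr( $t, $i, 1 );
    }
    return $n;
}

# Count mismatches with whole-string operations: string XOR yields
# "\0" wherever the inputs agree, and tr counts those bytes in one op.
sub mismatches_xor {
    my( $s, $t ) = @_;
    my $x = $s ^ $t;
    return length( $s ) - ( $x =~ tr/\0// );
}

my $key    = 'ACGTACGTAC';
my $window = 'ACGAACGTTC';    # differs from $key at offsets 3 and 8

print mismatches_substr( $key, $window ), "\n";    # prints 2
print mismatches_xor( $key, $window ), "\n";       # prints 2
```

Both subs return the same count; the second does the work in two op-codes regardless of string length. The core Benchmark module's `cmpthese` is one way to compare the two on realistic data.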