samman has asked for the wisdom of the Perl Monks concerning the following question:

Help Me perl Monks !!!!!! I know that you have asked this question a million times before but I didn't find an answer full my needs > I want to fuzzy search a short string ( up to 25 letter ) against a long string (very very long ) with miss match >> Yes I tried Xor :

my $str = $Genome; my $match = "ATCGACTACGACTCGACT"; my $len = length($match); for my $i ( 0 .. length($str) - $len ) { my $s = substr $str, $i, $len; my $x = $s ^ $match; $x =~ tr/\0//d; # equal characters xor to \0 print "'$s' at offset $i matched\t".length($x)."\n" unles +s length($x) > 5; }

but it was slow but I found this code which has been writen was in c but I couldn't handle it , to extract what he did and rewrite it in perl, I want to do the same like what he did but in perl >>> as it was faster , could you please tell me how to do like it in perl !!!!!!!

/* @source primersearch application ** ** Searches a set of primer pairs against a set of DNA sequences in bo +th ** forward and reverse sense. ** Modification of fuzznuc. ** @author Copyright (C) Val Curwen (vac@sanger.ac.uk) ** @@ ** ** This program is free software; you can redistribute it and/or ** modify it under the terms of the GNU General Public License ** as published by the Free Software Foundation; either version 2 ** of the License, or (at your option) any later version. ** ** This program is distributed in the hope that it will be useful, ** but WITHOUT ANY WARRANTY; without even the implied warranty of ** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ** GNU General Public License for more details. ** ** You should have received a copy of the GNU General Public License ** along with this program; if not, write to the Free Software ** Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-13 +07, USA. ********************************************************************** +********/ #include "emboss.h" #include <stdlib.h> #include <string.h> #include <stdio.h> #define AJA2 50 /* @datastatic PGuts ************************************************* +********* ** ** the internals of a primer; each Primer has two of these, ** one forward and one reverse ** ** @alias primerguts ** ** @attr patstr [AjPStr] Undocumented ** @attr origpat [AjPStr] Undocumented ** @attr type [ajuint] Undocumented ** @attr len [ajuint] Undocumented ** @attr real_len [ajuint] Undocumented ** @attr amino [AjBool] Undocumented ** @attr carboxyl [AjBool] Undocumented ** @attr mm [ajuint] Undocumented ** @attr buf [ajint*] Undocumented ** @attr sotable [ajuint*] Undocumented ** @attr off [EmbOPatBYPNode[AJALPHA]] Undocumented ** @attr re [AjPStr] Undocumented ** @attr skipm [ajuint**] Undocumented ** @attr tidy [const void*] Undocumented ** @attr solimit [ajuint] Undocumented ** @attr Padding [char[4]] Padding to alignment boundary ********************************************************************** +********/ typedef struct primerguts { AjPStr patstr; AjPStr origpat; ajuint type; ajuint len; ajuint real_len; AjBool amino; AjBool carboxyl; ajuint mm; ajint* buf; ajuint* sotable; EmbOPatBYPNode off[AJALPHA]; AjPStr re; ajuint** skipm; const void* tidy; ajuint solimit; char Padding[4]; } *PGuts; /* @datastatic PHit ************************************************** +********* ** ** holds details of a hit against a sequence ie this primer will ampli +fy ** ** @alias primerhit ** ** @attr seqname [AjPStr] Undocumented ** @attr desc [AjPStr] Undocumented ** @attr acc [AjPStr] Undocumented ** @attr forward [AjPStr] pattern that hits forward strand ** @attr reverse [AjPStr] pattern that hits reverse strand ** @attr forward_pos [ajuint] Undocumented ** @attr reverse_pos [ajuint] Undocumented ** @attr amplen [ajuint] Undocumented ** @attr forward_mismatch [ajuint] Undocumented ** @attr reverse_mismatch [ajuint] Undocumented ** @attr Padding [char[4]] Padding to alignment boundary ********************************************************************** +********/ typedef struct primerhit { AjPStr seqname; AjPStr desc; AjPStr acc; AjPStr forward; AjPStr reverse; ajuint forward_pos; ajuint reverse_pos; ajuint amplen; ajuint forward_mismatch; ajuint reverse_mismatch; char Padding[4]; } *PHit; /* @datastatic Primer ************************************************ +********* ** ** primer pairs will be read into a list of these structs ** ** @alias primers ** ** @attr Name [AjPStr] Undocumented ** @attr forward [PGuts] Undocumented ** @attr reverse [PGuts] Undocumented ** @attr hitlist [AjPList] Undocumented ********************************************************************** +********/ typedef struct primers { AjPStr Name; PGuts forward; PGuts reverse; AjPList hitlist; } *Primer; /* "constructors" */ static void primersearch_initialise_pguts(PGuts* primer); /* "destructors" */ static void primersearch_free_pguts(PGuts* primer); static void primersearch_free_primer(void** x, void* cl); static void primersearch_clean_hitlist(AjPList* hlist); /* utilities */ static void primersearch_read_primers(AjPList* primerList, AjPFile pri +merFile, ajint mmp); static AjBool primersearch_classify_and_compile(Primer* primdata); static void primersearch_primer_search(const AjPList primerList, const AjPSeq seq); static void primersearch_scan_seq(const Primer primdata, const AjPSeq seq, AjBool reverse); static void primersearch_store_hits(const Primer primdata, AjPList fhi +ts_list, AjPList rhits_list, const AjPSeq seq, AjBool reverse); static void primersearch_print_hits(const AjPList primerList, AjPFile +outf); /* @prog primersearch ************************************************ +********* ** ** Searches DNA sequences for matches with primer pairs ** ********************************************************************** +********/ int main(int argc, char **argv) { AjPSeqall seqall; AjPSeq seq = NULL; AjPFile primerFile; /* read the primer pairs from a file +*/ AjPFile outf; AjPList primerList; ajint mmp = 0; embInit("primersearch", argc, argv); seqall = ajAcdGetSeqall("seqall"); outf = ajAcdGetOutfile("outfile"); primerFile = ajAcdGetInfile("infile"); mmp = ajAcdGetInt("mismatchpercent"); /* build list of forward/reverse primer pairs as read from primerf +ile */ primerList = ajListNew(); /* read in primers from primerfile, classify and compile them */ primersearch_read_primers(&primerList,primerFile, mmp); /* check there are primers to be searched */ if(!ajListGetLength(primerList)) { ajErr("No suitable primers found - exiting"); embExitBad(); return 0; } /* query sequences one by one */ while(ajSeqallNext(seqall,&seq)) primersearch_primer_search(primerList, seq); /* output the results */ primersearch_print_hits(primerList, outf); /* delete all nodes of list, then the list itself */ ajListMap(primerList, primersearch_free_primer, NULL); ajListFree(&primerList); ajListFree(&primerList); ajFileClose(&outf); ajSeqallDel(&seqall); ajSeqDel(&seq); ajFileClose(&primerFile); embExit(); return 0; } /* "constructors" */ /* @funcstatic primersearch_initialise_pguts ************************* +********* ** ** Initialise primer guts ** ** @param [w] primer [PGuts*] primer guts ** @@ ********************************************************************** +********/ static void primersearch_initialise_pguts(PGuts* primer) { AJNEW(*primer); (*primer)->patstr = NULL; (*primer)->origpat = ajStrNew(); (*primer)->type = 0; (*primer)->len = 0; (*primer)->real_len = 0; (*primer)->re = NULL; (*primer)->amino = 0; (*primer)->carboxyl = 0; (*primer)->tidy = NULL; (*primer)->mm = 0; (*primer)->buf = NULL; (*primer)->sotable = NULL; (*primer)->solimit = 0; (*primer)->re = NULL; (*primer)->skipm = NULL; return; } /* "destructors" */ /* @funcstatic primersearch_free_pguts ******************************* +********* ** ** Frees up all the internal members of a PGuts struct ** ** @param [d] primer [PGuts*] Undocumented ** @return [void] ** @@ ********************************************************************** +********/ static void primersearch_free_pguts(PGuts* primer) { ajuint i = 0; ajStrDel(&(*primer)->patstr); ajStrDel(&(*primer)->origpat); ajStrDel(&(*primer)->re); if(((*primer)->type==1 || (*primer)->type==2) && ((*primer)->buf)) free((*primer)->buf); if(((*primer)->type==3 || (*primer)->type==4) && ((*primer)->sotab +le)) free((*primer)->sotable); if((*primer)->type==6) for(i=0;i<(*primer)->real_len;++i) AJFREE((*primer)->skipm[i]); AJFREE(*primer); return; } /* @funcstatic primersearch_free_primer ****************************** +********* ** ** frees up the internal members of a Primer ** ** @param [r] x [void**] Undocumented ** @param [r] cl [void*] Undocumented ** @@ ********************************************************************** +********/ static void primersearch_free_primer(void **x, void *cl) { Primer* p; Primer primdata; AjIList lIter; (void) cl; /* make it used */ p = (Primer*) x; primdata = *p; primersearch_free_pguts(&primdata->forward); primersearch_free_pguts(&primdata->reverse); ajStrDel(&primdata->Name); /* clean up hitlist */ lIter = ajListIterNewread(primdata->hitlist); while(!ajListIterDone(lIter)) { PHit phit = ajListIterGet(lIter); ajStrDel(&phit->forward); ajStrDel(&phit->reverse); ajStrDel(&phit->seqname); ajStrDel(&phit->acc); ajStrDel(&phit->desc); AJFREE(phit); } ajListFree(&primdata->hitlist); ajListFree(&primdata->hitlist); ajListIterDel(&lIter); AJFREE(primdata); return; } /* @funcstatic primersearch_clean_hitlist **************************** +********* ** ** Clean the hitlist ** ** @param [d] hlist [AjPList*] Undocumented ** @@ ********************************************************************** +********/ static void primersearch_clean_hitlist(AjPList* hlist) { AjIList lIter; lIter = ajListIterNewread(*hlist); while(!ajListIterDone(lIter)) { EmbPMatMatch fm = ajListIterGet(lIter); embMatMatchDel(&fm); } ajListFree(hlist); ajListFree(hlist); ajListIterDel(&lIter); return; } /* utilities */ /* @funcstatic primersearch_read_primers ***************************** +********* ** ** read primers in from primerfile, classify and compile the patterns ** ** @param [w] primerList [AjPList*] primer list ** @param [u] primerFile [AjPFile] Undocumented ** @param [r] mmp [ajint] Undocumented ** @@ ********************************************************************** +********/ static void primersearch_read_primers(AjPList *primerList, AjPFile pri +merFile, ajint mmp) { AjPStr rdline = NULL; AjPStrTok handle = NULL; ajint nprimers = 0; Primer primdata = NULL; while (ajReadlineTrim(primerFile, &rdline)) { primdata = NULL; if (ajStrGetCharFirst(rdline) == '#') continue; if (ajStrSuffixC(rdline, "..")) continue; AJNEW(primdata); primdata->Name = NULL; primersearch_initialise_pguts(&primdata->forward); primersearch_initialise_pguts(&primdata->reverse); primdata->hitlist = ajListNew(); handle = ajStrTokenNewC(rdline, " \t"); ajStrTokenNextParse(&handle, &primdata->Name); ajStrTokenNextParse(&handle, &primdata->forward->patstr); ajStrFmtUpper(&primdata->forward->patstr); ajStrTokenNextParse(&handle, &primdata->reverse->patstr); ajStrFmtUpper(&primdata->reverse->patstr); ajStrTokenDel(&handle); /* copy patterns for Henry Spencer code */ ajStrAssignC(&primdata->forward->origpat, ajStrGetPtr(primdata->forward->patstr)); ajStrAssignC(&primdata->reverse->origpat, ajStrGetPtr(primdata->reverse->patstr)); /* set the mismatch level */ primdata->forward->mm = (ajint) (ajStrGetLen(primdata->forward->patstr)*mmp)/100; primdata->reverse->mm = (ajint) (ajStrGetLen(primdata->reverse->patstr)*mmp)/100; if(primersearch_classify_and_compile(&primdata)) { ajListPushAppend(*primerList, primdata); nprimers++; } else /* there was something funny about the primer sequences */ { ajUser("Cannot use %s\n", ajStrGetPtr(primdata->Name)); primersearch_free_pguts(&primdata->forward); primersearch_free_pguts(&primdata->reverse); ajStrDel(&primdata->Name); ajListFree(&primdata->hitlist); ajListFree(&primdata->hitlist); AJFREE(primdata); } } ajStrDel(&rdline); return; } /* @funcstatic primersearch_classify_and_compile ********************* +********* ** ** determines pattern type and compiles it ** ** @param [w] primdata [Primer*] primer data ** @return [AjBool] true if useable primer ** @@ ********************************************************************** +********/ static AjBool primersearch_classify_and_compile(Primer* primdata) { /* forward primer */ if(!((*primdata)->forward->type = embPatGetType(((*primdata)->forward->origpat), &((*primdata)->forward->patstr), (*primdata)->forward->mm,0, &((*primdata)->forward->real_len), &((*primdata)->forward->amino), &((*primdata)->forward->carboxyl)))) ajFatal("Illegal pattern"); /* reverse primer */ if(!((*primdata)->reverse->type = embPatGetType(((*primdata)->reverse->origpat), &((*primdata)->reverse->patstr), (*primdata)->reverse->mm,0, &((*primdata)->reverse->real_len), &((*primdata)->reverse->amino), &((*primdata)->reverse->carboxyl)))) ajFatal("Illegal pattern"); embPatCompile((*primdata)->forward->type, (*primdata)->forward->patstr, &((*primdata)->forward->len), &((*primdata)->forward->buf), (*primdata)->forward->off, &((*primdata)->forward->sotable), &((*primdata)->forward->solimit), &((*primdata)->forward->real_len), &((*primdata)->forward->re), &((*primdata)->forward->skipm), (*primdata)->forward->mm ); embPatCompile((*primdata)->reverse->type, (*primdata)->reverse->patstr, &((*primdata)->reverse->len), &((*primdata)->reverse->buf), (*primdata)->reverse->off, &((*primdata)->reverse->sotable), &((*primdata)->reverse->solimit), &((*primdata)->reverse->real_len), &((*primdata)->reverse->re), &((*primdata)->reverse->skipm), (*primdata)->reverse->mm ); return AJTRUE; /* this is a useable primer */ } /* @funcstatic primersearch_primer_search **************************** +********* ** ** tests the primers in primdata against seq and writes results to out +file ** ** @param [r] primerList [const AjPList] primer list ** @param [r] seq [const AjPSeq] sequence ** @@ ********************************************************************** +********/ static void primersearch_primer_search(const AjPList primerList, const AjPSeq seq) { AjIList listIter; /* test each list node against this sequence */ listIter = ajListIterNewread(primerList); while(!ajListIterDone(listIter)) { Primer curr_primer = ajListIterGet(listIter); primersearch_scan_seq(curr_primer, seq, AJFALSE); primersearch_scan_seq(curr_primer, seq, AJTRUE); } ajListIterDel(&listIter); return; } /* @funcstatic primersearch_scan_seq ********************************* +********* ** ** scans the primer pairs against the sequence in either forward ** sense or reverse complemented ** works out amplimer length if the two primers both hit ** ** @param [r] primdata [const Primer] primer data ** @param [r] seq [const AjPSeq] sequence ** @param [r] reverse [AjBool] do reverse ** @@ ********************************************************************** +********/ static void primersearch_scan_seq(const Primer primdata, const AjPSeq seq, AjBool reverse) { AjPStr seqstr = NULL; AjPStr revstr = NULL; AjPStr seqname = NULL; ajuint fhits = 0; ajuint rhits = 0; AjPList fhits_list = NULL; AjPList rhits_list = NULL; /* initialise variables for search */ ajStrAssignC(&seqname,ajSeqGetNameC(seq)); ajStrAssignS(&seqstr, ajSeqGetSeqS(seq)); ajStrAssignS(&revstr, ajSeqGetSeqS(seq)); ajStrFmtUpper(&seqstr); ajStrFmtUpper(&revstr); ajSeqstrReverse(&revstr); fhits_list = ajListNew(); rhits_list = ajListNew(); if(!reverse) { /* test OligoA against forward sequence, and OligoB against revers +e */ embPatFuzzSearch(primdata->forward->type, ajSeqGetBegin(seq), primdata->forward->patstr, seqname, seqstr, fhits_list, primdata->forward->len, primdata->forward->mm, primdata->forward->amino, primdata->forward->carboxyl, primdata->forward->buf, primdata->forward->off, primdata->forward->sotable, primdata->forward->solimit, primdata->forward->re, primdata->forward->skipm, &fhits, primdata->forward->real_len, &(primdata->forward->tidy)); if(fhits) embPatFuzzSearch(primdata->reverse->type, ajSeqGetBegin(seq), primdata->reverse->patstr, seqname, revstr, rhits_list, primdata->reverse->len, primdata->reverse->mm, primdata->reverse->amino, primdata->reverse->carboxyl, primdata->reverse->buf, primdata->reverse->off, primdata->reverse->sotable, primdata->reverse->solimit, primdata->reverse->re, primdata->reverse->skipm, &rhits, primdata->reverse->real_len, &(primdata->reverse->tidy)); } else { /*test OligoB against forward sequence, and OligoA against reverse + */ embPatFuzzSearch(primdata->reverse->type, ajSeqGetBegin(seq), primdata->reverse->patstr, seqname, seqstr, fhits_list, primdata->reverse->len, primdata->reverse->mm, primdata->reverse->amino, primdata->reverse->carboxyl, primdata->reverse->buf, primdata->reverse->off, primdata->reverse->sotable, primdata->reverse->solimit, primdata->reverse->re, primdata->reverse->skipm, &fhits, primdata->reverse->real_len, &(primdata->reverse->tidy)); if(fhits) embPatFuzzSearch(primdata->forward->type, ajSeqGetBegin(seq), primdata->forward->patstr, seqname, revstr, rhits_list, primdata->forward->len, primdata->forward->mm, primdata->forward->amino, primdata->forward->carboxyl, primdata->forward->buf, primdata->forward->off, primdata->forward->sotable, primdata->forward->solimit, primdata->forward->re, primdata->forward->skipm, &rhits, primdata->forward->real_len, &(primdata->forward->tidy)); } if(fhits && rhits) /* get amplimer length(s) and write out the hit */ primersearch_store_hits(primdata, fhits_list, rhits_list, seq, reverse); /* tidy up */ primersearch_clean_hitlist(&fhits_list); primersearch_clean_hitlist(&rhits_list); ajStrDel(&seqstr); ajStrDel(&revstr); ajStrDel(&seqname); return; } /* @funcstatic primersearch_store_hits ******************************* +********* ** ** Store primer hits ** ** @param [r] primdata [const Primer] primer data ** @param [w] fhits [AjPList] forward hits ** @param [w] rhits [AjPList] reverse hits ** @param [r] seq [const AjPSeq] sequence ** @param [r] reverse [AjBool] do reverse ** @@ ********************************************************************** +********/ static void primersearch_store_hits(const Primer primdata, AjPList fhits, AjPList rhits, const AjPSeq seq, AjBool reverse) { ajint amplen = 0; AjIList fi; AjIList ri; PHit primerhit = NULL; fi = ajListIterNewread(fhits); while(!ajListIterDone(fi)) { EmbPMatMatch fm = NULL; EmbPMatMatch rm = NULL; amplen = 0; fm = ajListIterGet(fi); ri = ajListIterNewread(rhits); while(!ajListIterDone(ri)) { ajint seqlen = ajSeqGetLen(seq); ajint s = (fm->start); ajint e; rm = ajListIterGet(ri); e = (rm->start-1); amplen = seqlen-(s-1)-e; if (amplen > 0) /* no point making a hit if -ve length! +*/ { primerhit = NULL; AJNEW(primerhit); primerhit->desc=NULL; /* must be NULL for ajStrAss */ primerhit->seqname=NULL; /* must be NULL for ajStrAss */ primerhit->acc=NULL; primerhit->forward=NULL; primerhit->reverse=NULL; ajStrAssignC(&primerhit->seqname,ajSeqGetNameC(seq)); ajStrAssignS(&primerhit->desc, ajSeqGetDescS(seq)); ajStrAssignS(&primerhit->acc, ajSeqGetAccS(seq)); primerhit->forward_pos = fm->start; primerhit->reverse_pos = rm->start; primerhit->forward_mismatch = fm->mm; primerhit->reverse_mismatch = rm->mm; primerhit->amplen = amplen; if(!reverse) { ajStrAssignS(&primerhit->forward, primdata->forward->patstr); ajStrAssignS(&primerhit->reverse, primdata->reverse->patstr); } else { ajStrAssignS(&primerhit->forward, primdata->reverse->patstr); ajStrAssignS(&primerhit->reverse, primdata->forward->patstr); } ajListPushAppend(primdata->hitlist, primerhit); } } /* ** clean up rListIter here as it will be new'ed again next ** time through */ ajListIterDel(&ri); } ajListIterDel(&fi); return; } /* @funcstatic primersearch_print_hits ******************************* +********* ** ** Print primer hits ** ** @param [r] primerList [const AjPList] primer hits ** @param [w] outf [AjPFile] outfile ** @@ ********************************************************************** +********/ static void primersearch_print_hits(const AjPList primerList, AjPFile +outf) { /* iterate through list of hits */ AjIList lIter; ajint count = 1; lIter = ajListIterNewread(primerList); while(!ajListIterDone(lIter)) { Primer primer = ajListIterGet(lIter); AjIList hIter = ajListIterNewread(primer->hitlist); count = 1; ajFmtPrintF(outf, "\nPrimer name %s\n", ajStrGetPtr(primer->Name)) +; while(!ajListIterDone(hIter)) { PHit hit = ajListIterGet(hIter); ajFmtPrintF(outf, "Amplimer %d\n", count); ajFmtPrintF(outf, "\tSequence: %s %s \n\t%s\n", ajStrGetPtr(hit->seqname), ajStrGetPtr(hit->acc), ajStrGetPtr(hit->desc)); ajFmtPrintF(outf, "\t%s hits forward strand at %d with %d " "mismatches\n", ajStrGetPtr(hit->forward), hit->forward_pos, hit->forward_mismatch); ajFmtPrintF(outf, "\t%s hits reverse strand at [%d] with %d " "mismatches\n", ajStrGetPtr(hit->reverse), (hit->reverse_pos), (hit->reverse_mismatch)); ajFmtPrintF(outf, "\tAmplimer length: %d bp\n", hit->amplen); count++; } ajListIterDel(&hIter); } ajListIterDel(&lIter); return; }

Replies are listed 'Best First'.
Re: Again Fuzzy regex !!!!
by BrowserUk (Patriarch) on May 22, 2015 at 22:23 UTC

    If you really want help with this, then you are going to have to do a lot more than simply throw a gob-load of C code -- that itself has a gob-load of unreferenced & unsupplied dependencies -- at us and expect a conversion service.

    The first thing I'd want to know is:

    but it was slow

    How slow?

    Please give comprehensive and accurate numbers. Including:

    • How big is/are the "long string"(s)?

      And how many of them are there to be processed? (Ie. Does a given run consist of all the shorts against a single long or multiple longs?)

    • How many " short string ( up to 25 letter )"(s) are you comparing against each long string?

      What's the minimum length?

      And how fuzzy? Ie How many out of a 25 base string have to match?

      If the short is shorter than 25, say 15, how many have to match?

    • How long did your XOR version take to run?

      Be precise. How many shorts? How long was the long?

      How long did it take. If was single tasking, how many elapsed seconds, and on what processor?

      If it was multi-tasking, how many concurrent threads; what processor? How many total cycles/cpu seconds?

    I found this code which has been writen was in c but I couldn't handle it

    What about it couldn't you handle?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      Dear BrowserUk thank you very much for your concern : 1- the faster XOr could handle to return matches for one 18 letter against 30274277 with 4 missmatches in 10 seconds but the c code could do for two 18 letters against the same data in 5 seconds the faster Xor was this :

      #! perl -slw use strict; use bytes; our $FUZZY ||= 4; open KEYS, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!"; my @keys = <KEYS>; close KEYS; chomp @keys; warn "Loaded ${ \scalar @keys } keys"; my $seq ; my $seqnam; readseqfile(); my( $masked, $pos ); my $totalLen = 0; my $count = 0; my $seqLen = length $seq; $totalLen += $seqLen; for my $key ( @keys ) { my $keyLen = length $key; my $mask = $key x ( int( $seqLen / $keyLen ) + 1 ); my $maskLen = length $mask; my $minZeros = chr( 0 ) x int( $keyLen / ( $FUZZY + 1 ) ); my $minZlen = length $minZeros; for my $offset1 ( 0 .. $keyLen-1 ) { $masked = $mask ^ substr( $seq, $offset1, $maskLen ); $pos = 0; while( $pos = 1+index $masked, $minZeros, $pos ) { $pos--; my $offset2 = $pos - ($pos % $keyLen ); last unless $offset1 + $offset2 + $keyLen <= $seqLen; my $fuz = $keyLen - ( substr( $masked, $offset2, $keyLen ) =~ tr[\0] +[\0] ); if( $fuz <= $FUZZY ) { #printf "\tFuzzy matched key:'$key' -v- '%s' in li +ne:" # . "%2d @ %6d (%6d+%6d) with fuzziness: %d\n" +, # substr( $seq, $offset1 + $offset2, $keyLen ), # $., $offset1 + $offset2, $offset1, $offset2, +$fuz; } $pos = $offset2 + $keyLen; } } } warn "\n\nProcessed $. sequences"; warn "Average length: ", $totalLen / $.; sub readseqfile { open( SEQ, "<$ARGV[1]" ); while (<SEQ>) { chomp(); if (/>(\S+)/) { $seqnam = $1; $seq = ""; } else { $seq .= $_; } } close SEQ; }
      2- how big the maximum data is the genome of wheat 12 gigabyte every file contains about multiple 30 megabyte data. 3- how short from 10 letters to 25 letter and the missmatch is about 25% percent of the length. 4- as mention the faster Xor above took 10 seconds to just report matches with miss match 4 letters for 18 letter against 30274277. 5- " what about it couldn't you handle " : A- about Xor : yes I could but the things that I notice that when the code slice the genome "target" for comparison it took too much " about 1 second
      $masked = $mask ^ substr( $seq, $offset1, $maskLen );
      B- about the c code I only want to know what exactly the method he used and If there is modules like it in perl and if I can take only the part of fuzzysearch (library/code) in this code and integrate it in my code as an outside part. why his code is faster . # yes I looked in all how asked the same question like me but does really this is what perl can offer the half of the speed of c or ther is more
        the faster XOr could handle to return matches for one 18 letter against 30274277 with 4 missmatches in 10 seconds but the c code could do for two 18 letters against the same data in 5 seconds

        Yes. That is about as good as you will get from pure Perl code. C will usually be faster.

        There is no point working out what algorithm the C code is using because if you reimplemented it in Perl is would be much slower.

        Perl is very poor at handling strings on a byte-by-byte basis -- you need to call a function (substr) to get at each and every byte; whereas C only need increment an address register.

        My XOR code that you've reposted above plays to Perl's strengths by using single op-codes on the long strings to perform the majority of the processing; but in the end you need to call substr many times to extract the matches and substr is very slow.

        If you need to stick to Perl, what you have is about as good as it is likely to get. If you need faster, then you'll have to bite the bullet and learn C.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re: Again Fuzzy regex !!!!
by AnomalousMonk (Archbishop) on May 22, 2015 at 22:48 UTC

    I don't have the time right now to go over almost nine hundred lines of C code to extract its functionality, but here's a version of your OPed Perl code that may perhaps serve as a starting point for what you want:

    c:\@Work\Perl\monks>perl -wMstrict -le "my $long = 'xxxxxxAAAAAxxxxxx'; my $short = 'AAxAA'; ;; my $len = length $short; for my $offset (0 .. length($long) - $len) { my $m = ($short ^ substr $long, $offset, $len) =~ tr{\0}{\0}; print qq{matched $m at offset $offset}; } " matched 1 at offset 0 matched 1 at offset 1 matched 2 at offset 2 matched 3 at offset 3 matched 2 at offset 4 matched 3 at offset 5 matched 4 at offset 6 matched 3 at offset 7 matched 2 at offset 8 matched 3 at offset 9 matched 2 at offset 10 matched 1 at offset 11 matched 1 at offset 12

    Update: Pay close attention to the advice of BrowserUk; big-data processing is his bread and butter.


    Give a man a fish:  <%-(-(-(-<

Re: Again Fuzzy regex !!!!
by Laurent_R (Canon) on May 22, 2015 at 22:49 UTC
    Hmm, I don't really understand what you asking for. Do you expect some monk(s) to rewrite 850 lines of C code into Perl? Even though doing the same thing in Perl is likely to take several times less code lines, you can't ask that from monks trying to help others for free to do that.

    I have just spent most of the last weeks rewriting into 3 Perl programs a really complex bunch of 50+ shell programs (tens of thousands of shell code lines) calling each other with various data elements, creating several hundred files in two dozen of different directories, with these processing running on five different computing platform. I can tell you that this is one of the most tedious developing tasks I had to do in the last decade or perhaps longer. Porting an algorithm is easy enough, porting a full application can be far more complicated.

    I wish I could have refused to do it, even if that meant not being paid for those weeks (I am freelance). But I could not really refuse, because I am really the only expert on that very narrow subject, so that anyone else would have spent much much more time than me. My client would have hardly forgiven such an attitude from me.

    I can tell you one thing: rewriting code into another language is often really painful. I would really have preferred to have a clear requirement specification and to do everything over from scratch. So, in brief, I am not gonna help you porting your C program to Perl (especially not for free).

    But, if you know well enough what you need, and if you want to analyze yourself the C program and understand what it does, and can explain clearly what fuzzy matching you are looking for, and give clear specification of the requirement, then probably many monks, including myself, will really be willing to help (if I can, as far as I am concerned).

      Do you expect some monk(s) to rewrite 850 lines of C code into Perl?

      If you think the 850 lines is bad, you wait till you cop a load of the 700,000 lines of C (with embedded mainframe weirdness) that those 850 lines are dependent upon :)


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      Dear Laurent_R first of all thank you very much for your answer I really don't need to convert the c code into perl all what I need is a perl code do the process as fast as this code do " the part of fuzzy match comparison " just this part which I really . and If there is no module in perl or code could do the same , I only need any one to highlight the library or the module in c that this man used which I can use too as an external c code in my tool. I know that you are monks are very very busy and I only seek for your wisdom to tell me the correct path through this .
Re: Again Fuzzy regex !!!!
by FreeBeerReekingMonk (Deacon) on May 22, 2015 at 23:08 UTC

    Tachyon already gave a solution: use agrep!
    agrep help source here: agrep
    as previously discussed here: node_id=406836
    (This is a must-read for you, read all the comments, as I do not know what you exactly want, agrep might actually NOT be what you need)

    Then there is: String::Approx
    And an implementation example of that.


    Edit: New version of String-Approx broke the link