in reply to Re^3: Exact string matching
in thread Exact string matching

What are the minimum and maximum substring lengths you need to work with for your real data? With what you imply here this is never going to be a fast task! Oh, and you still haven't told us what you are trying to do! The more detail you give us about what you are trying to achieve the better we can help. So far you really haven't given us much.

You may be interested in Fast common substring matching which attempts to solve a problem that on the face of it may be similar to the problem you are working on.

True laziness is hard work

Replies are listed 'Best First'.
Re^5: Exact string matching
by saranrsm (Acolyte) on Oct 17, 2011 at 10:01 UTC
    Dear Monk,
    I went through your post, that was a well proposed post for similarity search, i.e. finding the best match between given set of strings (Correct me if I am wrong). But here we are dealing with a single text or string and not more than that and the aim is to find out all possible repeats present in the given string (string is to be a sequence of DNA,the file that I refereed earlier in my post). The minimum length of the substring should be 3 and maximum should be n-1 (n=length of the string)
    
    Here i have presented an example to get an idea of it.
    
    eg) $text='AAATGAAAT'
    
    Window or substring of length =3
    set of all possible substrings={AAA,AAT,ATG,TGA,GAA,AAA,AAT}
    AAA => Occurred 2 times at 0 and 5
    AAT => Occurred 2 times at 1 and 6
    no more repeats found
    
    Window or substring of length =4
    set of all possible substrings={AAAT,AATG,ATGA,TGAA,GAAA,AAAT}
    AAAT => Occurred 2 times at 0 and 5
    no more repeats found
    
    Window or substring of length =5
    set of all possible substrings={AAATG,AATGA,ATGAA,TGAAA,GAAAT}
    no repeats found
    
    and so on upto length of 9-1=8
    because if there a case where our string looks like this $text='AAAAAAAAA'
    then, 
    Window or substring of length =8
    set of all possible substrings={AAAAAAAA,AAAAAAAA}
    AAAAAAAA => Occurred 2 times at 0 and 1
    
    so I have defined the maximum length of the substring to be n-1 (9-1=8) such that we dont miss any repeats of any length.
    
    ya, as you had quoted, that what I had will never be a faster task and I agree with you, that why i need a faster algorithm or a faster alternative.
    
    if I am not clear yet, please let me know.
    

      Why? Why do you want to do this? What are you hoping to achieve by obtaining counts of repeats in this fashion? I completely understand what you want to do in the very limited context you have described. I have no idea at all why you want to do it, and it is that information that is most likely to get you a more practical solution - a solution to the bigger problem rather than a solution to a small part of it.

      As an aside: how much RAM do you have in your computer and is it a 64 bit system? Are you using a 64 bit build of Perl? Both of these will have an influence on the solutions you may have available.

      True laziness is hard work

        "a solution to the bigger problem rather than a solution to a small part of it."

        Dear Monk, you are exactly correct. the reason for this work is to do with a biological context. In the context of DNA there are different types of repeat families (ya we do have families of repeats) such as tandem repeats are one type, where the repeats are in tandem eg) AATAAT,(AAT is repeated in tandem) next is palindromic repeat where a palindrome is repeated eg) AATTAAATCGACTAATTAA (AATTAA, a palindrome is repeated with some spacer in between)and so on..these repeats regions associates to various diseases and they also act as marker or as landmark to hunt down particular regions.(there are certain regions, that are flanked by certain repeat families)

        So if i have the informations such as the positions and the frequencies of the a particular repeat family, one could hunt down certain regions which are flanked by these repeats and with the number of frequencies of the certain repeats one could estimate how likely is this DNA (i.e. the string in our scripting context) mutated

        I hope i am clear this time, if not plz let me know

        system information RAM: 4 GB, Processor: Intel® Core™2 Quad CPU Q8300 @ 2.50GHz × 4, 64 bit machine, perl-v5.14.1 built for x86_64-linux, OS - Fedora 15