Dear Monk,
Here's my exact problem.
I want to find out the positions and the frequency of all possible repeated substrings of fixed length from a given file which looks like this (http://www.ncbi.nlm.nih.gov/nuccore/AC245882.2?report=fasta&log$=seqview&format=text).
eg)$text='ATCGCTATCGC'
For substring of length 3
ATC => Occurred 2 times, at 0 and 6
CGC => Occurred 2 times, at 2 and 8
For substring of length 4
ATCG => Occurred 2 times, at 0 and 6
TCGC => Occurred 2 times, at 1 and 7
and so on..
If I used my script (which I have posted), it would take ages to complete for the above linked file.
if I am not clear yet, please let me know.
| [reply] |
What are the minimum and maximum substring lengths you need to work with for your real data? With what you imply here this is never going to be a fast task! Oh, and you still haven't told us what you are trying to do! The more detail you give us about what you are trying to achieve the better we can help. So far you really haven't given us much.
You may be interested in Fast common substring matching which attempts to solve a problem that on the face of it may be similar to the problem you are working on.
True laziness is hard work
| [reply] |
Dear Monk,
I went through your post, that was a well proposed post for similarity search, i.e. finding the best match between given set of strings (Correct me if I am wrong). But here we are dealing with a single text or string and not more than that and the aim is to find out all possible repeats present in the given string (string is to be a sequence of DNA,the file that I refereed earlier in my post). The minimum length of the substring should be 3 and maximum should be n-1 (n=length of the string)
Here i have presented an example to get an idea of it.
eg) $text='AAATGAAAT'
Window or substring of length =3
set of all possible substrings={AAA,AAT,ATG,TGA,GAA,AAA,AAT}
AAA => Occurred 2 times at 0 and 5
AAT => Occurred 2 times at 1 and 6
no more repeats found
Window or substring of length =4
set of all possible substrings={AAAT,AATG,ATGA,TGAA,GAAA,AAAT}
AAAT => Occurred 2 times at 0 and 5
no more repeats found
Window or substring of length =5
set of all possible substrings={AAATG,AATGA,ATGAA,TGAAA,GAAAT}
no repeats found
and so on upto length of 9-1=8
because if there a case where our string looks like this $text='AAAAAAAAA'
then,
Window or substring of length =8
set of all possible substrings={AAAAAAAA,AAAAAAAA}
AAAAAAAA => Occurred 2 times at 0 and 1
so I have defined the maximum length of the substring to be n-1 (9-1=8) such that we dont miss any repeats of any length.
ya, as you had quoted, that what I had will never be a faster task and I agree with you, that why i need a faster algorithm or a faster alternative.
if I am not clear yet, please let me know.
| [reply] |