Educated_foo,
As you rightly pointed out, I am indeed working on biological data. It's a FASTQ format file. Each record spans 4 lines (lines 1-4, 5-8, etc.), and the second line of every record is a sequence. I am looking for certain patterns in these sequences and have to remove them if found.
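For anyone following along, pulling out the sequence lines could look roughly like this. It is only a sketch: the sample records and the `read_sequences` helper are made up for illustration, and a real file would be opened by name rather than from an in-memory string.

```perl
use strict;
use warnings;

# Hypothetical sample data: two FASTQ records of 4 lines each.
my $fastq = join "\n",
    '@read1', 'ACGTACGTAC', '+', 'IIIIIIIIII',
    '@read2', 'TTGACCAGTA', '+', 'IIIIIIIIII', '';

# Return the sequence line (line 2) of every 4-line record.
sub read_sequences {
    my ($fh) = @_;
    my @seqs;
    while ( defined( my $header = <$fh> ) ) {
        my $seq  = <$fh>;
        my $plus = <$fh>;
        my $qual = <$fh>;
        last unless defined $qual;    # truncated record: stop reading
        chomp $seq;
        push @seqs, $seq;
    }
    return @seqs;
}

# Open the string as an in-memory file so the sketch is self-contained.
open my $fh, '<', \$fastq or die "cannot open in-memory file: $!";
my @seqs = read_sequences($fh);
close $fh;

print "$_\n" for @seqs;
```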
I did exactly what you suggested here. However, there is one other optimization possible if you know that there are not going to be many matches. I have about 20 million reads (sequences) and I know for a fact that no more than 1 million of them can match. In this case, I decided to split my substring into 2 parts:
my $sub1 = "first half";
my $sub2 = "second half";
Now, with this condition,
if ( $seq !~ m/$sub1/ && $seq !~ m/$sub2/ ) {
    # neither half matches exactly, so there are at least 2 mismatches:
    # the substring you are looking for is not here,
    # so don't check for any patterns, just skip to the next read
    next;
}
I guess this doesn't mean much if your data is small or if the substrings occur too often. But in my case it makes the code about 8-10 times faster.
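Put together, the prefilter might be sketched as below. Everything concrete here is an assumption for illustration: the pattern, the sample reads, and the `has_near_match` helper are invented, `index` stands in for the regex match because the halves are literal strings, and dropping a read stands in for whatever removal is actually needed. The shortcut is safe for the one-mismatch case because a window with at most one mismatch must contain at least one of the two halves unchanged.

```perl
use strict;
use warnings;

# Hypothetical pattern, split into two halves as described above.
my $pattern = 'ACGTTGCA';
my $half    = int( length($pattern) / 2 );
my $sub1    = substr( $pattern, 0, $half );
my $sub2    = substr( $pattern, $half );

# Slow check: does $seq contain $pat with at most one mismatch?
sub has_near_match {
    my ( $seq, $pat ) = @_;
    my $len = length $pat;
    for my $pos ( 0 .. length($seq) - $len ) {
        # String XOR gives "\0" at every position where the two agree.
        my $diff       = substr( $seq, $pos, $len ) ^ $pat;
        my $mismatches = ( $diff =~ tr/\0//c );    # count non-NUL bytes
        return 1 if $mismatches <= 1;
    }
    return 0;
}

# Hypothetical reads.
my @reads = (
    'GGGACGTTGCAGGG',    # exact occurrence
    'GGGACGTTGGAGGG',    # one mismatch
    'GGGACCTTGGAGGG',    # two mismatches
    'TTTTTTTTTTTTTT',    # nothing similar
);

my @kept;
for my $seq (@reads) {
    # Cheap prefilter: neither half occurs exactly, so skip the slow check.
    if ( index( $seq, $sub1 ) < 0 && index( $seq, $sub2 ) < 0 ) {
        push @kept, $seq;
        next;
    }
    push @kept, $seq unless has_near_match( $seq, $pattern );
}

print "$_\n" for @kept;
```

With mostly non-matching reads, nearly every iteration ends at the two `index` calls, which is where the speedup would come from.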
Thanks for the tip about trying for more mismatches. I have always wanted to code suffix arrays. Now may be the right time to experiment, since I have a huge data set on my hands.
Thanks once again for all your valuable opinions!
In reply to Re^2: generating hash patterns for searching with one mismatch
by cedance
in thread generating hash patterns for searching with one mismatch
by cedance