Re: generating hash patterns for searching with one mismatch

With only 1 mismatch, just throw all the patterns together and let Perl's regex engine have at it:

$pat = join '|', reverse sort keys %pattern;
$pat = qr/$pat/;

Any vaguely recent Perl (5.10 or later) has the "trie optimization", which will automatically merge the branches. You can run with perl -Mre=debug to see what's going on.
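For a concrete picture of what the snippet above assumes, here is a minimal sketch that fills %pattern with every one-mismatch variant of a target and then builds the regex the same way. The helper name one_mismatch_variants, the target 'GATTACA', and the four-letter DNA alphabet are assumptions made for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: every string that differs from $target in at most
# one position, over the DNA alphabet (an assumption for this sketch).
sub one_mismatch_variants {
    my ($target) = @_;
    my %pattern = ( $target => 1 );
    for my $i ( 0 .. length($target) - 1 ) {
        for my $base (qw(A C G T)) {
            my $variant = $target;
            substr( $variant, $i, 1, $base );    # replace one position
            $pattern{$variant} = 1;
        }
    }
    return %pattern;
}

my %pattern = one_mismatch_variants('GATTACA');

# Same construction as in the post: one big alternation, compiled once.
my $pat = join '|', reverse sort keys %pattern;
$pat = qr/$pat/;

print scalar( keys %pattern ), " variants\n";    # 1 exact + 3 * 7 = 22
print "match\n" if 'TTGATTTCATT' =~ $pat;        # contains GATTTCA (one mismatch)
```

If you save this as a script and run it with perl -Mre=debug, the dump of the compiled regex should show the alternatives merged into a TRIE node rather than 22 separate branches.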

EDIT: For more than 1-2 mismatches, you'll need to do something a bit fancier, either constructing a suffix tree or using seeds. The seed idea rests on a pigeonhole argument: if you cut a pattern of length m into k+1 pieces, any occurrence with at most k mismatches must leave at least one piece intact, so it contains an exact substring of your original pattern of length at least floor(m/(k+1)). I don't know of a Perl module off-hand to do either.
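The seed approach can be sketched roughly as follows: split the pattern into k+1 pieces, look for exact hits of each piece with index, and verify each candidate window by counting mismatches. All the sub names here (seeds, hamming, find_with_mismatches) and the sample data are hypothetical, and this is only one simple way to do it, not a tuned implementation.

```perl
use strict;
use warnings;

# Split $pattern into $k + 1 contiguous pieces; returns [offset, piece] pairs.
sub seeds {
    my ( $pattern, $k ) = @_;
    my $n   = $k + 1;
    my $len = length $pattern;
    die "pattern too short for $k mismatches\n" if $len <= $k;
    my @seeds;
    for my $i ( 0 .. $n - 1 ) {
        my $start = int( $i * $len / $n );
        my $end   = int( ( $i + 1 ) * $len / $n );
        push @seeds, [ $start, substr( $pattern, $start, $end - $start ) ];
    }
    return @seeds;
}

# Count differing positions between two equal-length byte strings:
# XOR yields "\0" wherever the strings agree.
sub hamming {
    my ( $x, $y ) = @_;
    return ( $x ^ $y ) =~ tr/\0//c;
}

# Start offsets in $text where $pattern occurs with at most $k mismatches.
sub find_with_mismatches {
    my ( $text, $pattern, $k ) = @_;
    my %hits;
    for my $seed ( seeds( $pattern, $k ) ) {
        my ( $offset, $piece ) = @$seed;
        my $pos = -1;
        while ( ( $pos = index( $text, $piece, $pos + 1 ) ) >= 0 ) {
            my $start = $pos - $offset;    # where the full pattern would begin
            next if $start < 0 || $start + length($pattern) > length($text);
            my $window = substr( $text, $start, length $pattern );
            $hits{$start} = 1 if hamming( $window, $pattern ) <= $k;
        }
    }
    return sort { $a <=> $b } keys %hits;
}

# GCTTCCAG at offset 2 differs from GATTACAG in two positions.
my @found = find_with_mismatches( 'AAGCTTCCAGAA', 'GATTACAG', 2 );
print "found at: @found\n";
```

The exact seed search does the cheap filtering; the Hamming check only runs on the few windows that contain an intact seed.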


Re^2: generating hash patterns for searching with one mismatch by cedance (Novice) on Mar 18, 2011 at 09:21 UTC
Educated_foo, as you rightly pointed out, I am indeed working on biological data. It's a FASTQ file. The second line of every 4-line record (records span lines 1-4, 5-8, etc.) is a sequence, and I am looking for certain patterns that I have to remove if found.

I did exactly what you have mentioned here. However, there's 1 other optimization possible if you know that there are not going to be that many matches. I have about 20 million reads (sequences) and I know for a fact that there can't be more than 1 million matches. In this case, I decided to split my substring into 2 parts, $sub1 (the first half) and $sub2 (the second half). Now, with this condition:

if ( $seq !~ m/$sub1/ && $seq !~ m/$sub2/ ) {
    # Neither half matches exactly, so there are at least 2 mismatches:
    # the substring you are looking for is not here.
    # Don't check for any patterns, just skip this read.
    next;
}

I guess this doesn't mean much if your data is small or if the substrings occur too often. But it does make the code about 8-10x faster.

Thanks for the tip regarding trying for more mismatches. I have always wanted to code suffix arrays. Now may be the right time to experiment, since I have a huge data set on my hands. Thanks once again for all your valuable opinions!
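Put together with the FASTQ layout described above, the prefilter might look something like the sketch below. The sub name candidate_reads, the sample records, and the way the one-mismatch regex is built (a '.' wildcard at each position) are assumptions for illustration, not the code actually used in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: read 4-line FASTQ records from $fh and return the
# sequences that pass the two-halves prefilter and then match $pat.
sub candidate_reads {
    my ( $fh, $sub1, $sub2, $pat ) = @_;
    my @hits;
    while ( my $header = <$fh> ) {
        my $seq  = <$fh>;
        my $plus = <$fh>;
        my $qual = <$fh>;
        last unless defined $qual;    # truncated final record
        chomp $seq;
        # Neither half present exactly: at least 2 mismatches, so skip.
        next if $seq !~ /$sub1/ && $seq !~ /$sub2/;
        push @hits, $seq if $seq =~ $pat;
    }
    return @hits;
}

my $target = 'GATTACAG';
my ( $sub1, $sub2 ) = ( substr( $target, 0, 4 ), substr( $target, 4 ) );

# Assumed one-mismatch regex: the target with '.' at each position in turn.
my $pat = join '|',
    map { my $p = $target; substr( $p, $_, 1, '.' ); $p } 0 .. length($target) - 1;
$pat = qr/$pat/;

my $fastq = <<'END';
@r1
GGGATTACAGTT
+
IIIIIIIIIIII
@r2
CCGATTTCAGCC
+
IIIIIIIIIIII
@r3
TTTTTTTTTTTT
+
IIIIIIIIIIII
END

open my $fh, '<', \$fastq or die "cannot open in-memory file: $!";
my @hits = candidate_reads( $fh, $sub1, $sub2, $pat );
print "$_\n" for @hits;
```

The third read contains neither half, so it is rejected before the larger alternation is ever tried; that early exit is where the saving comes from when most reads do not match.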
Re^3: generating hash patterns for searching with one mismatch by educated_foo (Vicar) on Mar 18, 2011 at 16:09 UTC
Do you get a bit more speed by doing this?

if ($seq !~ /$sub1|$sub2/) { ... }

(It's fun to code in a domain where performance matters.)
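One way to answer that empirically is the core Benchmark module. The sketch below compares the two forms on made-up random reads; the data, the two halves, and the sub names are all assumptions, and the timings will depend on your Perl build and your real sequences.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up test data: 2000 random 50-base reads (not real FASTQ input).
my @bases = qw(A C G T);
my @reads = map { join '', map { $bases[ rand @bases ] } 1 .. 50 } 1 .. 2000;

my ( $sub1, $sub2 ) = ( 'GATTAC', 'AGGCTA' );    # hypothetical halves
my $either = qr/$sub1|$sub2/;                    # single combined regex

# Count reads rejected by two separate negated matches.
sub two_matches {
    my $n = grep { $_ !~ /$sub1/ && $_ !~ /$sub2/ } @reads;
    return $n;
}

# Count reads rejected by one negated match against the alternation.
sub one_match {
    my $n = grep { $_ !~ $either } @reads;
    return $n;
}

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese( -1, { two_matches => \&two_matches, one_match => \&one_match } );
```

Both subs reject exactly the same reads, so any difference in the table is purely the cost of two regex calls versus one.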