generating hash patterns for searching with one mismatch

cedance has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: generating hash patterns for searching with one mismatch by Eliya (Vicar) on Mar 17, 2011 at 11:02 UTC
Here's another suggestion using string-XOR techique: `my $str = "abcabcdefgabcdXfgabYdefgabcdZZg"; my $match = "abcdefg"; my $len = length($match); for my $i (0..length($str)-$len) { my $s = substr $str, $i, $len; my $x = $s ^ $match; $x =~ tr/\0//d; # equal characters xor to \0 print "'$s' at offset $i matched" unless length($x) > 1; } __END__ 'abcdefg' at offset 3 matched 'abcdXfg' at offset 10 matched 'abYdefg' at offset 17 matched` [download]	[reply] [d/l]
Re^2: generating hash patterns for searching with one mismatch by johngg (Canon) on Mar 17, 2011 at 12:04 UTC
An alternative to tranliterating to delete the `\0` characters and then getting the length of what's left would be to complement the `\0`'s, assigning to a scalar so as to count all non-nulls. `my $count = $x =~ tr/\0//c; print "'$s' at offset $i matched" unless $count > 1;` [download] It's six of one, a half-dozen of the other I suppose but I thought it might be worth a mention. Cheers, JohnGG	[reply] [d/l] [select]
Re^2: generating hash patterns for searching with one mismatch by cedance (Novice) on Mar 17, 2011 at 12:01 UTC
Hi AR and Ratazong, Thanks for the links to modules. But for me, "mismatch" only means "replacement". There should be no insertions or deletions. It seems to me that the module checks for all types of mismatches. no? Hi Eliya, I think I understand the method, but I doubt its efficient for my case. I forgot to mention that I have to perform the same for about 2e7 lines. Do you think it would be faster if we obtain all possible substrings for all lines? For sake of this discussion, lets assume the line length is 50 and substring to check for each line is 20. So what I did was to create 21 different patterns, with the last 20 patterns replacing each character with a "." for regexp wild character comparison. Then, its pretty straightforward: `for ($i=0; $i < "total lines"; $i++) { for $p (keys %pattern) { if ( $line[$i] =~ m/$p/ ) { # match; do something and break loop with "last;" } } }` [download]	[reply] [d/l]
Re^3: generating hash patterns for searching with one mismatch by Eliya (Vicar) on Mar 17, 2011 at 12:29 UTC
but I doubt its efficient for my case I haven't benchmarked it, but I guess a comparison with your regex approach wouldn't turn out all that bad. A regex pattern match also has to go through the entire string (or up until the first match). And you have to do it N times (21 in your example). The internal regex implementation, however, has the advantage of being fast, because it's written in C. So it's essentially going through the string once, but slowly (iteration done in Perl), as opposed to going through the string 21 times, but fast... The XOR method would also lend itself particularly well to being implemented in C (e.g. via Inline::C), in case you should need real speed. Update: it turns out that even the Perl implementation of the XOR method is more than 4 times as fast: #!/usr/bin/perl -w use strict; use Benchmark "cmpthese"; my $match = "abcdefghijklmnopqrst"; my $matchlen = length($match); my $linelen = 50; my @lines; my $i = 0; for (1..1e3) { my $line = join '', map chr(32+rand 96), 1..$linelen; my $s = $match; substr($s, $i++ % $matchlen, 1) = 'X'; substr($line, rand $linelen-$matchlen, $matchlen) = $s; push @lines, $line; } my @patterns = ($match); for my $i (0..$matchlen-1) { my $p = $match; substr($p, $i, 1) = "."; push @patterns, qr/$p/; } sub xorstr { my $line = shift; for my $i (0..$linelen-$matchlen) { my $s = substr $line, $i, $matchlen; my $x = $s ^ $match; my $diff = $x =~ tr/\0//c; return $s unless $diff > 1; } } sub regex { my $line = shift; for my $p (@patterns) { if ( $line =~ m/($p)/ ) { return $1; } } } cmpthese(-5, { regex => sub { my $r = regex($_) for @lines }, xorstr => sub { my $r = xorstr($_) for @lines }, } ); __END__ Rate regex xorstr regex 17.3/s -- -82% xorstr 97.3/s 462% -- [download]	[reply] [d/l]
Re^4: generating hash patterns for searching with one mismatch by cedance (Novice) on Mar 18, 2011 at 09:11 UTC
Re: generating hash patterns for searching with one mismatch by Ratazong (Monsignor) on Mar 17, 2011 at 10:52 UTC
hi 1) maybe Text::Levenshtein will help you here; it calculates the Levenshtein_distance for you, which would be 1 in your case. Possibly the Hamming_distance can also help you here... 2) efficiency is hard to define; even if you use a perl-idiom like `map`, there will be still an underlying loop; and using fewer keystrokes in your perl-program often reduces readability - especially if you use tricks/constructs you are not used to read ... so in my eyes efficiency is depending on your level of knowledge (and as that level is likely to increase, your rating what's efficient and what's not is likely to change) HTH, Rata Update: educated_foo provided a comment to this node (thanks!), so I think a bit of clarification is helpful: I thought about using the Levenshtein-Distance the following way: `my $len = length($pattern); for (my $i = 0; i < lenght($str)-$len; $i++) { if (distance( substr ($str, $i, $len), $pattern) < 2) { print "si +milar"; } }` [download] That way you would get the similarity, and since the length of the substring and the length of the pattern are the same, inserts/deletes are no issue. However since the further discussion showed that cedance wants a high-performance solution, I would not recommend that approach any longer... it is nice, elegant and slow ;-) Regarding my comment on efficiency: seems I was misleaded by the wording perl syntaxes enables us to write shorter than say C/C++. It has been clarified later on in the thread ... seems I was victim of an XY Problem here.	[reply] [d/l] [select]
Re^2: generating hash patterns for searching with one mismatch by educated_foo (Vicar) on Mar 17, 2011 at 15:54 UTC
1) maybe Text::Levenshtein will help you here; it calculates the Levenshtein_distance for you, Which is nothing like what he wants, since it allows indels. He wants hamming distance, so T::L would do the wrong thing. efficiency is hard to define; No it isn't: he probably wants to match many short strings against a long one (like a chromosome, with millions of characters) as quickly as possible.	[reply]
Re: generating hash patterns for searching with one mismatch by AR (Friar) on Mar 17, 2011 at 10:51 UTC
Have you taken a look at String::Approx?	[reply]
Re: generating hash patterns for searching with one mismatch by educated_foo (Vicar) on Mar 17, 2011 at 15:47 UTC
With only 1 mismatch, just throw all the patterns together and let Perl's regex engine have at it: `$pat = join '\|', reverse sort keys %pattern; $pat = qr/$pat/;` [download] Any vaguely recent Perl has the "trie optimization" which will automatically merge the branches. You can run with `perl -Mre=debug` to see what's going on. EDIT: For more than 1-2 mismatches, you'll need to do something a bit fancier, either constructing a suffix tree or using seeds (you can show that any k-mismatch pattern must contain a substring of your original pattern of at least a certain length). I don't know of a Perl module off-hand to do either.	[reply] [d/l] [select]
Re^2: generating hash patterns for searching with one mismatch by cedance (Novice) on Mar 18, 2011 at 09:21 UTC
Educated_foo, As you rightly pointed out, I am indeed working on biological data. Its a fastq format file. The second line, of every 4 lines; 1:4,5:8 etc.., is a sequence and I am looking for certain patterns and I have to remove them if found. I did exactly the same as you have mentioned here. However, there's 1 other optimization possible, if you know that there are not going to be that many matches. I have about 20 million reads (sequences) and I know for a fact that there can't be more than 1 million. In this case, I decided to split my substring into 2 parts: sub1 = "first half" sub2 = "second half" Now, with this condition, `if ( $seq !~ m/$sub1/ && $seq !~ m/$sub2/ ) { # this means there are at least 2 mismatches # the substring you are looking for is not here # so don't check for any patterns, just "next;" }` [download] I guess this doesn't mean much if your data is small or if the substrings occur too often. But it does result in faster code by about 8-10x times. Thanks for the tip regarding trying for more mismatches. I have always wanted to code for suffix arrays. Now may be the right time to experiment, that I have a huge data in my hands. Thanks once again for all your valuable opinions!	[reply] [d/l]
Re^3: generating hash patterns for searching with one mismatch by educated_foo (Vicar) on Mar 18, 2011 at 16:09 UTC
Do you get a bit more speed by doing this? `if ($seq !~ /$sub1\|$sub2/) { ... }` [download] (It's fun to code in a domain where performance matters.)	[reply] [d/l]