cedance has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a string and I would like to search for a substring with 0 or 1 mismatch. So, I went ahead and created all possible patterns (so that later I could regexp with foreach over all patterns):

$pattern{$opt_a} = 1; for ($i=0; $i<length($opt_a); $i++) { $prop = $opt_a; substr($prop, $i, 1) = "."; $pattern{$prop}++; }

I have 2 questions about this:

1) Is there an efficient way than this to search for mismatch?

2) If not, is there a possibility to obtain this set of patterns in a more efficient manner rather than using the typical for-loop structure ( I would suppose perl syntaxes enables us to write shorter than say C/C++)?

Thank you!

Replies are listed 'Best First'.
Re: generating hash patterns for searching with one mismatch
by Eliya (Vicar) on Mar 17, 2011 at 11:02 UTC

    Here's another suggestion using string-XOR techique:

    my $str = "abcabcdefgabcdXfgabYdefgabcdZZg"; my $match = "abcdefg"; my $len = length($match); for my $i (0..length($str)-$len) { my $s = substr $str, $i, $len; my $x = $s ^ $match; $x =~ tr/\0//d; # equal characters xor to \0 print "'$s' at offset $i matched" unless length($x) > 1; } __END__ 'abcdefg' at offset 3 matched 'abcdXfg' at offset 10 matched 'abYdefg' at offset 17 matched

      An alternative to tranliterating to delete the \0 characters and then getting the length of what's left would be to complement the \0's, assigning to a scalar so as to count all non-nulls.

      my $count = $x =~ tr/\0//c; print "'$s' at offset $i matched" unless $count > 1;

      It's six of one, a half-dozen of the other I suppose but I thought it might be worth a mention.

      Cheers,

      JohnGG

      Hi AR and Ratazong,

      Thanks for the links to modules. But for me, "mismatch" only means "replacement". There should be no insertions or deletions. It seems to me that the module checks for all types of mismatches. no?

      Hi Eliya,

      I think I understand the method, but I doubt its efficient for my case. I forgot to mention that I have to perform the same for about 2e7 lines. Do you think it would be faster if we obtain all possible substrings for all lines?

      For sake of this discussion, lets assume the line length is 50 and substring to check for each line is 20. So what I did was to create 21 different patterns, with the last 20 patterns replacing each character with a "." for regexp wild character comparison.

      Then, its pretty straightforward:

      for ($i=0; $i < "total lines"; $i++) { for $p (keys %pattern) { if ( $line[$i] =~ m/$p/ ) { # match; do something and break loop with "last;" } } }
        but I doubt its efficient for my case

        I haven't benchmarked it, but I guess a comparison with your regex approach wouldn't turn out all that bad.

        A regex pattern match also has to go through the entire string (or up until the first match). And you have to do it N times (21 in your example). The internal regex implementation, however, has the advantage of being fast, because it's written in C.

        So it's essentially going through the string once, but slowly (iteration done in Perl), as opposed to going through the string 21 times, but fast...

        The XOR method would also lend itself particularly well to being implemented in C (e.g. via Inline::C), in case you should need real speed.

        Update: it turns out that even the Perl implementation of the XOR method is more than 4 times as fast:

        #!/usr/bin/perl -w use strict; use Benchmark "cmpthese"; my $match = "abcdefghijklmnopqrst"; my $matchlen = length($match); my $linelen = 50; my @lines; my $i = 0; for (1..1e3) { my $line = join '', map chr(32+rand 96), 1..$linelen; my $s = $match; substr($s, $i++ % $matchlen, 1) = 'X'; substr($line, rand $linelen-$matchlen, $matchlen) = $s; push @lines, $line; } my @patterns = ($match); for my $i (0..$matchlen-1) { my $p = $match; substr($p, $i, 1) = "."; push @patterns, qr/$p/; } sub xorstr { my $line = shift; for my $i (0..$linelen-$matchlen) { my $s = substr $line, $i, $matchlen; my $x = $s ^ $match; my $diff = $x =~ tr/\0//c; return $s unless $diff > 1; } } sub regex { my $line = shift; for my $p (@patterns) { if ( $line =~ m/($p)/ ) { return $1; } } } cmpthese(-5, { regex => sub { my $r = regex($_) for @lines }, xorstr => sub { my $r = xorstr($_) for @lines }, } ); __END__ Rate regex xorstr regex 17.3/s -- -82% xorstr 97.3/s 462% --
Re: generating hash patterns for searching with one mismatch
by Ratazong (Monsignor) on Mar 17, 2011 at 10:52 UTC

    hi

    1) maybe Text::Levenshtein will help you here; it calculates the Levenshtein_distance for you, which would be 1 in your case. Possibly the Hamming_distance can also help you here...

    2) efficiency is hard to define; even if you use a perl-idiom like map, there will be still an underlying loop; and using fewer keystrokes in your perl-program often reduces readability - especially if you use tricks/constructs you are not used to read ... so in my eyes efficiency is depending on your level of knowledge (and as that level is likely to increase, your rating what's efficient and what's not is likely to change)

    HTH, Rata

    Update:

    educated_foo provided a comment to this node (thanks!), so I think a bit of clarification is helpful:

    I thought about using the Levenshtein-Distance the following way:

    my $len = length($pattern); for (my $i = 0; i < lenght($str)-$len; $i++) { if (distance( substr ($str, $i, $len), $pattern) < 2) { print "si +milar"; } }
    That way you would get the similarity, and since the length of the substring and the length of the pattern are the same, inserts/deletes are no issue.

    However since the further discussion showed that cedance wants a high-performance solution, I would not recommend that approach any longer... it is nice, elegant and slow ;-)

    Regarding my comment on efficiency: seems I was misleaded by the wording perl syntaxes enables us to write shorter than say C/C++. It has been clarified later on in the thread ... seems I was victim of an XY Problem here.

      1) maybe Text::Levenshtein will help you here; it calculates the Levenshtein_distance for you,
      Which is nothing like what he wants, since it allows indels. He wants hamming distance, so T::L would do the wrong thing.
      efficiency is hard to define;
      No it isn't: he probably wants to match many short strings against a long one (like a chromosome, with millions of characters) as quickly as possible.
Re: generating hash patterns for searching with one mismatch
by AR (Friar) on Mar 17, 2011 at 10:51 UTC
Re: generating hash patterns for searching with one mismatch
by educated_foo (Vicar) on Mar 17, 2011 at 15:47 UTC
    With only 1 mismatch, just throw all the patterns together and let Perl's regex engine have at it:
    $pat = join '|', reverse sort keys %pattern; $pat = qr/$pat/;
    Any vaguely recent Perl has the "trie optimization" which will automatically merge the branches. You can run with perl -Mre=debug to see what's going on.

    EDIT: For more than 1-2 mismatches, you'll need to do something a bit fancier, either constructing a suffix tree or using seeds (you can show that any k-mismatch pattern must contain a substring of your original pattern of at least a certain length). I don't know of a Perl module off-hand to do either.

      Educated_foo,

      As you rightly pointed out, I am indeed working on biological data. Its a fastq format file. The second line, of every 4 lines; 1:4,5:8 etc.., is a sequence and I am looking for certain patterns and I have to remove them if found.

      I did exactly the same as you have mentioned here. However, there's 1 other optimization possible, if you know that there are not going to be that many matches. I have about 20 million reads (sequences) and I know for a fact that there can't be more than 1 million. In this case, I decided to split my substring into 2 parts:

      sub1 = "first half"

      sub2 = "second half"

      Now, with this condition,

      if ( $seq !~ m/$sub1/ && $seq !~ m/$sub2/ ) { # this means there are at least 2 mismatches # the substring you are looking for is not here # so don't check for any patterns, just "next;" }

      I guess this doesn't mean much if your data is small or if the substrings occur too often. But it does result in faster code by about 8-10x times.

      Thanks for the tip regarding trying for more mismatches. I have always wanted to code for suffix arrays. Now may be the right time to experiment, that I have a huge data in my hands.

      Thanks once again for all your valuable opinions!

        Do you get a bit more speed by doing this?
        if ($seq !~ /$sub1|$sub2/) { ... }
        (It's fun to code in a domain where performance matters.)