in reply to generating hash patterns for searching with one mismatch

Here's another suggestion using string-XOR techique:

my $str = "abcabcdefgabcdXfgabYdefgabcdZZg"; my $match = "abcdefg"; my $len = length($match); for my $i (0..length($str)-$len) { my $s = substr $str, $i, $len; my $x = $s ^ $match; $x =~ tr/\0//d; # equal characters xor to \0 print "'$s' at offset $i matched" unless length($x) > 1; } __END__ 'abcdefg' at offset 3 matched 'abcdXfg' at offset 10 matched 'abYdefg' at offset 17 matched

Replies are listed 'Best First'.
Re^2: generating hash patterns for searching with one mismatch
by johngg (Canon) on Mar 17, 2011 at 12:04 UTC

    An alternative to tranliterating to delete the \0 characters and then getting the length of what's left would be to complement the \0's, assigning to a scalar so as to count all non-nulls.

    my $count = $x =~ tr/\0//c; print "'$s' at offset $i matched" unless $count > 1;

    It's six of one, a half-dozen of the other I suppose but I thought it might be worth a mention.

    Cheers,

    JohnGG

Re^2: generating hash patterns for searching with one mismatch
by cedance (Novice) on Mar 17, 2011 at 12:01 UTC

    Hi AR and Ratazong,

    Thanks for the links to modules. But for me, "mismatch" only means "replacement". There should be no insertions or deletions. It seems to me that the module checks for all types of mismatches. no?

    Hi Eliya,

    I think I understand the method, but I doubt its efficient for my case. I forgot to mention that I have to perform the same for about 2e7 lines. Do you think it would be faster if we obtain all possible substrings for all lines?

    For sake of this discussion, lets assume the line length is 50 and substring to check for each line is 20. So what I did was to create 21 different patterns, with the last 20 patterns replacing each character with a "." for regexp wild character comparison.

    Then, its pretty straightforward:

    for ($i=0; $i < "total lines"; $i++) { for $p (keys %pattern) { if ( $line[$i] =~ m/$p/ ) { # match; do something and break loop with "last;" } } }
      but I doubt its efficient for my case

      I haven't benchmarked it, but I guess a comparison with your regex approach wouldn't turn out all that bad.

      A regex pattern match also has to go through the entire string (or up until the first match). And you have to do it N times (21 in your example). The internal regex implementation, however, has the advantage of being fast, because it's written in C.

      So it's essentially going through the string once, but slowly (iteration done in Perl), as opposed to going through the string 21 times, but fast...

      The XOR method would also lend itself particularly well to being implemented in C (e.g. via Inline::C), in case you should need real speed.

      Update: it turns out that even the Perl implementation of the XOR method is more than 4 times as fast:

      #!/usr/bin/perl -w use strict; use Benchmark "cmpthese"; my $match = "abcdefghijklmnopqrst"; my $matchlen = length($match); my $linelen = 50; my @lines; my $i = 0; for (1..1e3) { my $line = join '', map chr(32+rand 96), 1..$linelen; my $s = $match; substr($s, $i++ % $matchlen, 1) = 'X'; substr($line, rand $linelen-$matchlen, $matchlen) = $s; push @lines, $line; } my @patterns = ($match); for my $i (0..$matchlen-1) { my $p = $match; substr($p, $i, 1) = "."; push @patterns, qr/$p/; } sub xorstr { my $line = shift; for my $i (0..$linelen-$matchlen) { my $s = substr $line, $i, $matchlen; my $x = $s ^ $match; my $diff = $x =~ tr/\0//c; return $s unless $diff > 1; } } sub regex { my $line = shift; for my $p (@patterns) { if ( $line =~ m/($p)/ ) { return $1; } } } cmpthese(-5, { regex => sub { my $r = regex($_) for @lines }, xorstr => sub { my $r = xorstr($_) for @lines }, } ); __END__ Rate regex xorstr regex 17.3/s -- -82% xorstr 97.3/s 462% --
        Eliya, Wow, thats interesting. Thanks for the script and benchmark. I'll modify it for my data file and try the same benchmarking and report you back. thanks again.