Re^3: generating hash patterns for searching with one mismatch

but I doubt its efficient for my case

I haven't benchmarked it, but I guess a comparison with your regex approach wouldn't turn out all that bad.

A regex pattern match also has to go through the entire string (or up until the first match). And you have to do it N times (21 in your example). The internal regex implementation, however, has the advantage of being fast, because it's written in C.

So it's essentially going through the string once, but slowly (iteration done in Perl), as opposed to going through the string 21 times, but fast...

The XOR method would also lend itself particularly well to being implemented in C (e.g. via Inline::C), in case you should need real speed.

Update: it turns out that even the Perl implementation of the XOR method is more than 4 times as fast:

#!/usr/bin/perl -w
use strict;
use Benchmark "cmpthese";

my $match = "abcdefghijklmnopqrst";
my $matchlen = length($match);
my $linelen = 50;
my @lines;
my $i = 0;
for (1..1e3) {
    my $line = join '', map chr(32+rand 96), 1..$linelen;
    my $s = $match;
    substr($s, $i++ % $matchlen, 1) = 'X';
    substr($line, rand $linelen-$matchlen, $matchlen) = $s;
    push @lines, $line;
}
my @patterns = ($match);
for my $i (0..$matchlen-1) {
    my $p = $match;
    substr($p, $i, 1) = ".";
    push @patterns, qr/$p/;
}

sub xorstr {
    my $line = shift;
    for my $i (0..$linelen-$matchlen) {
        my $s = substr $line, $i, $matchlen;
        my $x = $s ^ $match;
        my $diff = $x =~ tr/\0//c;
        return $s unless $diff > 1;
    }
    
}
sub regex {
    my $line = shift;
    for my $p (@patterns) {
        if ( $line =~ m/($p)/ ) {
            return $1;
        }
    }
}

cmpthese(-5,
    {
        regex  => sub { my $r = regex($_)  for @lines },
        xorstr => sub { my $r = xorstr($_) for @lines },
    }
);

__END__
         Rate  regex xorstr
regex  17.3/s     --   -82%
xorstr 97.3/s   462%     --
[download]

Comment on Re^3: generating hash patterns for searching with one mismatch Download Code

Replies are listed 'Best First'.
Re^4: generating hash patterns for searching with one mismatch by cedance (Novice) on Mar 18, 2011 at 09:11 UTC
Eliya, Wow, thats interesting. Thanks for the script and benchmark. I'll modify it for my data file and try the same benchmarking and report you back. thanks again.	[reply]