in reply to Re^2: generating hash patterns for searching with one mismatch
in thread generating hash patterns for searching with one mismatch

but I doubt its efficient for my case

I haven't benchmarked it, but I guess a comparison with your regex approach wouldn't turn out all that bad.

A regex pattern match also has to go through the entire string (or up until the first match). And you have to do it N times (21 in your example). The internal regex implementation, however, has the advantage of being fast, because it's written in C.

So it's essentially going through the string once, but slowly (iteration done in Perl), as opposed to going through the string 21 times, but fast...

The XOR method would also lend itself particularly well to being implemented in C (e.g. via Inline::C), in case you should need real speed.

Update: it turns out that even the Perl implementation of the XOR method is more than 4 times as fast:

#!/usr/bin/perl -w use strict; use Benchmark "cmpthese"; my $match = "abcdefghijklmnopqrst"; my $matchlen = length($match); my $linelen = 50; my @lines; my $i = 0; for (1..1e3) { my $line = join '', map chr(32+rand 96), 1..$linelen; my $s = $match; substr($s, $i++ % $matchlen, 1) = 'X'; substr($line, rand $linelen-$matchlen, $matchlen) = $s; push @lines, $line; } my @patterns = ($match); for my $i (0..$matchlen-1) { my $p = $match; substr($p, $i, 1) = "."; push @patterns, qr/$p/; } sub xorstr { my $line = shift; for my $i (0..$linelen-$matchlen) { my $s = substr $line, $i, $matchlen; my $x = $s ^ $match; my $diff = $x =~ tr/\0//c; return $s unless $diff > 1; } } sub regex { my $line = shift; for my $p (@patterns) { if ( $line =~ m/($p)/ ) { return $1; } } } cmpthese(-5, { regex => sub { my $r = regex($_) for @lines }, xorstr => sub { my $r = xorstr($_) for @lines }, } ); __END__ Rate regex xorstr regex 17.3/s -- -82% xorstr 97.3/s 462% --

Replies are listed 'Best First'.
Re^4: generating hash patterns for searching with one mismatch
by cedance (Novice) on Mar 18, 2011 at 09:11 UTC
    Eliya, Wow, thats interesting. Thanks for the script and benchmark. I'll modify it for my data file and try the same benchmarking and report you back. thanks again.