Re^2: matching with inclusion

Very nice!

When I saw your answer, I wondered whether a bunch of index calls would be faster. I didn't think so, but it was worth a shot. (But see at the end -- the index approach has BETTER worst case behaviour...)

The approach I was proposing was:


sub findInterleavedString {
    my $string=$_[0];
    my $word=$_[1];
    my @letters=split //,$word;

    my $index=0;
    for(my $i=0;$i<=$#letters && $index!=-1;++$i) {
        $index=index $string, $letters[$i], $index+1;
    }
    return $index!=-1;
}
[download]

I tested 2 versions of each approach: ikegami and rmgir "precompute" the regex (or the array) for sigma, while the _noprep variants don't.

#!/usr/bin/perl -w

use strict;

use Benchmark qw(cmpthese);

{
    my $word="sigma";
    my $re  = join '.*?',
              map quotemeta,
              split (//,$word);

    sub ikegami {
        my $result=1;
        foreach my $text (
            'sigma',
            'stigma',
            'magma',
        ) {
            $result &&= ($text=~$re);
        }
    }
}

sub ikegami_noprep {
    my $word="sigma";
    my $re  = join '.*?',
              map quotemeta,
              split (//,$word);
    my $result=1;
    foreach my $text (
        'sigma',
        'stigma',
        'magma',
    ) {
        $result &&= ($text=~$re);
    }
}

{
    my $word='sigma';
    my @letters=split //,$word;
    sub findSigma {
        my $string=$_[0];
        my $index=0;
        for(my $i=0;$i<=$#letters && $index!=-1;++$i) {
            $index=index $string, $letters[$i], $index+1;
        }
        return $index!=-1;
    }
    sub rmgir {
        my $result=1;
        foreach my $text (
            'sigma',
            'stigma',
            'magma',
        ) {
            $result &&= findSigma($text);
        }
    }
}

sub findInterleavedString {
    my $string=$_[0];
    my $word=$_[1];
    my @letters=split //,$word;

    my $index=0;
    for(my $i=0;$i<=$#letters && $index!=-1;++$i) {
        $index=index $string, $letters[$i], $index+1;
    }
    return $index!=-1;
}

sub rmgir_noprep {
    my $result=1;
    foreach my $text (
        'sigma',
        'stigma',
        'magma',
    ) {
        $result &&= findInterleavedString($text, 'sigma');
    }
}

cmpthese -3, { 
                ikegami => \&ikegami,
                ikegami_noprep => \&ikegami_noprep,
                rmgir => \&rmgir,
                rmgir_noprep => \&rmgir_noprep,
             };
[download]

The results confirm that your approach kicks index all around the room :)

                   Rate  rmgir_noprep ikegami_noprep         rmgir       ikegami
rmgir_noprep    24065/s            --           -65%          -71%          -92%
ikegami_noprep  69743/s          190%             --          -16%          -76%
rmgir           82694/s          244%            19%            --          -72%
ikegami        295753/s         1129%           324%          258%            --

This makes sense, of course, since the iteration from one "index" call to the next happens in perl, and the regex is all one opcode.

But it was a fun Saturday afternoon experiment...

Edit: Note that for REALLY pathological strings, the index approach still performs reasonably well, while the regex can become very slow or hang... For example, if matching against

	('a' x 110) .
	'stigm' .
	('stigm' x 100);

(which doesn't match), the regex approaches take about 2 seconds to fail 1 call, while the index approach can still make over 10000 calls per second.

So if you have to worry about a hostile user feeding your program bad strings, for example, and your maximum performance requirement isn't too high, then the index approach might be better...

Added missing +1 to index last argument, and note about worst case behaviour...

Mike

Comment on Re^2: matching with inclusion Select or Download Code