in reply to matching with inclusion

No, but it's easiy to build such a regexp.
my $word = 'sigma'; my $re = join '.*?', map quotemeta, split(//, $word); foreach my $text ( 'sigma', 'stigma', 'magma', ) { print("$text ", ($text =~ $re ? "matches" : "doesn't match"), "\n") +; }

Replies are listed 'Best First'.
Re^2: matching with inclusion
by RMGir (Prior) on Jul 14, 2007 at 20:40 UTC
    ikegami,

    Very nice!

    When I saw your answer, I wondered whether a bunch of index calls would be faster. I didn't think so, but it was worth a shot. (But see at the end -- the index approach has BETTER worst case behaviour...)

    The approach I was proposing was:

    sub findInterleavedString { my $string=$_[0]; my $word=$_[1]; my @letters=split //,$word; my $index=0; for(my $i=0;$i<=$#letters && $index!=-1;++$i) { $index=index $string, $letters[$i], $index+1; } return $index!=-1; }

    I tested 2 versions of each approach: ikegami and rmgir "precompute" the regex (or the array) for sigma, while the _noprep variants don't.

    The results confirm that your approach kicks index all around the room :)

                       Rate  rmgir_noprep ikegami_noprep         rmgir       ikegami
    rmgir_noprep    24065/s            --           -65%          -71%          -92%
    ikegami_noprep  69743/s          190%             --          -16%          -76%
    rmgir           82694/s          244%            19%            --          -72%
    ikegami        295753/s         1129%           324%          258%            --
    
    This makes sense, of course, since the iteration from one "index" call to the next happens in perl, and the regex is all one opcode.

    But it was a fun Saturday afternoon experiment...

    Edit: Note that for REALLY pathological strings, the index approach still performs reasonably well, while the regex can become very slow or hang... For example, if matching against

    	('a' x 110) .
    	'stigm' .
    	('stigm' x 100);
    
    (which doesn't match), the regex approaches take about 2 seconds to fail 1 call, while the index approach can still make over 10000 calls per second.

    So if you have to worry about a hostile user feeding your program bad strings, for example, and your maximum performance requirement isn't too high, then the index approach might be better...

    Added missing +1 to index last argument, and note about worst case behaviour...


    Mike
Re^2: matching with inclusion
by Anno (Deacon) on Jul 15, 2007 at 08:46 UTC
    That allows arbitrary insertions anywhere in the string, so "silly gamma" would match "sigma". The requirement
    "sigma" matches "sig<SOMETHING>ma"
    can also be read as allowing for only a single insertion. That looks simpler, but the regex is harder to build. Here is one way:
    my $word = 'sigma'; my $re = join '|' => map join( '.*?' => map quotemeta, @$_) => map [ substr( $word, 0, $_), substr( $word, $_)] => 1 .. length( $word) - 1; foreach my $text ( 'sigma', 'stigma', 'silly gamma', 'magma', ) { print("$text ", ($text =~ $re ? "matches" : "doesn't match"), "\n") +; }
    It involves splitting $word in all pairs of non-empty substrings, something that comes up occasionally. I have yet to find a satisfactory solution for that. Using substr() like I did above is clear enough, but lengthy.
    map { [ $word =~ /(.{$_})(.*)/] } 1 .. length( $word) - 1;
    is shorter but less readable.

    What else?

    Update: No improvement either way, but fun:

    map { ++ pos; [ split /\G/] } ( $word) x ( length( $word) - 1);

    Anno