matching with inclusion

vitaly has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: matching with inclusion by ikegami (Patriarch) on Jul 14, 2007 at 18:16 UTC
No, but it's easiy to build such a regexp. `my $word = 'sigma'; my $re = join '.*?', map quotemeta, split(//, $word); foreach my $text ( 'sigma', 'stigma', 'magma', ) { print("$text ", ($text =~ $re ? "matches" : "doesn't match"), "\n") +; }` [download]	[reply] [d/l]
Re^2: matching with inclusion by RMGir (Prior) on Jul 14, 2007 at 20:40 UTC
ikegami, Very nice! When I saw your answer, I wondered whether a bunch of index calls would be faster. I didn't think so, but it was worth a shot. (But see at the end -- the index approach has BETTER worst case behaviour...) The approach I was proposing was: `sub findInterleavedString { my $string=$_[0]; my $word=$_[1]; my @letters=split //,$word; my $index=0; for(my $i=0;$i<=$#letters && $index!=-1;++$i) { $index=index $string, $letters[$i], $index+1; } return $index!=-1; }` [download] I tested 2 versions of each approach: ikegami and rmgir "precompute" the regex (or the array) for sigma, while the _noprep variants don't. Read more... See the code (2 kB) The results confirm that your approach kicks index all around the room :) Rate rmgir_noprep ikegami_noprep rmgir ikegami rmgir_noprep 24065/s -- -65% -71% -92% ikegami_noprep 69743/s 190% -- -16% -76% rmgir 82694/s 244% 19% -- -72% ikegami 295753/s 1129% 324% 258% -- This makes sense, of course, since the iteration from one "index" call to the next happens in perl, and the regex is all one opcode. But it was a fun Saturday afternoon experiment... Edit: Note that for REALLY pathological strings, the index approach still performs reasonably well, while the regex can become very slow or hang... For example, if matching against ('a' x 110) . 'stigm' . ('stigm' x 100); (which doesn't match), the regex approaches take about 2 seconds to fail 1 call, while the index approach can still make over 10000 calls per second. So if you have to worry about a hostile user feeding your program bad strings, for example, and your maximum performance requirement isn't too high, then the index approach might be better... Added missing +1 to index last argument, and note about worst case behaviour... Mike	[reply] [d/l] [select]
Re^2: matching with inclusion by Anno (Deacon) on Jul 15, 2007 at 08:46 UTC
That allows arbitrary insertions anywhere in the string, so "silly gamma" would match "sigma". The requirement "sigma" matches "sig<SOMETHING>ma" can also be read as allowing for only a single insertion. That looks simpler, but the regex is harder to build. Here is one way: `my $word = 'sigma'; my $re = join '\|' => map join( '.?' => map quotemeta, @$_) => map [ substr( $word, 0, $_), substr( $word, $_)] => 1 .. length( $word) - 1; foreach my $text ( 'sigma', 'stigma', 'silly gamma', 'magma', ) { print("$text ", ($text =~ $re ? "matches" : "doesn't match"), "\n") +; }` [download] It involves splitting $word in all pairs of non-empty substrings, something that comes up occasionally. I have yet to find a satisfactory solution for that. Using `substr()` like I did above is clear enough, but lengthy. `map { [ $word =~ /(.{$_})(.)/] } 1 .. length( $word) - 1;` [download] is shorter but less readable. What else? Update: No improvement either way, but fun: `map { ++ pos; [ split /\G/] } ( $word) x ( length( $word) - 1);` [download] Anno	[reply] [d/l] [select]
Re: matching with inclusion by shmem (Chancellor) on Jul 15, 2007 at 10:17 UTC
that for example "sigma" matches "sig<SOMETHING>ma"? why, of course. But that depends on what you mean by <SOMETHING> which is a period in regexp speak, followed optionally by a quantifier (see perlre): `$text =~ /sig.*?ma/; # matches sigma and sigima # (zero or more somethings, non-greedy match) $text =~ /sig.+?ma/; # matches sigoma and sigilima # (1 or more somethings, non-greedy match) $text =~ /sig.?ma/; # matches sigma and sigima, but not sigilima # (zero or 1 something) $text =~ /sig.ma/; # just one something` [download] You probably know all of this, but that would be the answer to your question as written, without assuming that there's more to it than what you have expressed. --shmem _($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}	[reply] [d/l]
Re: matching with inclusion by grashoper (Monk) on Jul 18, 2007 at 23:23 UTC
I believe you can use an asterisk in the middle of your regex to match the pattern..something like +=~sig*ma/g	[reply]