vitaly has asked for the wisdom of the Perl Monks concerning the following question:

Is there such a match option in perl Reg. Exp. that for example "sigma" matches "sig<SOMETHING>ma"? Thanks in advance, Vitaly

Replies are listed 'Best First'.
Re: matching with inclusion
by ikegami (Patriarch) on Jul 14, 2007 at 18:16 UTC
    No, but it's easiy to build such a regexp.
    my $word = 'sigma'; my $re = join '.*?', map quotemeta, split(//, $word); foreach my $text ( 'sigma', 'stigma', 'magma', ) { print("$text ", ($text =~ $re ? "matches" : "doesn't match"), "\n") +; }
      ikegami,

      Very nice!

      When I saw your answer, I wondered whether a bunch of index calls would be faster. I didn't think so, but it was worth a shot. (But see at the end -- the index approach has BETTER worst case behaviour...)

      The approach I was proposing was:

      sub findInterleavedString { my $string=$_[0]; my $word=$_[1]; my @letters=split //,$word; my $index=0; for(my $i=0;$i<=$#letters && $index!=-1;++$i) { $index=index $string, $letters[$i], $index+1; } return $index!=-1; }

      I tested 2 versions of each approach: ikegami and rmgir "precompute" the regex (or the array) for sigma, while the _noprep variants don't.

      The results confirm that your approach kicks index all around the room :)

                         Rate  rmgir_noprep ikegami_noprep         rmgir       ikegami
      rmgir_noprep    24065/s            --           -65%          -71%          -92%
      ikegami_noprep  69743/s          190%             --          -16%          -76%
      rmgir           82694/s          244%            19%            --          -72%
      ikegami        295753/s         1129%           324%          258%            --
      
      This makes sense, of course, since the iteration from one "index" call to the next happens in perl, and the regex is all one opcode.

      But it was a fun Saturday afternoon experiment...

      Edit: Note that for REALLY pathological strings, the index approach still performs reasonably well, while the regex can become very slow or hang... For example, if matching against

      	('a' x 110) .
      	'stigm' .
      	('stigm' x 100);
      
      (which doesn't match), the regex approaches take about 2 seconds to fail 1 call, while the index approach can still make over 10000 calls per second.

      So if you have to worry about a hostile user feeding your program bad strings, for example, and your maximum performance requirement isn't too high, then the index approach might be better...

      Added missing +1 to index last argument, and note about worst case behaviour...


      Mike
      That allows arbitrary insertions anywhere in the string, so "silly gamma" would match "sigma". The requirement
      "sigma" matches "sig<SOMETHING>ma"
      can also be read as allowing for only a single insertion. That looks simpler, but the regex is harder to build. Here is one way:
      my $word = 'sigma'; my $re = join '|' => map join( '.*?' => map quotemeta, @$_) => map [ substr( $word, 0, $_), substr( $word, $_)] => 1 .. length( $word) - 1; foreach my $text ( 'sigma', 'stigma', 'silly gamma', 'magma', ) { print("$text ", ($text =~ $re ? "matches" : "doesn't match"), "\n") +; }
      It involves splitting $word in all pairs of non-empty substrings, something that comes up occasionally. I have yet to find a satisfactory solution for that. Using substr() like I did above is clear enough, but lengthy.
      map { [ $word =~ /(.{$_})(.*)/] } 1 .. length( $word) - 1;
      is shorter but less readable.

      What else?

      Update: No improvement either way, but fun:

      map { ++ pos; [ split /\G/] } ( $word) x ( length( $word) - 1);

      Anno

Re: matching with inclusion
by shmem (Chancellor) on Jul 15, 2007 at 10:17 UTC
    that for example "sigma" matches "sig<SOMETHING>ma"?

    why, of course. But that depends on what you mean by <SOMETHING> which is a period in regexp speak, followed optionally by a quantifier (see perlre):

    $text =~ /sig.*?ma/; # matches sigma and sigima # (zero or more somethings, non-greedy match) $text =~ /sig.+?ma/; # matches sigoma and sigilima # (1 or more somethings, non-greedy match) $text =~ /sig.?ma/; # matches sigma and sigima, but not sigilima # (zero or 1 something) $text =~ /sig.ma/; # just one something

    You probably know all of this, but that would be the answer to your question as written, without assuming that there's more to it than what you have expressed.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: matching with inclusion
by grashoper (Monk) on Jul 18, 2007 at 23:23 UTC
    I believe you can use an asterisk in the middle of your regex to match the pattern..something like +=~sig*ma/g