vitaly has asked for the wisdom of the Perl Monks concerning the following question:
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: matching with inclusion
by ikegami (Patriarch) on Jul 14, 2007 at 18:16 UTC | |
| [reply] [d/l] |
by RMGir (Prior) on Jul 14, 2007 at 20:40 UTC | |
Very nice! When I saw your answer, I wondered whether a bunch of index calls would be faster. I didn't think so, but it was worth a shot. (But see at the end -- the index approach has BETTER worst case behaviour...) The approach I was proposing was:
I tested 2 versions of each approach: ikegami and rmgir "precompute" the regex (or the array) for sigma, while the _noprep variants don't. The results confirm that your approach kicks index all around the room :)
Rate rmgir_noprep ikegami_noprep rmgir ikegami
rmgir_noprep 24065/s -- -65% -71% -92%
ikegami_noprep 69743/s 190% -- -16% -76%
rmgir 82694/s 244% 19% -- -72%
ikegami 295753/s 1129% 324% 258% --
This makes sense, of course, since the iteration from one "index" call to the next happens in perl, and the regex is all one opcode.
But it was a fun Saturday afternoon experiment... Edit: Note that for REALLY pathological strings, the index approach still performs reasonably well, while the regex can become very slow or hang... For example, if matching against
('a' x 110) .
'stigm' .
('stigm' x 100);
(which doesn't match), the regex approaches take about 2 seconds to fail 1 call, while the index approach can still make over 10000 calls per second.
So if you have to worry about a hostile user feeding your program bad strings, for example, and your maximum performance requirement isn't too high, then the index approach might be better... Added missing +1 to index last argument, and note about worst case behaviour... Mike | [reply] [d/l] [select] |
by Anno (Deacon) on Jul 15, 2007 at 08:46 UTC | |
"sigma" matches "sig<SOMETHING>ma"can also be read as allowing for only a single insertion. That looks simpler, but the regex is harder to build. Here is one way: It involves splitting $word in all pairs of non-empty substrings, something that comes up occasionally. I have yet to find a satisfactory solution for that. Using substr() like I did above is clear enough, but lengthy. is shorter but less readable. What else?
Update: No improvement either way, but fun:
Anno | [reply] [d/l] [select] |
|
Re: matching with inclusion
by shmem (Chancellor) on Jul 15, 2007 at 10:17 UTC | |
that for example "sigma" matches "sig<SOMETHING>ma"? why, of course. But that depends on what you mean by <SOMETHING> which is a period in regexp speak, followed optionally by a quantifier (see perlre):
You probably know all of this, but that would be the answer to your question as written, without assuming that there's more to it than what you have expressed. --shmem
| [reply] [d/l] |
|
Re: matching with inclusion
by grashoper (Monk) on Jul 18, 2007 at 23:23 UTC | |
| [reply] |