in reply to Re^4: Nonrepeating characters in an RE
in thread Nonrepeating characters in an RE

I was quite pleasantly surprised its speed. I notice, however, that the code runs about three times more slowly under 5.30 than under 5.8. I assume this is due to the many modifications made to the regex engine over the years to accommodate Unicode. Any comment on this would be of interest.

3x slower seems an awful lot. Are both perls built with the same options?

Unicode support could be part of it: it would be worth redoing the timings under 5.30 supplying either /a or /aa as a flag on the regexp. However for these patterns (ASCII input and pattern, no use of \w-style classes) I'd expect the cost between versions to be low, and the benefit of the /a flags to be zero.

There have also been numerous small changes added in recent years as a result of bugs (mostly found by fuzzers) that had the potential to be security holes. Such changes almost always make things a tiny bit slower, and as you accumulate more and more of them they add up. So it might also be informative to run the regexp tests from 5.30 under 5.8 (which might need some adaptation) to see a sample of the bugs 5.30 doesn't have.

If you're doing timings, I'd also be interested how my code from 11146164 compares - I'd expect it to win a lot by avoiding the embedded code block, and give back a fraction of that by doing more backtracking. (You should add at least the /s flag to my qr{} for proper comparison.)

# (what is proper match behavior of empty template?)

Since we're assuming implicit anchors, I think it should just match the empty string. That wasn't a case I attempted to handle though, and I doubt the OP cares about it.

Replies are listed 'Best First'.
Re^6: Nonrepeating characters in an RE (performance)
by LanX (Saint) on Aug 21, 2022 at 18:29 UTC
    > I'd expect it to win a lot

    This will largely depend on the dictionary and the test samples.

    With post filtering a pattern like "adieu" would produce many false positives to be filtered, but the result set would be huge anyway.°

    But a pattern like "Mississippi" would produce no false positives.

    But avoiding some false positives with lookaheads might turn out to be more expensive than filtering them a posteriori.

    Assuming the OP is only looking for some matches instead of all matches, I'm confident my approach does better.

    Or at least good enough to be justified by simpler code.

    FWIW: prefer the non-embedded version.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    Update

    °) all 5 letter words in the dictionary ( not 26**5 ) minus those with at least one repetition. Probably one thousand in English?