in reply to Performance optimization question

If reg exp is meant literally, you can use index to search for the literal substring - it's faster than a regular expression.

Replies are listed 'Best First'.
Re^2: Performance optimization question
by BrowserUk (Patriarch) on Apr 03, 2008 at 06:58 UTC
      Are you sure the regex engine isn't "cheating" by caching something?
      #!/usr/bin/perl use strict; use warnings; use Time::HiRes qw(time); my $num = 3_000_000; my $str = 'x'x$num . 'reg exp' . 'x'x$num; my $start = time; my $res = $str =~ m/reg exp/; print time - $start, $/; $start = time; my $idx = index $str, 'reg exp'; print time - $start, $/; __END__ 0.0114099979400635 0.011760950088501

      Admittedly, index is still a bit slower, but the difference isn't that huge.

      BTW on my machine (with perl 5.8.8) the difference isn't there at all:

      Rate REGEX INDEX REGEX 87061/s -- -2% INDEX 88494/s 2% --

      The results only differ slightly for 5.10.0. Which perl did you use?

      I thought that index and regexes use the same algorithm, but the regex goes through the pain of compiling the regex first

        Which perl did you use?
        c:\test>perl -v This is perl, v5.8.6 built for MSWin32-x86-multi-thread (with 3 registered patches, see perl -V for more detail) Copyright 1987-2004, Larry Wall Binary build 811 provided by ActiveState Corp. http://www.ActiveState. +com ActiveState is a division of Sophos. Built Dec 13 2004 09:52:01

        Things do seem to have swung the other way again with 5.10:

        for( 30, 300, 3000, 30000 ) { $s = 'x'x$_ . 'reg exp' . 'x'x$_; cmpthese -1, { REGEX => q[ $x = $s =~ /reg exp/;], INDEX => q[ $x = index $s, 'reg exp';] } };; Rate REGEX INDEX REGEX 2526899/s -- -27% INDEX 3444838/s 36% -- Rate REGEX INDEX REGEX 1083665/s -- -9% INDEX 1193837/s 10% -- Rate REGEX INDEX REGEX 159444/s -- -3% INDEX 163587/s 3% -- Rate REGEX INDEX REGEX 14881/s -- -3% INDEX 15292/s 3% -- for( 30, 300, 3000, 30000 ) { $s = 'x'x$_ . 'reg exp' . 'x'x$_; study $s; cmpthese -1, { REGEX => q[ $x = $s =~ /reg exp/;], INDEX => q[ $x = index $s, 'reg exp';] } };; Rate REGEX INDEX REGEX 2103160/s -- -41% INDEX 3552278/s 69% -- Rate REGEX INDEX REGEX 806412/s -- -32% INDEX 1187789/s 47% -- Rate REGEX INDEX REGEX 110166/s -- -33% INDEX 163886/s 49% -- Rate REGEX INDEX REGEX 11587/s -- -24% INDEX 15292/s 32% --

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      That's odd. Aren't they suppose to be using the same algorithm internally?

        I *think* that a regex will apply Boyer-Moore automatically, whereas index will only apply it if you study the string. That's how I interpret these results:

        for( 30, 300, 3000, 30000 ) { $s = 'x'x$_ . 'reg exp' . 'x'x$_; cmpthese -1, { REGEX => q[ $x = $s =~ /reg exp/;], INDEX => q[ $x = index $s, 'reg exp';] }; };; Rate INDEX REGEX INDEX 1956298/s -- -19% REGEX 2425347/s 24% -- Rate INDEX REGEX INDEX 340657/s -- -65% REGEX 973582/s 186% -- Rate INDEX REGEX INDEX 40348/s -- -74% REGEX 155495/s 285% -- Rate INDEX REGEX INDEX 3530/s -- -77% REGEX 15077/s 327% --

        With study

        for( 30, 300, 3000, 30000 ) { $s = 'x'x$_ . 'reg exp' . 'x'x$_; study $s; cmpthese -1, { REGEX => q[ $x = $s =~ /reg exp/;], INDEX => q[$x = index $s, 'reg exp';] } };; Rate REGEX INDEX REGEX 2101154/s -- -28% INDEX 2912801/s 39% -- Rate REGEX INDEX REGEX 829929/s -- -32% INDEX 1224656/s 48% -- Rate REGEX INDEX REGEX 100781/s -- -38% INDEX 161881/s 61% -- Rate REGEX INDEX REGEX 10983/s -- -26% INDEX 14792/s 35% --

        Once you study the string, index wins out (marginally as the length of the string increases), which I attbribute to the saving of not having to invoke the regex engine. The saving decreases with length as the regex startup costs are amortised over the greater length.

        Without the study, index does badly because it does a dumb search rather than an intelligent one. That makes regex get quicker with length.

        But don't be surprised if dave_the_m or demerphq pop by and point out that I'm misinterpreting.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      General request|comment: Could we have a failing test too in benchmarks (at least for regex, index, and such)?
        Sure. Go ahead, provide one.

        In general this is a really good idea, especially for regexes that might do some backtracking. In this case it won't happen, so it's not very exiting:

        # perl 5.10.0 # match: Rate REGEX INDEX REGEX 87061/s -- -1% INDEX 87682/s 1% -- # fail: Rate REGEX INDEX REGEX 43840/s -- -1% INDEX 44246/s 1% --

        Since the found text is in the middle it takes twice as long when there is no match.