in reply to Re: Performance optimization question
in thread Performance optimization question

Not always:

$s = 'x'x3000 . 'reg exp' . 'x'x3000; cmpthese -1, { REGEX => q[ $x = $s =~ /reg exp/;], INDEX => q[ $x = index $s, 'reg exp';] };; Rate INDEX REGEX INDEX 40470/s -- -74% REGEX 156392/s 286% --

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re^3: Performance optimization question
by moritz (Cardinal) on Apr 03, 2008 at 07:55 UTC
    Are you sure the regex engine isn't "cheating" by caching something?
    #!/usr/bin/perl use strict; use warnings; use Time::HiRes qw(time); my $num = 3_000_000; my $str = 'x'x$num . 'reg exp' . 'x'x$num; my $start = time; my $res = $str =~ m/reg exp/; print time - $start, $/; $start = time; my $idx = index $str, 'reg exp'; print time - $start, $/; __END__ 0.0114099979400635 0.011760950088501

    Admittedly, index is still a bit slower, but the difference isn't that huge.

    BTW on my machine (with perl 5.8.8) the difference isn't there at all:

    Rate REGEX INDEX REGEX 87061/s -- -2% INDEX 88494/s 2% --

    The results only differ slightly for 5.10.0. Which perl did you use?

    I thought that index and regexes use the same algorithm, but the regex goes through the pain of compiling the regex first

      Which perl did you use?
      c:\test>perl -v This is perl, v5.8.6 built for MSWin32-x86-multi-thread (with 3 registered patches, see perl -V for more detail) Copyright 1987-2004, Larry Wall Binary build 811 provided by ActiveState Corp. http://www.ActiveState. +com ActiveState is a division of Sophos. Built Dec 13 2004 09:52:01

      Things do seem to have swung the other way again with 5.10:

      for( 30, 300, 3000, 30000 ) { $s = 'x'x$_ . 'reg exp' . 'x'x$_; cmpthese -1, { REGEX => q[ $x = $s =~ /reg exp/;], INDEX => q[ $x = index $s, 'reg exp';] } };; Rate REGEX INDEX REGEX 2526899/s -- -27% INDEX 3444838/s 36% -- Rate REGEX INDEX REGEX 1083665/s -- -9% INDEX 1193837/s 10% -- Rate REGEX INDEX REGEX 159444/s -- -3% INDEX 163587/s 3% -- Rate REGEX INDEX REGEX 14881/s -- -3% INDEX 15292/s 3% -- for( 30, 300, 3000, 30000 ) { $s = 'x'x$_ . 'reg exp' . 'x'x$_; study $s; cmpthese -1, { REGEX => q[ $x = $s =~ /reg exp/;], INDEX => q[ $x = index $s, 'reg exp';] } };; Rate REGEX INDEX REGEX 2103160/s -- -41% INDEX 3552278/s 69% -- Rate REGEX INDEX REGEX 806412/s -- -32% INDEX 1187789/s 47% -- Rate REGEX INDEX REGEX 110166/s -- -33% INDEX 163886/s 49% -- Rate REGEX INDEX REGEX 11587/s -- -24% INDEX 15292/s 32% --

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: Performance optimization question
by ikegami (Patriarch) on Apr 03, 2008 at 07:16 UTC
    That's odd. Aren't they suppose to be using the same algorithm internally?

      I *think* that a regex will apply Boyer-Moore automatically, whereas index will only apply it if you study the string. That's how I interpret these results:

      for( 30, 300, 3000, 30000 ) { $s = 'x'x$_ . 'reg exp' . 'x'x$_; cmpthese -1, { REGEX => q[ $x = $s =~ /reg exp/;], INDEX => q[ $x = index $s, 'reg exp';] }; };; Rate INDEX REGEX INDEX 1956298/s -- -19% REGEX 2425347/s 24% -- Rate INDEX REGEX INDEX 340657/s -- -65% REGEX 973582/s 186% -- Rate INDEX REGEX INDEX 40348/s -- -74% REGEX 155495/s 285% -- Rate INDEX REGEX INDEX 3530/s -- -77% REGEX 15077/s 327% --

      With study

      for( 30, 300, 3000, 30000 ) { $s = 'x'x$_ . 'reg exp' . 'x'x$_; study $s; cmpthese -1, { REGEX => q[ $x = $s =~ /reg exp/;], INDEX => q[$x = index $s, 'reg exp';] } };; Rate REGEX INDEX REGEX 2101154/s -- -28% INDEX 2912801/s 39% -- Rate REGEX INDEX REGEX 829929/s -- -32% INDEX 1224656/s 48% -- Rate REGEX INDEX REGEX 100781/s -- -38% INDEX 161881/s 61% -- Rate REGEX INDEX REGEX 10983/s -- -26% INDEX 14792/s 35% --

      Once you study the string, index wins out (marginally as the length of the string increases), which I attbribute to the saving of not having to invoke the regex engine. The saving decreases with length as the regex startup costs are amortised over the greater length.

      Without the study, index does badly because it does a dumb search rather than an intelligent one. That makes regex get quicker with length.

      But don't be surprised if dave_the_m or demerphq pop by and point out that I'm misinterpreting.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: Performance optimization question
by parv (Parson) on Apr 03, 2008 at 08:03 UTC
    General request|comment: Could we have a failing test too in benchmarks (at least for regex, index, and such)?
      Sure. Go ahead, provide one.

      In general this is a really good idea, especially for regexes that might do some backtracking. In this case it won't happen, so it's not very exiting:

      # perl 5.10.0 # match: Rate REGEX INDEX REGEX 87061/s -- -1% INDEX 87682/s 1% -- # fail: Rate REGEX INDEX REGEX 43840/s -- -1% INDEX 44246/s 1% --

      Since the found text is in the middle it takes twice as long when there is no match.

        After all examples given by all of you I decided that the best way for me to improve my performance is
        1) to use regex for forming array
        2) regex should be as simple as possible which can be done by adjusting it to my particular case

        So could you confirm that it will always work correctly:
        my string is always like
        143,2|3674,9|12,5|......|145,9
        so in |X,Y| Y is always single digit
        and I want to have an array of pairs
        @arr = (X1,Y1)(X2,Y2)... Then the simplest way is:
        $r = 9; # or any single number my @arr = $string =~ /([^,|]+\,$r)/g;
        It actually works but am I right?