http://qs1969.pair.com?node_id=433714

As Perl programmers, we love our regular expressions. They're one of the things that make Perl so Perly. However, they're not always necessary.

If you're writing something like

if ( $value =~ /^true$/i )
then write it as
if ( lc $value eq "true" )
instead.

xoxo,
Andy

Re: You don't always have to use regexes
by kvale (Monsignor) on Feb 23, 2005 at 16:26 UTC
    Using the simplest op that gets the job done is always good advice, both for speed and readability.

    But for those who are addicted to regexes, the above situation won't bite speed too hard. The regex engine optimizes a fixed string to a Boyer-Moore match, which is a tad slower than string equality:

    use Benchmark qw(:all);

    my $value = 'FALSE';
    my $count = 10_000_000;

    cmpthese($count, {
        'regex' => sub { $value =~ /^true$/i },
        'eq'    => sub { lc $value eq "true" },
    });
    yields
    1048% perl boyer.pl
    Benchmark: timing 10000000 iterations of eq, regex...
            eq:  9 wallclock secs ( 8.98 usr +  0.00 sys =  8.98 CPU) @ 1113585.75/s (n=10000000)
         regex: 16 wallclock secs (16.31 usr +  0.00 sys = 16.31 CPU) @ 613120.78/s (n=10000000)
               Rate regex    eq
    regex  613121/s    --  -45%
    eq    1113586/s   82%    --

    Unless that match is inside a tight loop, program performance will not be too degraded.

    -Mark

Re: You don't always have to use regexes
by spurperl (Priest) on Feb 23, 2005 at 16:02 UTC
    It's quite interesting to time this and see just how much performance is gained. Additionally, I'm curious whether the regex engine has, or is planned to have, optimizations for "static" expressions like this?

    Additionally, usage of substr can save quite a few regular expressions here and there.

    But the rule of thumb should be: use whatever seems more natural for the problem at hand, and optimize only if necessary.
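A minimal sketch of the substr idea mentioned above, assuming the task is a simple prefix check. The helper names `starts_with` and `starts_with_re` and the sample strings are made up for illustration; they are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $str begins with $prefix, using substr
# and a string comparison instead of a pattern match.
sub starts_with {
    my ($str, $prefix) = @_;
    return substr($str, 0, length $prefix) eq $prefix;
}

# Hypothetical helper: the same check written as a regex, for comparison.
# \Q...\E quotes any regex metacharacters in the prefix.
sub starts_with_re {
    my ($str, $prefix) = @_;
    return $str =~ /^\Q$prefix\E/ ? 1 : 0;
}

for my $s ('true story', 'untrue') {
    printf "%-14s substr=%d regex=%d\n", "'$s'",
        starts_with($s, 'true') ? 1 : 0,
        starts_with_re($s, 'true');
}
```

Whether the substr form reads more naturally than the regex is a judgment call, which fits the rule of thumb above: pick whichever matches the problem, and measure only if it matters.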

Re: You don't always have to use regexes
by VSarkiss (Monsignor) on Feb 23, 2005 at 16:22 UTC
      I think that for the index case the situation is not so clear. Both the regex engine and index() will use the same Boyer-Moore routine and for me personally, the regex version is more readable. But as always, YMMV.
      use Benchmark qw(:all);

      my $value = 'FALSE';
      my $count = 1_000_000;

      cmpthese($count, {
          'regex' => sub { $value =~ /^true$/i },
          'eq'    => sub { lc $value eq "true" },
          'index' => sub { index( lc $value, "true" ) >= 0 },
      });
      yields
      Benchmark: timing 1000000 iterations of eq, index, regex...
              eq:  1 wallclock secs ( 0.89 usr +  0.00 sys =  0.89 CPU) @ 1123595.51/s (n=1000000)
           index:  2 wallclock secs ( 1.65 usr +  0.00 sys =  1.65 CPU) @ 606060.61/s (n=1000000)
           regex:  2 wallclock secs ( 1.63 usr +  0.00 sys =  1.63 CPU) @ 613496.93/s (n=1000000)
                 Rate index regex    eq
      index  606061/s    --   -1%  -46%
      regex  613497/s    1%    --  -45%
      eq    1123596/s   85%   83%    --
      Update: As AM has pointed out (thank you!), the benchmark above has a bug. Using the tests
      'regex'      => sub { $value =~ /true/i },
      'regex_anch' => sub { $value =~ /^true$/i },
      'eq'         => sub { lc $value eq "true" },
      'index'      => sub { index( lc $value, "true" ) >= 0 },
      I get the results
      Benchmark: timing 1000000 iterations of eq, index, regex, regex_anch...
              eq:  1 wallclock secs ( 0.88 usr +  0.00 sys =  0.88 CPU) @ 1136363.64/s (n=1000000)
           index:  0 wallclock secs ( 1.65 usr +  0.00 sys =  1.65 CPU) @ 606060.61/s (n=1000000)
           regex:  0 wallclock secs ( 1.08 usr +  0.00 sys =  1.08 CPU) @ 925925.93/s (n=1000000)
      regex_anch:  2 wallclock secs ( 1.59 usr +  0.00 sys =  1.59 CPU) @ 628930.82/s (n=1000000)
                      Rate index regex_anch regex   eq
      index       606061/s    --        -4%  -35% -47%
      regex_anch  628931/s    4%         --  -32% -45%
      regex       925926/s   53%        47%    -- -19%
      eq         1136364/s   87%        81%   23%   --
      with the surprising result that the regex w/o the anchor is faster than the anchored version. Multiple runs yield similar results. As the AM says, one could try many different regex-value combos, but I expect the results to be not far different, precisely because both index and regex engine use the same BM function.

      -Mark

        Your benchmark is significantly flawed for the question asked. The OR (original replier) wanted to compare index(lc $value, "true") with $value =~ /true/i. In addition, a fair benchmark should try multiple test cases: set $value to 'true', 'ashortstringthentrue', 'averylongstringthentrue', and strings of different sizes without 'true' in them.
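One way the multi-input comparison suggested above might be set up, as a rough sketch only. The two strings without 'true', the iteration count, and the wrapper subs `by_index` and `by_regex` are assumptions added for illustration; the extra sub call adds overhead to both candidates, so only the relative rates are meaningful.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The three inputs named in the post, plus two made-up strings
# that do not contain 'true'.
my @values = (
    'true',
    'ashortstringthentrue',
    'averylongstringthentrue',
    'short',
    'a considerably longer string that never contains the word',
);

# The two substring tests being compared (hypothetical wrapper names).
sub by_index { return index(lc $_[0], 'true') >= 0 }
sub by_regex { return $_[0] =~ /true/i ? 1 : 0 }

# Run one comparison per input; the iteration count is arbitrary.
for my $value (@values) {
    print "value: '$value'\n";
    cmpthese(1_000_000, {
        index => sub { by_index($value) },
        regex => sub { by_regex($value) },
    });
}
```

Results will vary by perl version and machine, so no numbers are shown here.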
      I benchmarked this and it yields an interesting result. index() is (a bit) faster than a regex. If it's used in combination with lc(), as in your example, the regex with the i-modifier is faster.

      use strict;
      use warnings;
      use Benchmark;

      my $value = "somewhere here true is there!";

      timethese( 9000000, {
          'index' => sub { index( $value, "true" ) },
          'regex' => sub { $value =~ /true/ },
      } );

      timethese( 9000000, {
          'index' => sub { index( lc $value, "true" ) },
          'regex' => sub { $value =~ /true/i },
      } );

      Benchmark: timing 9000000 iterations of index, regex...
           index:  2 wallclock secs ( 2.02 usr +  0.00 sys =  2.02 CPU) @ 4446640.32/s (n=9000000)
           regex:  4 wallclock secs ( 2.40 usr + -0.01 sys =  2.39 CPU) @ 3760969.49/s (n=9000000)
      Benchmark: timing 9000000 iterations of index, regex...
           index:  4 wallclock secs ( 4.55 usr +  0.00 sys =  4.55 CPU) @ 1979762.43/s (n=9000000)
           regex:  3 wallclock secs ( 3.68 usr +  0.00 sys =  3.68 CPU) @ 2448313.38/s (n=9000000)


      Update:
      Ack. I really need to learn to type faster.


      holli, /regexed monk/
Re: You don't always have to use regexes
by Anonymous Monk on Feb 24, 2005 at 03:32 UTC

    Code Smarter: Compulsory link to Japhy's node making the same suggestion, and more.

    Edited by davido: fixed broken link.

Re: You don't always have to use regexes
by ysth (Canon) on Feb 24, 2005 at 19:34 UTC
    A proper translation of if ( $value =~ /^true$/i ) would be:
    if ( lc $value eq "true" || lc $value eq "true\n" )
    (except that the former potentially sets $&, $`, and $' and the last-successful-regex).
      Yes, but that check for "\n" is really irrelevant. It's required for the two to be functionally identical, but not semantically.

      Semantics are the real issue here. The regex is saying "Do you have a string that matches the beginning of the string, then t, r, u, e and then the end of the string", and the compare is saying "Is the string the word 'true'?"

      "Is this the word I want" is the real intent.

      xoxo,
      Andy

        My point was that that is not what the regex is saying. Just my own personal bonnet-bee, but people misinterpret $ way too often, and I feel it deserves publicity whenever it comes up.
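The point about `$` can be checked with a small sketch. The helper names and sample values are made up for illustration; `\z` is the standard regex assertion that matches only at the very end of the string.

```perl
use strict;
use warnings;

# Hypothetical helpers, one per way of asking "is this the word true?".
sub re_dollar { return $_[0] =~ /^true$/i  ? 1 : 0 }  # $ also matches before a final "\n"
sub re_z      { return $_[0] =~ /^true\z/i ? 1 : 0 }  # \z matches only at the very end
sub lc_eq     { return lc $_[0] eq 'true'  ? 1 : 0 }  # plain string comparison

for my $value ("true", "TRUE\n", "true\n\n", "untrue") {
    (my $shown = $value) =~ s/\n/\\n/g;   # make newlines visible in the output
    printf "%-10s  /^true\$/i=%d  /^true\\z/i=%d  lc eq=%d\n",
        "\"$shown\"", re_dollar($value), re_z($value), lc_eq($value);
}
```

Only the `$` version accepts "TRUE\n"; the `\z` version and the `lc ... eq` comparison agree on all four inputs.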
Re: You don't always have to use regexes
by PetaMem (Priest) on Feb 24, 2005 at 21:57 UTC

    I suppose the whole point of this example is to show how to program efficiently, not wasting system resources (here: CPU time).

    If this is so, I'd like to emphasize that NO ONE here seems to see a problem in the "true" expression. Please do not use interpolation if you do not need it. Try your benchmarks with 'true' again.

    Update:

    Of course I did the benchmarks before posting this node. The speed differences are not extraordinary, but they are consistently about 5%.

    Bye
     PetaMem
        All Perl:   MT, NLP, NLU

      Actually, many of us saw it. But we also saw this: Re: To Single Quote or to Double Quote: a benchmark. The point is, the difference in speed is practically meaningless. In the grand scheme of the transition from $value =~ /true/i to lc $value eq "true", changing that to lc $value eq 'true' is going to have a demonstrably small effect.

        And to support your point, an invariant string inside double-quotes gets compiled down to a single quoted string. Any time wasted is not wasted at run-time.

        $ cat print.pl
        print 'Hello';
        print "Hello"; # compiles to 'Hello'
        print "Hello $_";
        $ perl -MO=Deparse print.pl
        print 'Hello';
        print 'Hello';
        print "Hello $_";
        print.pl syntax OK

        5.005_03, 5.6.1 and 5.8.4 produce identical results.

      I suppose the whole point of this example is to show how to program efficiently, not wasting system resources (here: CPU time).

      Absolutely not. That has nothing to do with it. CPU efficiencies on the scale that we're talking about are irrelevant.

      The point is to use the construct that most closely matches the semantics of what you're trying to achieve. If you're wondering if one string is the word "true", then that's not a pattern match, it's a string comparison.

      xoxo,
      Andy

        If you're wondering if one string is the word "true", then that's not a pattern match, it's a string comparison.

        Ok, I second that. Probably I was misled by the immediate popup of benchmarks in this thread.

        Bye
         PetaMem
            All Perl:   MT, NLP, NLU