in reply to Dot star okay, or not?

You rang? ;)

The problem with the dot star in your regex is in how it's used. Since you are using minimal matching, it should be quicker than a greedy expression with all of its backtracking, but you've chosen to match to the end of the string. After every character that .*? consumes, the engine has to try \s*$ again to see whether only whitespace remains, and that repeated probing makes this regex inefficient.

A couple of monks advocated a solution similar to the following:

$data =~ s/^\s*//; $data =~ s/\s*$//;

That solution works and it's faster than what you have listed, but since it matches zero or more spaces, it will always do a substitution, even if there is nothing to substitute. Try changing the asterisk to a plus and it will run much faster. The proof is in the Benchmark:

use Benchmark;

sub dotstar {
    my $data = $testdata;
    $data =~ s/^\s*(.*?)\s*$/$1/;
    return $data;
}

sub first_n_last {
    my $data = $testdata;
    $data =~ s/^\s*//;
    $data =~ s/\s*$//;
    return $data;
}

sub first_n_last_must_match {
    my $data = $testdata;
    $data =~ s/^\s+//;
    $data =~ s/\s+$//;
    return $data;
}

$testdata = ' ' x 200 . "abcd" x 20 . " " x 200;

timethese( 100000, {
    dotstar        => '&dotstar',
    first_n_last_1 => '&first_n_last',
    first_n_last_2 => '&first_n_last_must_match',
} );

That produces the following results:

Benchmark: timing 100000 iterations of dotstar, first_n_last_1, first_n_last_2...
dotstar:  7 wallclock secs ( 6.91 usr + 0.02 sys = 6.93 CPU) @ 14430.01/s (n=100000)
first_n_last_1:  4 wallclock secs ( 4.21 usr + 0.00 sys = 4.21 CPU) @ 23775.56/s (n=100000)
first_n_last_2:  2 wallclock secs ( 1.30 usr + 0.00 sys = 1.30 CPU) @ 76804.92/s (n=100000)
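
If you want to see the "always does a substitution" point without a benchmark, here is a minimal sketch. It relies on s/// returning the number of substitutions made (false if none). The helper name count_subs and its $use_plus flag are made up for this illustration, not something from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: report how many of the two trim substitutions
# actually fired for a given string.
sub count_subs {
    my ($data, $use_plus) = @_;

    # s/// returns a true count when it substitutes, false otherwise.
    my $lead  = $use_plus ? ($data =~ s/^\s+//) : ($data =~ s/^\s*//);
    my $trail = $use_plus ? ($data =~ s/\s+$//) : ($data =~ s/\s*$//);

    return ($lead ? 1 : 0) + ($trail ? 1 : 0);
}

# "abcd" has no whitespace to trim at either end.
print count_subs("abcd", 0), "\n";  # star version: prints 2
print count_subs("abcd", 1), "\n";  # plus version: prints 0
```

With the asterisk, both substitutions "succeed" by replacing an empty match with nothing. With the plus, the match simply fails and no substitution happens.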

Usual disclaimer: Don't forget that a general rule is not an inflexible one. The mileage you get out of various solutions may vary. Your regex is fine if you're only testing a couple of lines and aren't worried about performance. It's easy to read and I wouldn't sweat it. If, however, you're working with large data sets, you probably want the faster solutions.
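
For the large-data case, one way to package the faster form is a small helper. This is only a sketch: the name trim and the decision to pass undef straight through are my assumptions, so adjust to taste.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two-step \s+ approach from the benchmark.
sub trim {
    my ($data) = @_;
    return $data unless defined $data;   # assumption: leave undef alone

    $data =~ s/^\s+//;   # strip leading whitespace, if there is any
    $data =~ s/\s+$//;   # strip trailing whitespace, if there is any
    return $data;
}

my $padded = ' ' x 200 . "abcd" x 20 . ' ' x 200;
print length( trim($padded) ), "\n";   # prints 80
```

It works on a copy, so the caller's string is left untouched, just like the subs in the benchmark.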

Cheers,
Ovid

Vote for paco!

Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re(2): Dot star okay, or not?
by Cirollo (Friar) on Jul 05, 2001 at 21:45 UTC
    Can you tell me why physi's code is slower than the rest? He suggested this:

    $data =~ s/(^\s*|\s*$)//g;

    I added these two subs to your benchmark:

    sub both_at_once {
        my $data = $testdata;
        $data =~ s/(^\s+|\s+$)//g;
        return $data;
    }

    sub both_at_once2 {
        my $data = $testdata;
        $data =~ s/(^\s*|\s*$)//g;
        return $data;
    }

    And this was the result:

    Benchmark: timing 100000 iterations of both_at_once, both_at_once2, dotstar, first_n_last_1, first_n_last_2...
    both_at_once: 10 wallclock secs ( 9.04 usr + 0.00 sys = 9.04 CPU) @ 11061.95/s (n=100000)
    both_at_once2: 11 wallclock secs (10.40 usr + 0.00 sys = 10.40 CPU) @ 9615.38/s (n=100000)
    dotstar:  9 wallclock secs ( 8.30 usr + 0.00 sys = 8.30 CPU) @ 12048.19/s (n=100000)
    first_n_last_1:  6 wallclock secs ( 5.77 usr + 0.00 sys = 5.77 CPU) @ 17331.02/s (n=100000)
    first_n_last_2:  2 wallclock secs ( 2.31 usr + 0.00 sys = 2.31 CPU) @ 43290.04/s (n=100000)

    Unless I'm mistaken, the pattern alternation (^\s+|\s+$) will try to match both patterns on every character. But, does the engine not know to disregard the ^\s+ except at the beginning of the string, and likewise for \s+$, only trying to match at the end? Just curious as to why this is so slow.

      If you really want to get a good handle on how regular expressions work, try reading "Mastering Regular Expressions" by Jeffrey Friedl. Further, you can try the re pragma to see the regex engine at work:

      use strict;
      use re 'debug';

      my $string = 'abcdC';
      print "Matched: $1\n" if $string =~ /((?<!b)[cC])/;

      Try various strings and regexes and you'll begin to understand that output. The nice thing is that this will also show you some of the optimizations that the regex engine performs.

      Cheers,
      Ovid

      To answer you, no, Perl doesn't optimize your regex to look only at the beginning and end of the string. Sorry.

      japhy -- Perl and Regex Hacker