in reply to is split optimized?

It should if you give the third argument to split:
    my $x = "a b c d e f g";
    my $y = (split /\s+/, $x, 2)[0];
In this case split should only split your string once, and after it's seen the first \s+ it should stop. Now, I can't say that it will actually *do that*, but that's what it *seems* should happen.

From the split docs:

split /PATTERN/, EXPR, LIMIT ... If LIMIT is specified and positive, splits into no more than that many fields (though it may split into fewer).
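
To make the documented behaviour concrete, here is a minimal sketch comparing a limited and an unlimited split on the same string. The helper name split_fields is hypothetical, made up for this illustration; it only shows which fields come back, not how much work split does internally.

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace, with an optional LIMIT.
sub split_fields {
    my ($str, $limit) = @_;
    return defined $limit
        ? split(/\s+/, $str, $limit)
        : split(/\s+/, $str);
}

my $x = "a b c d e f g";

# LIMIT = 2: at most two fields; the unsplit remainder is the last one.
my @limited = split_fields($x, 2);
print scalar(@limited), " fields: [", join("] [", @limited), "]\n";

# No LIMIT: every field is produced.
my @all = split_fields($x);
print scalar(@all), " fields\n";
```

With a LIMIT of 2 the second field is the whole unsplit remainder, "b c d e f g", which is why taking element [0] still gives the first word.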

RE: Re: is split optimized?
by eduardo (Curate) on Jul 14, 2000 at 06:37 UTC
    well, according to the benchmark:
        #!/usr/bin/perl -w
        use strict;
        use Benchmark;

        my $x = "a b c d e f g";

        sub list_context {
            my $y = (split(/\s+/, $x))[0];
        }

        sub extra_argument {
            my $y = (split(/\s+/, $x, 2))[0];
        }

        timethese(-3, {
            "LIST CONTEXT"   => \&list_context,
            "EXTRA ARGUMENT" => \&extra_argument,
        });

        [ed@darkness ed]$ perl ./splittest.pl
        Benchmark: running EXTRA ARGUMENT, LIST CONTEXT, each for at least 3 CPU seconds...
        EXTRA ARGUMENT:  2 wallclock secs ( 3.14 usr +  0.00 sys =  3.14 CPU) @ 115563.69/s (n=362870)
          LIST CONTEXT:  3 wallclock secs ( 3.18 usr +  0.00 sys =  3.18 CPU) @ 57053.46/s (n=181430)
    the extra argument version of split kicks the living crap out of not using it... so i guess, yes ;) it does make a difference!
Using the third argument for split
by gryng (Hermit) on Jul 14, 2000 at 06:34 UTC
    It does, here are some numbers:
        Six:  10 wallclock secs ( 4.82 usr +  5.18 sys = 10.00 CPU) @ 437.30/s (n=4373)
        One:  13 wallclock secs (12.92 usr +  0.01 sys = 12.93 CPU) @ 4.87/s (n=63)
        Five: 13 wallclock secs (13.39 usr +  0.00 sys = 13.39 CPU) @ 5.15/s (n=69)
        Four: 10 wallclock secs ( 4.74 usr +  5.56 sys = 10.30 CPU) @ 443.69/s (n=4570)
    The code for this is:
        #!/usr/bin/perl -w
        use strict;
        use Benchmark;

        my $testlarge = "a " x 100000;
        my $testsmall = "a b c d e f";

        timethese(-10, {
            One   => sub { my ($y) = (split(/\s+/, $testlarge))[0]; },
            Two   => sub { my ($y) = (split(/\s+/, $testsmall))[0]; },
            Three => sub { $testsmall =~ /([^\s]*)\s+/; my $y = $1; },
            Four  => sub { $testlarge =~ /([^\s]*)\s+/; my $y = $1; },
            Five  => sub { my $y = split(/\s+/, $testlarge) },
            Six   => sub { my ($y) = split(/\s+/, $testlarge, 1) }
        });
    Note that using an equivalent regex is slightly faster, but the split done properly (using the third argument) performs at the "correct" level.

    Thanks for submitting the correct answer :) hehe.

    Chow,
    Gryn

      Thanks for submitting the correct answer :) hehe.

      But you used it in an incorrect way. If the third argument is 1, it's effectively a no-op: the whole string comes back as a single field. The third argument does not mean to discard everything after the first field.

          my ($y) = split " ", "a b", 1;
          print $y;
      
      will print a b, and not a.

      If you want to use only the first field, and use a third argument, just use:

          my ($y) = split " ", $string, 2;
      
      That's right. No indexing required. But even the limit isn't required. Just the simple:
          my ($y) = split " ", $string;
      
      will do. And because it is so simple, Perl can optimize that. Here's a benchmark program (there are brackets where indexing is used - for some reason, perlmonks strip them), and the results:
      #!/opt/perl/bin/perl -w
      
      use strict;
      use Benchmark;
      
      my $str = "a " x 6;
      
      timethese -100 => {           
          index   =>  sub {my ($y) = (split " " => $str) [0]},
          regex   =>  sub {my ($y) = $str =~ /(\S+)/},
          limit   =>  sub {my ($y) = (split " " => $str, 2) [0]},
          plain   =>  sub {my ($y) =  split " " => $str},
      }
      
      __END__
      Benchmark: running index, limit, plain, regex, each for at least 100 CPU seconds...
      index: 125 wallclock secs (105.53 usr +  0.00 sys = 105.53 CPU) @ 34487.35/s (n=3639450)
      regex: 121 wallclock secs (105.06 usr +  0.00 sys = 105.06 CPU) @ 43695.61/s (n=4590661)
      limit: 123 wallclock secs (104.03 usr +  0.02 sys = 104.05 CPU) @ 48699.04/s (n=5067135)
      plain: 120 wallclock secs (105.18 usr +  0.02 sys = 105.20 CPU) @ 52044.32/s (n=5475062)
      

      The bottom line is, if you want Perl to do the optimizing, keep your code simple.
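
      The optimization in question is the one perlfunc describes for split: when assigning to a list with LIMIT omitted, Perl treats LIMIT as one larger than the number of variables in the list. As a minimal sketch of what each form returns (the helper names below are hypothetical, made up for this illustration, and it shows results only, not timings):

      ```perl
      use strict;
      use warnings;

      # Hypothetical helpers; each returns whatever lands in the first variable.
      sub first_field_plain { my ($y) = split " ", $_[0];    return $y }
      sub first_field_limit { my ($y) = split " ", $_[0], 2; return $y }
      sub first_field_noop  { my ($y) = split " ", $_[0], 1; return $y }

      my $string = "a b c";
      print first_field_plain($string), "\n";   # prints: a
      print first_field_limit($string), "\n";   # prints: a
      print first_field_noop($string),  "\n";   # prints: a b c
      ```

      The plain and limit-2 forms agree; only the limit-1 form hands back the whole string.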

      -- Abigail

        You can post code with brackets by enclosing it in a <CODE> </CODE> block. See the Site How To for more information.

        - Matt

        Thanks, I had mistyped my numbers (I was on an old 15" monitor where the font is set to like 3 pixels high and even 640x480 looked fuzzy, yeck).

        Anyway, I noticed that there was some discrepancy between the relative speeds of regex versus split (that is, split used properly). I wanted to see why, so first I added a few more tests:

            Four  => sub { $testlarge =~ /([^\s]*)\s+/; my $y = $1; },
            Eight => sub { my $y = $testlarge =~ /(\S+)/; },
            Six   => sub { my ($y) = split(/\s+/, $testlarge, 2) },
            Seven => sub { my ($y) = split(/\s+/, $testlarge) }

        I noticed that your regex was different, so I wanted to see if that was why things were slower (however, I didn't think so, since yours was simpler).

        Running with the string equal to "a " x 100000, I got these numbers:

            Eight: 11 wallclock secs ( 4.68 usr +  5.46 sys = 10.14 CPU) @ 450.69/s (n=4570)
            Four:  10 wallclock secs ( 4.79 usr +  5.21 sys = 10.00 CPU) @ 449.60/s (n=4496)
            Seven: 10 wallclock secs ( 7.28 usr +  2.84 sys = 10.12 CPU) @ 293.97/s (n=2975)
            Six:   10 wallclock secs ( 7.13 usr +  2.88 sys = 10.01 CPU) @ 293.81/s (n=2941)

        These regex results (despite using the same regex as yours) were still 50% faster than split, rather than being 40% slower.
        Next I reduced the size of my string to "a " x 100. Here I got these numbers:
            Eight: 12 wallclock secs (10.00 usr +  0.00 sys = 10.00 CPU) @ 130970.90/s (n=1309709)
            Four:  11 wallclock secs (10.39 usr +  0.00 sys = 10.39 CPU) @ 88839.36/s (n=923041)
            Seven: 10 wallclock secs (10.49 usr +  0.00 sys = 10.49 CPU) @ 122499.33/s (n=1285018)
            Six:   10 wallclock secs (10.54 usr +  0.00 sys = 10.54 CPU) @ 121918.22/s (n=1285018)

        Now the regex code (yours) leads by less than 10%, and my regex trails by a good 30%. So I guess the conclusion is that a regex performs better than split on large scalars? I don't feel like mucking about in the perl source code right now, so this is only a guess as to why: it has nothing to do with the way regexes or splits actually process the data, but rather split is probably receiving a copy of the data, whereas the regex is receiving a reference.
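
        For anyone who wants to rerun the comparison at both sizes, here is a minimal sketch. The helper names are hypothetical, made up for this illustration, and cmpthese is used instead of timethese only to get a ratio table. It measures relative speed and cannot by itself confirm or refute the copy-versus-reference guess.

        ```perl
        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # Hypothetical helpers; each returns the first whitespace-delimited field.
        # They read $_[0] directly so the helper itself does not copy the string.
        sub first_by_regex { my ($y) = $_[0] =~ /(\S+)/; return $y }
        sub first_by_split { my ($y) = split " ", $_[0]; return $y }

        for my $repeat (100, 100_000) {
            my $str = "a " x $repeat;
            print "string of $repeat fields:\n";
            # Run each candidate for at least 1 CPU second and print a comparison.
            cmpthese(-1, {
                "regex" => sub { first_by_regex($str) },
                "split" => sub { first_by_split($str) },
            });
        }
        ```

        Results will depend on the perl version and machine, so numbers from a current perl may not match the ones above.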

        Cheers,
        Gryn