in reply to Re: is split optimized?
in thread is split optimized?

It does, here are some numbers:
    Six: 10 wallclock secs ( 4.82 usr + 5.18 sys = 10.00 CPU) @ 437.30/s (n=4373)
    One: 13 wallclock secs (12.92 usr + 0.01 sys = 12.93 CPU) @ 4.87/s (n=63)
    Five: 13 wallclock secs (13.39 usr + 0.00 sys = 13.39 CPU) @ 5.15/s (n=69)
    Four: 10 wallclock secs ( 4.74 usr + 5.56 sys = 10.30 CPU) @ 443.69/s (n=4570)
The code for this is:
    #!/usr/bin/perl -w
    use strict;
    use Benchmark;

    my $testlarge = "a " x 100000;
    my $testsmall = "a b c d e f";

    timethese(-10,{
        One   => sub { my ($y) = (split(/\s+/,$testlarge))[0]; },
        Two   => sub { my ($y) = (split(/\s+/,$testsmall))[0]; },
        Three => sub { $testsmall =~ /([^\s]*)\s+/; my $y = $1; },
        Four  => sub { $testlarge =~ /([^\s]*)\s+/; my $y = $1; },
        Five  => sub { my $y = split(/\s+/,$testlarge) },
        Six   => sub { my ($y) = split(/\s+/,$testlarge,1) }
    });
Note that using an equivalent regex is slightly faster, but the split done properly (using the third argument) performs at the "correct" level.

Thanks for submitting the correct answer :) hehe.

Chow,
Gryn

RE: Using the third argument for split
by Abigail (Deacon) on Jul 14, 2000 at 10:09 UTC
    Thanks for submitting the correct answer :) hehe.

    But you used it in an incorrect way. If the third argument is 1, it's effectively a noop. The third argument does not mean to discard everything after the first field.

        my ($y) = split " ", "a b", 1;
        print $y;
    
    will print a b, and not a.
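
To make the LIMIT semantics concrete, here is a minimal sketch (the sample string and the variable names are illustrative, not taken from the thread). LIMIT is the maximum number of fields split returns, so whatever is left of the string ends up in the last field rather than being discarded:

```perl
use strict;
use warnings;

my $string = "a b c d";

# LIMIT 1: at most one field, so the string comes back unsplit.
my @one = split " ", $string, 1;

# LIMIT 2: the first field, then the entire remainder as the second field.
my @two = split " ", $string, 2;

# LIMIT 3: two split fields, then the remainder as the third field.
my @three = split " ", $string, 3;

# Print the field count and the fields, separated by "|".
print scalar(@$_), ": ", join("|", @$_), "\n" for \@one, \@two, \@three;
```

Only with a LIMIT of 2 or more does the first element become the first whitespace-separated field.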

    If you want to use only the first field, and use a third argument, just use:

        my ($y) = split " ", $string, 2;
    
    That's right. No indexing required. But even the limit isn't required. Just the simple:
        my ($y) = split " ", $string;
    
    will do. And because it is so simple, Perl can optimize that. Here's a benchmark program (there are brackets where indexing is used - for some reason, perlmonks strip them), and the results:
    #!/opt/perl/bin/perl -w
    
    use strict;
    use Benchmark;
    
    my $str = "a " x 6;
    
    timethese -100 => {           
        index   =>  sub {my ($y) = (split " " => $str) [0]},
        regex   =>  sub {my ($y) = $str =~ /(\S+)/},
        limit   =>  sub {my ($y) = (split " " => $str, 2) [0]},
        plain   =>  sub {my ($y) =  split " " => $str},
    }
    
    __END__
    Benchmark: running index, limit, plain, regex, each for at least 100 CPU seconds...
    index: 125 wallclock secs (105.53 usr +  0.00 sys = 105.53 CPU) @ 34487.35/s (n=3639450)
    regex: 121 wallclock secs (105.06 usr +  0.00 sys = 105.06 CPU) @ 43695.61/s (n=4590661)
    limit: 123 wallclock secs (104.03 usr +  0.02 sys = 104.05 CPU) @ 48699.04/s (n=5067135)
    plain: 120 wallclock secs (105.18 usr +  0.02 sys = 105.20 CPU) @ 52044.32/s (n=5475062)
    

    The bottom line is, if you want Perl to do the optimizing, keep your code simple.
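
As a small check of the three split spellings in the benchmark: perldoc -f split documents that when split is assigned to a list and LIMIT is omitted, LIMIT is treated as one larger than the number of variables in the list, so the plain form does not have to split the whole string. The sketch below (sample string and variable names are assumptions for illustration) only confirms that the three forms return the same first field; it does not measure speed.

```perl
use strict;
use warnings;

my $string = "first second third fourth";

# List slice: split produces every field, then element 0 is kept.
my ($indexed) = (split " ", $string)[0];

# Explicit LIMIT of 2: the first field plus the remainder.
my ($limited) = split " ", $string, 2;

# LIMIT omitted in a list assignment: Perl supplies the limit itself.
my ($plain) = split " ", $string;

print "$_\n" for $indexed, $limited, $plain;
```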

    -- Abigail

      You can post code with brackets by enclosing it in a <CODE> </CODE> block. See the Site How To for more information.

      - Matt

        Urg, that's truly twisted. Not only does <CODE> already have a meaning in HTML, but picking the indexing brackets as a shortcut for links on a site devoted to programming isn't the most convenient choice.

        -- Abigail

      Thanks, I had mistyped my numbers (I was on an old 15" monitor where the font is set to like 3 pixels high and even 640x480 looked fuzzy, yeck).

      Anyway, I noticed that there was some discrepancy between the relative speeds of regex versus split (that is, split used properly). I wanted to see why, so first I added a few more tests:

          Four  => sub { $testlarge =~ /([^\s]*)\s+/; my $y = $1; },
          Eight => sub { my $y = $testlarge =~ /(\S+)/; },
          Six   => sub { my ($y) = split(/\s+/, $testlarge,2) },
          Seven => sub { my ($y) = split(/\s+/, $testlarge) }

      I noticed that your regex was different, so I wanted to see if that was why things were slower (however, I didn't think so, since yours was simpler).

      Running with the string equal to "a " x 100000, I got these numbers:

          Eight: 11 wallclock secs ( 4.68 usr + 5.46 sys = 10.14 CPU) @ 450.69/s (n=4570)
          Four: 10 wallclock secs ( 4.79 usr + 5.21 sys = 10.00 CPU) @ 449.60/s (n=4496)
          Seven: 10 wallclock secs ( 7.28 usr + 2.84 sys = 10.12 CPU) @ 293.97/s (n=2975)
          Six: 10 wallclock secs ( 7.13 usr + 2.88 sys = 10.01 CPU) @ 293.81/s (n=2941)

      So the regexes (even Eight, which uses the same regex as yours) were still 50% faster than split, rather than being 40% slower.
      Next I reduced the size of my string to: "a " x 100 . Here I got these numbers:
          Eight: 12 wallclock secs (10.00 usr + 0.00 sys = 10.00 CPU) @ 130970.90/s (n=1309709)
          Four: 11 wallclock secs (10.39 usr + 0.00 sys = 10.39 CPU) @ 88839.36/s (n=923041)
          Seven: 10 wallclock secs (10.49 usr + 0.00 sys = 10.49 CPU) @ 122499.33/s (n=1285018)
          Six: 10 wallclock secs (10.54 usr + 0.00 sys = 10.54 CPU) @ 121918.22/s (n=1285018)

      Now the regex code (yours) leads by less than 10%, and my regex trails by a good 30%. So, I guess the conclusion is that regex performs better than split on large scalars? I don't feel like mucking in the perl source code right now, so my guess as to why this is, has nothing to do with the way regexes or splits actually process the data, but rather that split is probably receiving a copy of the data, whereas regex is receiving a reference.

      Cheers,
      Gryn

        I don't feel like mucking in the perl source code right now, so my guess as to why this is, has nothing to do with the way regexes or splits actually process the data, but rather that split is probably receiving a copy of the data, whereas regex is receiving a reference.

        No, it has to do with what split and regex produce, not with what they get (behind the scenes, everything is a reference anyway). The regex only has to create a short new string, while the split (even if there's a split into 2 fields) has to create a large new string. And that's taking time.
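
A rough way to see the difference in what each approach produces, assuming the same "a " x 100000 test string used in the benchmarks above (the variable names are illustrative). The sketch captures both fields of the split so their sizes can be inspected; it shows the size of the results, not the timing:

```perl
use strict;
use warnings;

my $large = "a " x 100_000;    # 200,000 characters, as in the benchmark

# split with LIMIT 2 returns the first field plus the whole remainder.
my @fields = split /\s+/, $large, 2;

# The regex match in list context returns only the captured first field.
my ($first) = $large =~ /(\S+)/;

printf "split: %d fields, lengths %d and %d\n",
    scalar(@fields), map { length $_ } @fields;
printf "regex: 1 capture, length %d\n", length $first;
```

The second split field is nearly as long as the input, while the regex capture is a single character.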

        So, if you have a gigantic string, and you only want the first, short field, a regex is the way to go. But usually you encounter short strings, and that's when split works better, despite itself using a regex. (But a much simpler regex, and it even might be that the case of " " is optimized itself too).

        -- Abigail