Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I was just thinking... say you have:
    $x = "a b c d e f g";
    ($y) = (split(/\s+/,$x))[0];
Does perl optimize the split and only grab the first field (the 'a') and not continue breaking up the line after that? (I know this may not be the best place to ask this question, but I couldn't find a better one.)

Replies are listed 'Best First'.
RE: is split optimized?
by Russ (Deacon) on Jul 14, 2000 at 06:09 UTC
    Is split optimized? Yes. Most of the Perl internals are very efficient.

    Does split only find the first chunk when you only ask for the first chunk? No. Since you are calling split in a list context, it generates a list of all elements and assigns that to your list.

    You were very careful to call split in list context, BTW. In your example, $y does not have to be in parentheses, because your right-side construct puts split in list context. Both ($y) = split() and $y = (split())[0] call split in list context. You don't need both, but it certainly works as you have it.
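A minimal sketch of the point about context, using the string from the question. The variable names $paren and $slice are illustrative and not from the thread; either construct on its own puts split in list context, and both yield the same first field.

```perl
use strict;
use warnings;

my $x = "a b c d e f g";

# Parentheses on the left-hand side: list assignment, so split runs in list context.
my ($paren) = split(/\s+/, $x);

# List slice on the right-hand side: split runs in list context inside the slice.
my $slice = (split(/\s+/, $x))[0];

print "paren: $paren\n";   # prints "paren: a"
print "slice: $slice\n";   # prints "slice: a"
```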

    BTW, Seekers of Perl Wisdom is the right place to ask this kind of question.

    Russ
    Brainbench 'Most Valuable Professional' for Perl

      If you try to call split in a scalar context you will receive: "Use of implicit split to @_ is deprecated". The code will run, and will be slightly faster:
          Five: 13 wallclock secs (13.39 usr + 0.00 sys = 13.39 CPU) @ 5.15/s (n=69)
          One: 13 wallclock secs (12.92 usr + 0.01 sys = 12.93 CPU) @ 4.87/s (n=63)
      (relevant code:)
          One => sub { my ($y) = (split(/\s+/,$testlarge))[0]; },
          Five => sub { my $y = split(/\s+/,$testlarge) }
      While the scalar-context call is a tad faster, comparing against something besides split (such as a regex) shows that split doesn't optimize away the other entries. See my other post on this thread for more benchmark results.

      Ciao,
      Gryn

Re: is split optimized?
by btrott (Parson) on Jul 14, 2000 at 06:25 UTC
    It should if you give the third argument to split:
        my $x = "a b c d e f g";
        my $y = (split /\s+/, $x, 2)[0];
    In this case split should only split your string once, and after it's seen the first \s+ it should stop. Now, I can't say that it will actually *do that*, but that's what it *seems* should happen.

    From the split docs:

    split /PATTERN/, EXPR, LIMIT ... If LIMIT is specified and positive, splits into no more than that many fields (though it may split into fewer).
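To make the quoted LIMIT rule concrete, here is a small sketch with the string from the question. The array names @two and @all are illustrative. It only shows what a positive LIMIT returns; it says nothing about speed.

```perl
use strict;
use warnings;

my $x = "a b c d e f g";

# LIMIT 2: at most two fields; the second holds the unsplit remainder.
my @two = split /\s+/, $x, 2;

# No LIMIT, assigning to an array: every field is produced.
my @all = split /\s+/, $x;

print scalar(@two), " fields: [", join("][", @two), "]\n";   # 2 fields: [a][b c d e f g]
print scalar(@all), " fields: [", join("][", @all), "]\n";   # 7 fields: [a][b][c][d][e][f][g]
```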
      It does; here are some numbers:
          Six: 10 wallclock secs ( 4.82 usr + 5.18 sys = 10.00 CPU) @ 437.30/s (n=4373)
          One: 13 wallclock secs (12.92 usr + 0.01 sys = 12.93 CPU) @ 4.87/s (n=63)
          Five: 13 wallclock secs (13.39 usr + 0.00 sys = 13.39 CPU) @ 5.15/s (n=69)
          Four: 10 wallclock secs ( 4.74 usr + 5.56 sys = 10.30 CPU) @ 443.69/s (n=4570)
      The code for this is:
          #!/usr/bin/perl -w
          use strict;
          use Benchmark;

          my $testlarge = "a " x 100000;
          my $testsmall = "a b c d e f";

          timethese(-10,{
              One   => sub { my ($y) = (split(/\s+/,$testlarge))[0]; },
              Two   => sub { my ($y) = (split(/\s+/,$testsmall))[0]; },
              Three => sub { $testsmall =~ /([^\s]*)\s+/; my $y = $1; },
              Four  => sub { $testlarge =~ /([^\s]*)\s+/; my $y = $1; },
              Five  => sub { my $y = split(/\s+/,$testlarge) },
              Six   => sub { my ($y) = split(/\s+/,$testlarge,1) }
          });
      Note that using an equivalent regex is slightly faster, but the split done properly (using the third argument) performs at the "correct" level.

      Thanks for submitting the correct answer :) hehe.

      Chow,
      Gryn

        Thanks for submitting the correct answer :) hehe.

        But you used it in an incorrect way. If the third argument is 1, it's effectively a noop. The third argument does not mean to discard everything after the first field.

            my ($y) = split " ", "a b", 1;
            print $y;
        
        will print a b, and not a.

        If you want to use only the first field, and use a third argument, just use:

            my ($y) = split " ", $string, 2;
        
        That's right. No indexing required. But even the limit isn't required. Just the simple:
            my ($y) = split " ", $string;
        
        will do. And because it is so simple, Perl can optimize that. Here's a benchmark program (there are brackets where indexing is used - for some reason, perlmonks strip them), and the results:
        #!/opt/perl/bin/perl -w
        
        use strict;
        use Benchmark;
        
        my $str = "a " x 6;
        
        timethese -100 => {           
            index   =>  sub {my ($y) = (split " " => $str) [0]},
            regex   =>  sub {my ($y) = $str =~ /(\S+)/},
            limit   =>  sub {my ($y) = (split " " => $str, 2) [0]},
            plain   =>  sub {my ($y) =  split " " => $str},
        }
        
        __END__
        Benchmark: running index, limit, plain, regex, each for at least 100 CPU seconds...
        index: 125 wallclock secs (105.53 usr +  0.00 sys = 105.53 CPU) @ 34487.35/s (n=3639450)
        regex: 121 wallclock secs (105.06 usr +  0.00 sys = 105.06 CPU) @ 43695.61/s (n=4590661)
        limit: 123 wallclock secs (104.03 usr +  0.02 sys = 104.05 CPU) @ 48699.04/s (n=5067135)
        plain: 120 wallclock secs (105.18 usr +  0.02 sys = 105.20 CPU) @ 52044.32/s (n=5475062)
        

        The bottom line is, if you want Perl to do the optimizing, keep your code simple.

        -- Abigail

      well, according to the benchmark:
          #!/usr/bin/perl -w
          use strict;
          use Benchmark;

          my $x = "a b c d e f g";

          sub list_context {
              my $y = (split(/\s+/, $x))[0];
          }

          sub extra_argument {
              my $y = (split(/\s+/, $x, 2))[0];
          }

          timethese(-3, {
              "LIST CONTEXT"   => \&list_context,
              "EXTRA ARGUMENT" => \&extra_argument,
          });

          [ed@darkness ed]$ perl ./splittest.pl
          Benchmark: running EXTRA ARGUMENT, LIST CONTEXT, each for at least 3 CPU seconds...
          EXTRA ARGUMENT: 2 wallclock secs ( 3.14 usr + 0.00 sys = 3.14 CPU) @ 115563.69/s (n=362870)
          LIST CONTEXT: 3 wallclock secs ( 3.18 usr + 0.00 sys = 3.18 CPU) @ 57053.46/s (n=181430)
      the extra argument version of split kicks the living crap out of not using it... so i guess, yes ;) it does make a difference!
Re: is split optimized?
by gryng (Hermit) on Jul 14, 2000 at 06:19 UTC
    The short answer is, no:
        #!/usr/bin/perl -w
        use strict;
        use Benchmark;

        my $testlarge = "a " x 100000;
        my $testsmall = "a b c d e f";

        timethese(-10,{
            One   => sub { my ($y) = (split(/\s+/,$testlarge))[0]; },
            Two   => sub { my ($y) = (split(/\s+/,$testsmall))[0]; },
            Three => sub { $testsmall =~ /([^\s]*)\s+/; my $y = $1; },
            Four  => sub { $testlarge =~ /([^\s]*)\s+/; my $y = $1; },
        });
    And the results:
        Benchmark: running Four, One, Three, Two, each for at least 10 CPU seconds...
        Four: 10 wallclock secs ( 4.55 usr + 5.48 sys = 10.03 CPU) @ 439.88/s (n=4412)
        One: 13 wallclock secs (12.83 usr + 0.07 sys = 12.90 CPU) @ 4.88/s (n=63)
        Three: 11 wallclock secs (10.62 usr + 0.00 sys = 10.62 CPU) @ 94938.32/s (n=1008245)
        Two: 10 wallclock secs (10.01 usr + 0.00 sys = 10.01 CPU) @ 67121.98/s (n=671891)

    Enjoy!
    Gryn

      Mea culpa, fellow monk, but I must disagree with your answer. Yes, the benchmarks say something, but what they say needs, I feel, deeper interpretation.

      As Russ said, split is incredibly well optimized. Most of the perl internals are. There have been many C coders of wondrous talent poring over the code to make it so. Your code demonstrates that the AM was not using the correct tool, which is an answer to an unasked question.

      Your four pieces of code are doing radically different things. The regex is stopping after the first match, while the split must work the entire string. Until you compare apples to apples, no conclusion can be drawn. Let us run this test and do it correctly. Note the slight changes I made to the regex code. That should result in a better comparison.

          #!/usr/local/bin/perl -w
          use strict;
          use Benchmark;

          my $testlarge = "a " x 100000;
          my $testsmall = "a b c d e f";

          timethese(-10,{
              One   => sub { my ($y) = (split(/\s+/,$testlarge))[0]; },
              Two   => sub { my ($y) = (split(/\s+/,$testsmall))[0]; },
              Three => sub { my $y = ( $testsmall =~ (/([^\s]*)\s+/g))[0]; },
              Four  => sub { my $y = ( $testlarge =~ (/([^\s]*)\s+/g))[0]; },
          });

          mik@mach5:/home/mik/monk)./benchthis.pl
          Benchmark: running Four, One, Three, Two, each for at least 10 CPU seconds...
          Four: 19 wallclock secs (18.38 usr + 0.02 sys = 18.40 CPU) @ 1.14/s (n=21)
          One: 13 wallclock secs (12.72 usr + 0.00 sys = 12.72 CPU) @ 2.12/s (n=27)
          Three: 12 wallclock secs (10.28 usr + 0.00 sys = 10.28 CPU) @ 12748.55/s (n=131071)
          Two: 11 wallclock secs (10.00 usr + 0.00 sys = 10.00 CPU) @ 18600.60/s (n=186006)
      When comparing apples to apples, it seems split is highly optimized. This is more an issue of choosing the right tool for the job at hand.
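The difference between the two regex variants in this subthread can be seen without a benchmark. The sketch below reuses the $testsmall string and the pattern from the posts; the array names @first and @every are illustrative. Without /g the match returns after the first hit, while /g in list context collects every hit before any slice is taken, which is the same amount of work an unlimited split does.

```perl
use strict;
use warnings;

my $testsmall = "a b c d e f";

# Without /g: the match stops at the first hit and returns its one capture.
my @first = $testsmall =~ /([^\s]*)\s+/;

# With /g in list context: every hit is collected ("f" has no trailing
# whitespace, so it is not captured by this pattern).
my @every = $testsmall =~ /([^\s]*)\s+/g;

print scalar(@first), " capture without /g\n";   # 1 capture without /g
print scalar(@every), " captures with /g\n";     # 5 captures with /g
```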

      This rant brought to you by
      mikfire

        I agree that I was not comparing the equivalent (in work requested to be done) code. However my point was to submit code that (plainly) only got the first argument, in order to show how much more work split was doing. As pointed out later, by btrott and nardo, adding split's third argument brings it back up to regex's speed, but that is because they are now doing an equivalent amount of work.

        Anonymous Monk asked whether "perl optimize(s) the split and only grab the first field" for the line ($y) = (split(/\s+/,$x))[0]; and to that the answer is no. But as I conceded to btrott, he (and nardo) had the "correct" answer, of saying that you need to add a third argument to get split to only do the first match.

        I appreciate that you probably responded because I used regexes in my example and you did not want Anonymous Monk to mistakenly think that regexes were faster than split. However I do not think he took it that way; rather, his post seemed to convey an understanding of all of what I just mentioned above (minus the fact that we did not know/remember about the third argument in split).

        Anyway, this is mainly here to clear up the comment in the chatterbox about benchmark being a hammer. I was only using it to show, fairly concretely, that split was not stopping after the first match. I didn't mean to imply that it couldn't.

        Cheers,
        Gryn

      Excellent -- if I read your output correctly, regex is noticeably faster than split if all I want is the first field. Thanks for your helpful response. You anticipated my followup question, and you did so with code, which will enable me to test my hypothesis myself next time. "Teach a man to fish..." Thanks!
        At the risk of appearing to be all over this thread, I wanted to say "You're welcome". (I would have just /msg'd but /msg'ing to Anonymous monks doesn't work too well).
Re: is split optimized?
by nardo (Friar) on Jul 14, 2000 at 06:39 UTC
    According to perldoc -f split

    When assigning to a list, if LIMIT is omitted, Perl supplies a LIMIT one larger than the number of variables in the list

    So,
    ($y) = split(/\s+/, $x)
    is equivalent to
    ($y) = split(/\s+/, $x, 2)
    which, as others have pointed out, will only split it once.
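One way to observe the implicit LIMIT, sketched under stated assumptions: the pattern below embeds a (?{ ... }) code block, which perlre documents as an experimental feature, to count how many times the separator actually matches. The counter $matches and the other variable names are illustrative and not from the thread.

```perl
use strict;
use warnings;

our $matches;                      # package variable, bumped from inside the regex
my $x = "a b c d e f g";

# One scalar on the left: per perldoc, Perl supplies LIMIT 2 on its own.
$matches = 0;
my ($implicit) = split /\s+(?{ $matches++ })/, $x;
my $implicit_matches = $matches;

# Slice on the right: no LIMIT is supplied, so the whole string is split.
$matches = 0;
my $sliced = (split /\s+(?{ $matches++ })/, $x)[0];
my $sliced_matches = $matches;

print "implicit limit: '$implicit' after $implicit_matches separator match(es)\n";
print "list slice:     '$sliced' after $sliced_matches separator match(es)\n";
```

Both forms return the same first field; only the number of separator matches performed differs.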
Re: is split optimized?
by bs (Novice) on Jul 14, 2000 at 08:02 UTC

    I don't know if split is or isn't optimized; I just have a story related to split to share.

    In our code library, we have a function named rand_split, which looks like this:

        sub rand_split {
            my ($sep, $string) = @_;
            my ($element, $char, $pos, $end, @array);

            $end = length($string);
            $element = "";
            for ($pos = 0; $pos < $end; $pos++) {
                $char = substr($string, $pos, 1);
                if ($char eq $sep) {
                    push (@array, $element);
                    $element = "";
                } else {
                    $element = $element . $char;
                }
            }
            push (@array, $element);
            return (@array);
        }
    

    rand_split was written by a guy named Rand a couple of programmer generations ago. We don't know why he reimplemented it; we don't know a lot of things about it. However, another programmer from around that generation claimed that "If he did it, there must have been a reason." It's not like he didn't know split existed, meaning he purposefully reinvented a wheel. As nearly as we can tell, it walks like split and talks like split; therefore, it is split. It's used in one script these days, and probably zero in the very near future, but I think we've kept it around mainly for gag value, giving every new generation of programmer something to wonder about.

      Well, rand_split doesn't walk and talk like split. Split takes a /PATTERN/ as the first argument, not a single character (and it can take a third argument, etc, etc). I can see a possible motivation for the redundant implementation, then: "Rand" might have been thinking that his version could be faster than split since he doesn't have to worry about regex matches. He probably should have checked this assumption out, though, since it's blatantly wrong. I threw together this script (without warnings or strict! horror!) to check the efficiency of rand_split:
          use Benchmark;

          @chars = ('x', ' ');
          $string = "";
          $string .= $chars[rand 2] for (1..1000);

          timethese( 5000, {
              'split'      => 'split / /, $string',
              'rand_split' => 'rand_split(" ", $string)'
          });

          sub rand_split {
              my ($sep, $string) = @_;
              my ($element, $char, $pos, $end, @array);

              $end = length($string);
              $element = "";
              for ($pos = 0; $pos < $end; $pos++) {
                  $char = substr($string, $pos, 1);
                  if ($char eq $sep) {
                      push (@array, $element);
                      $element = "";
                  } else {
                      $element = $element . $char;
                  }
              }
              push (@array, $element);
              return (@array);
          }

      These were the disheartening results:

          rand_split: 43 wallclock secs (40.37 usr + 0.01 sys = 40.38 CPU) @ 123.83/s (n=5000)
          split: 4 wallclock secs ( 3.73 usr + 0.00 sys = 3.73 CPU) @ 1342.28/s (n=5000)
      Clearly, split is quite capable of optimizing static patterns and doing it much faster than Perl code (since split is, of course, implemented in C). Gag value is about all you'll get out of this routine ;-).
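For anyone retiring rand_split, a hedged sketch of a drop-in built on split. The helper name split_like_rand is hypothetical. It assumes a single-character separator and a non-empty string: rand_split returns one empty field for an empty string, whereas split returns an empty list. The separator is quoted with \Q...\E so it is matched literally, and LIMIT -1 keeps trailing empty fields, which rand_split keeps and a plain split would drop.

```perl
use strict;
use warnings;

# rand_split as posted in the thread, reproduced so the comparison is self-contained.
sub rand_split {
    my ($sep, $string) = @_;
    my ($element, $char, $pos, $end, @array);

    $end = length($string);
    $element = "";
    for ($pos = 0; $pos < $end; $pos++) {
        $char = substr($string, $pos, 1);
        if ($char eq $sep) {
            push (@array, $element);
            $element = "";
        } else {
            $element = $element . $char;
        }
    }
    push (@array, $element);
    return (@array);
}

# Hypothetical replacement: literal separator, negative LIMIT to keep
# trailing empty fields.
sub split_like_rand {
    my ($sep, $string) = @_;
    return split /\Q$sep\E/, $string, -1;
}

my $string = "x  xx x  ";
my @old = rand_split(" ", $string);
my @new = split_like_rand(" ", $string);

print "rand_split:      [", join("][", @old), "]\n";
print "split_like_rand: [", join("][", @new), "]\n";
```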

        Hmm... As you've clearly demonstrated the code runs like a three legged greyhound. I would be interested to know what version of Perl Mr Rand was using when he wrote the function and do the benchmark with it. You may find the answer is the same. Then again you may find that the split function has been optimised since rand_split was written.

        Nuance

RE: is split optimized?
by Anonymous Monk on Jul 14, 2000 at 15:37 UTC
    It will optimize it if you use this syntax instead:
        ($y) = split(/\s+/,$x);
    As it says in the manual (perldoc -f split):
        ($login, $passwd, $remainder) = split(/:/, $_, 3);
    When assigning to a list, if LIMIT is omitted, Perl supplies a LIMIT one larger than the number of variables in the list, to avoid unnecessary work. For the list above LIMIT would have been 4 by default. In time critical applications it behooves you not to split into more fields than you really need.
    -AM