Is foreach split Optimized? (Update: No.)

haukex has asked for the wisdom of the Perl Monks concerning the following question:

I think I've found an interesting optimization, but even after some digging here on PerlMonks, the Perl docs, and looking through the Camel I haven't yet found it documented*. Of course, it's also possible that I made a mistake in my benchmark, or that this optimization is common knowledge, in which case I would be happy to be enlightened :-)

<update> * Because it isn't there :-) It appears that the answer is that while split is quite fast, it still splits the string into an array before iterating over that array (that's what the memory consumption seems to show). The filehandle method proposed by Laurent_R seems to be the best way to go about my task instead, assuming what you're splitting on is a fixed string. See the all the replies below for more details. </update>

I had a multiline string and wanted to iterate through the lines, and became curious what the fastest way to do that was. I was pleasantly surprised that in my benchmark, on v5.24.1, for (split /\n/, $string) was fastest, despite my original worry that it might split the string into a long list before iterating over it.

I know that foreach (1..1000000) has been optimized to use an iterator instead of building a huge list since 5.005, and I found some discussion of the optimization of split // in this thread.

I also found several references to @x = split ... being optimized, which I think might be the reason that for (split ...) is so fast. I don't know how long this optimization has been present, but I found, for example, commit e4e95921cd0fd0, which seems to indicate it's been present since Perl 3. If anyone knows more and wants to set the record straight, please do so! :-)

use warnings;
use strict;
use Data::Dump qw/dd pp/;
use Benchmark qw/cmpthese/;

# example output:
# 5.024001
#         Rate regex index split
# regex 9.64/s    --   -8%  -41%
# index 10.5/s    9%    --  -36%
# split 16.4/s   70%   56%    --

my $str = "\nFoo\n\nBar Quz\nBaz\nx" x 50000;
use constant TEST => 0;
my $expect = join "\0", split /\n/, $str;
$expect=~s/o/i/g;
#dd [split /\n/, $str], $expect;

dd $];
cmpthese(-2, {
    split => sub {
        my @lines;
        my @x = split /\n/, $str;
        #@x = map {$_} @x; # significant slowdown
        #for my $line (map {$_} split /\n/, $str) { # still fairly fas
+t
        for my $line (@x) {
            $line=~s/o/i/g;
            push @lines, $line;
        }
        if (TEST) { die pp(@lines) unless $expect eq join "\0", @lines
+ }
    },
    regex => sub {
        my @lines;
        pos($str)=0;
        #while ($str=~/^(.*)$/mgc) { # slower
        while ($str=~/\G(?|(.*?)\n|(.+)\z)/gc) {
            my $line = $1;
            $line=~s/o/i/g;
            push @lines, $line;
        }
        if (TEST) {
            die unless pos($str)==length($str);
            die pp(@lines) unless $expect eq join "\0", @lines;
        }
    },
    index => sub {
        my @lines;
        for ( my ($pos,$nextpos) = (0); $pos<length($str);
            $pos=$nextpos+1 ) {
                $nextpos = do { my $i=index($str,"\n",$pos);
                    $i<0?length($str):$i };
                my $line = substr $str, $pos, $nextpos-$pos;
                $line=~s/o/i/g;
                push @lines, $line
        }
        if (TEST) { die pp(@lines) unless $expect eq join "\0", @lines
+ }
    },
});
[download]

Comment on Is foreach split Optimized? (Update: No.) Select or Download Code

Replies are listed 'Best First'.
Re: Is foreach split Optimized? by Laurent_R (Canon) on Jul 09, 2017 at 11:56 UTC
Hi haukex, You might want to also try using a file handle opened on a reference to your string: `filehandle => sub { my @lines; open my $str_fh, "<", \$str or die "cannot open fh $!"; while (<$str_fh>) { chomp; s/o/i/g; push @lines, $_; } },` [download] This appears to be faster than your other subs, including the one with `split` (note that I had to change the count value for the result to be meaningful): `$ perl bench_split.pl s/iter regex index split filehandle regex 3.44 -- -9% -36% -48% index 3.13 10% -- -29% -43% split 2.21 56% 42% -- -19% filehandle 1.78 93% 76% 24% --` [download]	[reply] [d/l] [select]
Re^2: Is foreach split Optimized? by haukex (Archbishop) on Jul 09, 2017 at 12:18 UTC
I didn't think of that one, thanks! What version of Perl are you using? Unfortunately on my machine with v5.24.1, the filehandle method is still slightly slower than split: `5.024001 Rate regex index filehandle split regex 9.65/s -- -7% -34% -39% index 10.4/s 8% -- -29% -34% filehandle 14.6/s 51% 40% -- -7% split 15.7/s 63% 51% 8% --` [download] I haven't yet gotten around to experimenting with other versions of Perl.	[reply] [d/l]
Re^3: Is foreach split Optimized? by Laurent_R (Canon) on Jul 09, 2017 at 14:17 UTC
This was with version 5.14, using Cygwin. I will try with a more recent version under Windows. Update This very strange: running the exact same program on the same computer but under a pure Windows setting and `perl 5, version 24, subversion 1 (v5.24.1)` and from a bash console, I obtain awfully bad results for the file handle solution: `s/iter filehandle regex index split filehandle 100 -- -97% -97% -98% regex 3.50 2758% -- -14% -42% index 3.02 3217% 16% -- -33% split 2.02 4868% 74% 50% --` [download] I can only suspect that there might be something wrong with the management of Windows end-of-line pairs of characters. Still under Windows, same program, from a dos windows, with version v5.16.3: `s/iter regex index split filehandle regex 3.69 -- -2% -29% -39% index 3.63 2% -- -27% -38% split 2.63 40% 38% -- -15% filehandle 2.25 64% 61% 17% --` [download] I also tried (again on Cygwin with Perl version v5.14.4) this additional sub: `filehandle2 => sub { open my $str_fh, "<", \$str or die "cannot open fh $!"; my @lines = <$str_fh>; for (@lines) { chomp; s/o/i/g; ; } },` [download] but the result is not so good as the first filehandle solution (but still better than the others): `$ perl bench_split.pl s/iter regex index split filehandle2 fi +lehandle regex 3.44 -- -10% -35% -45% + -48% index 3.11 11% -- -28% -39% + -42% split 2.24 53% 39% -- -16% + -20% filehandle2 1.89 82% 65% 19% -- + -5% filehandle 1.80 92% 73% 25% 5% + --` [download] Finally, I also tried this solution with a `map`: `filehandle3 => sub { open my $str_fh, "<", \$str or die "cannot open fh $!"; my @lines = map { chomp; s/o/i/g } <$str_fh>; },` [download] and thought it might be faster, but this turns out to be slower than all the other solutions: `$ perl bench_split.pl s/iter filehandle3 regex index split fi +lehandle filehandle3 3.64 -- -7% -16% -39% + -51% regex 3.38 8% -- -9% -34% + -47% index 3.07 18% 10% -- -28% + -42% split 2.22 64% 52% 38% -- + -20% filehandle 1.77 105% 90% 73% 25% + --` [download]	[reply] [d/l] [select]
Re^3: Is foreach split Optimized? by stevieb (Canon) on Jul 09, 2017 at 15:34 UTC
Here's some output on many Perls :) perlbrew exec bench_script.pl perl-5.10.1 ========== Rate regex index split filehandle regex 3.95/s -- -7% -40% -53% index 4.27/s 8% -- -35% -50% split 6.54/s 66% 53% -- -23% filehandle 8.46/s 114% 98% 29% -- perl-5.12.5 ========== Rate index regex split filehandle index 3.96/s -- -10% -39% -49% regex 4.41/s 11% -- -32% -43% split 6.50/s 64% 47% -- -16% filehandle 7.69/s 94% 74% 18% -- perl-5.14.4 ========== Rate index regex split filehandle index 3.67/s -- -14% -47% -49% regex 4.25/s 16% -- -38% -41% split 6.86/s 87% 62% -- -5% filehandle 7.25/s 97% 71% 6% -- perl-5.16.3 ========== Rate index regex filehandle split index 3.23/s -- -19% -45% -50% regex 3.98/s 23% -- -32% -38% filehandle 5.83/s 81% 46% -- -9% split 6.44/s 100% 62% 10% -- perl-5.18.4 ========== Rate index regex split filehandle index 3.62/s -- -1% -45% -46% regex 3.65/s 1% -- -44% -46% split 6.57/s 82% 80% -- -3% filehandle 6.76/s 87% 85% 3% -- perl-5.20.3 ========== Rate index regex split filehandle index 3.50/s -- -10% -44% -45% regex 3.90/s 11% -- -38% -38% split 6.25/s 79% 60% -- -1% filehandle 6.31/s 80% 62% 1% -- perl-5.22.3 ========== Rate index regex split filehandle index 3.62/s -- -0% -38% -48% regex 3.64/s 0% -- -38% -47% split 5.88/s 63% 62% -- -15% filehandle 6.90/s 91% 90% 17% -- perl-5.24.1 ========== Rate regex index filehandle split regex 2.97/s -- -21% -42% -59% index 3.77/s 27% -- -26% -47% filehandle 5.08/s 71% 35% -- -29% split 7.18/s 142% 90% 41% -- perl-5.26.0 ========== Rate index regex split filehandle index 3.00/s -- -25% -49% -53% regex 3.98/s 33% -- -33% -37% split 5.91/s 97% 49% -- -6% filehandle 6.31/s 111% 59% 7% -- [download]	[reply] [d/l]
Re^4: Is foreach split Optimized? by haukex (Archbishop) on Jul 11, 2017 at 10:45 UTC
Re: Is foreach split Optimized? by ikegami (Patriarch) on Jul 09, 2017 at 17:44 UTC
I also found several references to @x = split ... being optimized That's just a normal call to `split`. It's actually the following that's optimized: `($x, $y, $z) = split ...;` [download] It only finds and returns the first three items instead of all of them. Obviously, that optimization can't be used for `@a = split ...;` [download] This post lists the `for` optimizations.	[reply] [d/l] [select]
Re^2: Is foreach split Optimized? by haukex (Archbishop) on Jul 11, 2017 at 10:37 UTC
Thank you for the link! Taken together with the AM's post here, it does seem that I was just surprised by the speed of split, causing me to initially read too much into the commit messages regarding `@x = split`, like 692044df and 5012eebe.	[reply] [d/l]
Re: Is foreach split Optimized? by Marshall (Canon) on Jul 09, 2017 at 11:32 UTC
I am curious about your benchmark program. With `my $str = "\nFoo\n\nBar Quz\nBaz\nx" x 500000;` I get: `5.020002 (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) s/iter index regex split index 26.5 -- -61% -66% regex 10.3 158% -- -12% split 9.05 193% 13% --` [download] which basically means that the results don't matter. Code as with `my $str = "\nFoo\n\nBar Quz\nBaz\nx" x 50000;` Your benchmark default. `5.020002 (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) Rate index regex split index 1.16/s -- -21% -34% regex 1.47/s 27% -- -16% split 1.74/s 51% 19% --` [download] I am not sure what this means if anything.	[reply] [d/l] [select]
Re^2: Is foreach split Optimized? by haukex (Archbishop) on Jul 09, 2017 at 11:44 UTC
(warning: too few iterations for a reliable count) You can get rid of those warnings by increasing the runtime by saying e.g. `cmpthese(-5, ...`. I did fiddle with it quite a bit, adjusting the string length, runtime, etc. and the results were always consistent that split was fastest by a significant margin.	[reply] [d/l]
Re^3: Is foreach split Optimized? by Marshall (Canon) on Jul 09, 2017 at 11:57 UTC
Interesting: `5.020002 Rate index regex split split2 index 1.17/s -- -23% -35% -43% regex 1.51/s 29% -- -16% -26% split 1.79/s 53% 18% -- -12% split2 2.03/s 74% 35% 14% --` [download] split2 uses tr/// instead of regex s///... Read more... (3 kB)	[reply] [d/l] [select]
Re^4: Is foreach split Optimized? by haukex (Archbishop) on Jul 09, 2017 at 12:27 UTC
Re^5: Is foreach split Optimized? by Marshall (Canon) on Jul 09, 2017 at 13:02 UTC
Re: Is foreach split Optimized? by Anonymous Monk on Jul 09, 2017 at 20:05 UTC
The question seems to be, "does perl split the entire string before beginning the for loop?" We can answer that by looking at how much memory it is using. `use strict; use warnings; my $n = 10_000_000; system "ps -orss $$"; my $x = 'foo' x $n; system "ps -orss $$"; for my $y (split /f/, $x) { system "ps -orss $$"; last; }` [download] The answer seems to be, "yes." `RSS 3712 RSS 33028 RSS 674900` [download]	[reply] [d/l] [select]
Re^2: Is foreach split Optimized? by haukex (Archbishop) on Jul 11, 2017 at 10:28 UTC
Thank you, this is probably the best way of looking at it! I took your idea and expanded on it, below. It seems that the answer to my original question is No, there isn't any magic going on with `for (split ...)`, it's just that while it may be a memory hog, split is still pretty fast - that's what threw me off initially. #!/usr/bin/env perl use warnings; use 5.014; my $cnt = 5000000; system($^X,'-sE',<<'_END_','--',"-cnt=$cnt"); say `ps -orss $$`=~s/\s+/ /gr; # RSS 4340 $x = "abc\n" x $cnt; say `ps -orss $$`=~s/\s+/ /gr; # RSS 23936 @y = ("abc") x $cnt; say `ps -orss $$`=~s/\s+/ /gr; # RSS 455604 _END_ say "---"; system($^X,'-sE',<<'_END_','--',"-cnt=$cnt"); say `ps -orss $$`=~s/\s+/ /gr; # RSS 4544 $x = "abc\n" x $cnt; say `ps -orss $$`=~s/\s+/ /gr; # RSS 23972 @y = split /\n/, $x; say `ps -orss $$`=~s/\s+/ /gr; # RSS 423544 _END_ say "---"; system($^X,'-sE',<<'_END_','--',"-cnt=$cnt"); say `ps -orss $$`=~s/\s+/ /gr; # RSS 4540 $x = "abc\n" x $cnt; say `ps -orss $$`=~s/\s+/ /gr; # RSS 23964 for (split /\n/, $x) { say `ps -orss $$`=~s/\s+/ /gr; # RSS 457512 last } _END_ say "---"; system($^X,'-sE',<<'_END_','--',"-cnt=$cnt"); say `ps -orss $$`=~s/\s+/ /gr; # RSS 4336 $x = "abc\n" x $cnt; say `ps -orss $$`=~s/\s+/ /gr; # RSS 23932 open my $fh, '<', \$x or die $!; while (<$fh>) { say `ps -orss $$`=~s/\s+/ /gr; # RSS 24000 last } _END_ [download]	[reply] [d/l] [select]
A reply falls below the community's threshold of quality. You may see it by logging in.