I think I've found an interesting optimization, but even after some digging here on PerlMonks, the Perl docs, and looking through the Camel I haven't yet found it documented*. Of course, it's also possible that I made a mistake in my benchmark, or that this optimization is common knowledge, in which case I would be happy to be enlightened :-)
<update> * Because it isn't there :-) It appears that the answer is that while split is quite fast, it still splits the string into an array before iterating over that array (that's what the memory consumption seems to show). The filehandle method proposed by Laurent_R seems to be the best way to go about my task instead, assuming what you're splitting on is a fixed string. See all the replies below for more details. </update>
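For reference, the filehandle method works by opening an in-memory filehandle on a reference to the string (supported since Perl 5.8) and reading it line by line, so no list of all the lines is ever built. A minimal sketch; `$string` here is just a small stand-in for the real multiline data:

```perl
use warnings;
use strict;

my $string = "Foo\nBar Quz\nBaz";

# Open a read-only filehandle on a reference to the string;
# <$fh> then returns one line at a time, split on $/ ("\n").
open my $fh, '<', \$string or die "open: $!";
while ( my $line = <$fh> ) {
    chomp $line;              # strip the trailing newline, if any
    print "got: $line\n";
}
close $fh;
```

Note that this only splits on the fixed input record separator `$/`, not on an arbitrary regex, which is why it's limited to the fixed-string case mentioned above.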
I had a multiline string and wanted to iterate through the lines, and became curious what the fastest way to do that was. I was pleasantly surprised that in my benchmark, on v5.24.1, for (split /\n/, $string) was fastest, despite my original worry that it might split the string into a long list before iterating over it.
I know that foreach (1..1000000) has been optimized to use an iterator instead of building a huge list since 5.005, and I found some discussion of the optimization of split // in this thread.
I also found several references to @x = split ... being optimized, which I think might be the reason that for (split ...) is so fast. I don't know how long this optimization has been present, but I found, for example, commit e4e95921cd0fd0, which seems to indicate it's been present since Perl 3. If anyone knows more and wants to set the record straight, please do so! :-)
use warnings;
use strict;
use Data::Dump qw/dd pp/;
use Benchmark qw/cmpthese/;

# example output:
# 5.024001
#          Rate regex index split
# regex  9.64/s    --    -8%  -41%
# index  10.5/s    9%    --   -36%
# split  16.4/s   70%   56%    --

my $str = "\nFoo\n\nBar Quz\nBaz\nx" x 50000;
use constant TEST => 0;
my $expect = join "\0", split /\n/, $str;
$expect =~ s/o/i/g;
#dd [split /\n/, $str], $expect;
dd $];

cmpthese(-2, {
    split => sub {
        my @lines;
        my @x = split /\n/, $str;
        #@x = map {$_} @x; # significant slowdown
        #for my $line (map {$_} split /\n/, $str) { # still fairly fast
        for my $line (@x) {
            $line =~ s/o/i/g;
            push @lines, $line;
        }
        if (TEST) { die pp(@lines) unless $expect eq join "\0", @lines }
    },
    regex => sub {
        my @lines;
        pos($str) = 0;
        #while ($str=~/^(.*)$/mgc) { # slower
        while ($str =~ /\G(?|(.*?)\n|(.+)\z)/gc) {
            my $line = $1;
            $line =~ s/o/i/g;
            push @lines, $line;
        }
        if (TEST) {
            die unless pos($str) == length($str);
            die pp(@lines) unless $expect eq join "\0", @lines;
        }
    },
    index => sub {
        my @lines;
        for ( my ($pos, $nextpos) = (0);
              $pos < length($str);
              $pos = $nextpos + 1 )
        {
            $nextpos = do {
                my $i = index($str, "\n", $pos);
                $i < 0 ? length($str) : $i;
            };
            my $line = substr $str, $pos, $nextpos - $pos;
            $line =~ s/o/i/g;
            push @lines, $line;
        }
        if (TEST) { die pp(@lines) unless $expect eq join "\0", @lines }
    },
});
In reply to Is foreach split Optimized? (Update: No.) by haukex