Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Is foreach split Optimized? (Update: No.)

by haukex (Archbishop)
on Jul 09, 2017 at 10:36 UTC ( [id://1194593]=perlquestion: print w/replies, xml ) Need Help??

haukex has asked for the wisdom of the Perl Monks concerning the following question:

I think I've found an interesting optimization, but even after some digging here on PerlMonks, the Perl docs, and looking through the Camel I haven't yet found it documented*. Of course, it's also possible that I made a mistake in my benchmark, or that this optimization is common knowledge, in which case I would be happy to be enlightened :-)

<update> * Because it isn't there :-) It appears that the answer is that while split is quite fast, it still splits the string into an array before iterating over that array (that's what the memory consumption seems to show). The filehandle method proposed by Laurent_R seems to be the best way to go about my task instead, assuming what you're splitting on is a fixed string. See the all the replies below for more details. </update>

I had a multiline string and wanted to iterate through the lines, and became curious what the fastest way to do that was. I was pleasantly surprised that in my benchmark, on v5.24.1, for (split /\n/, $string) was fastest, despite my original worry that it might split the string into a long list before iterating over it.

I know that foreach (1..1000000) has been optimized to use an iterator instead of building a huge list since 5.005, and I found some discussion of the optimization of split // in this thread.

I also found several references to @x = split ... being optimized, which I think might be the reason that for (split ...) is so fast. I don't know how long this optimization has been present, but I found, for example, commit e4e95921cd0fd0, which seems to indicate it's been present since Perl 3. If anyone knows more and wants to set the record straight, please do so! :-)

use warnings; use strict; use Data::Dump qw/dd pp/; use Benchmark qw/cmpthese/; # example output: # 5.024001 # Rate regex index split # regex 9.64/s -- -8% -41% # index 10.5/s 9% -- -36% # split 16.4/s 70% 56% -- my $str = "\nFoo\n\nBar Quz\nBaz\nx" x 50000; use constant TEST => 0; my $expect = join "\0", split /\n/, $str; $expect=~s/o/i/g; #dd [split /\n/, $str], $expect; dd $]; cmpthese(-2, { split => sub { my @lines; my @x = split /\n/, $str; #@x = map {$_} @x; # significant slowdown #for my $line (map {$_} split /\n/, $str) { # still fairly fas +t for my $line (@x) { $line=~s/o/i/g; push @lines, $line; } if (TEST) { die pp(@lines) unless $expect eq join "\0", @lines + } }, regex => sub { my @lines; pos($str)=0; #while ($str=~/^(.*)$/mgc) { # slower while ($str=~/\G(?|(.*?)\n|(.+)\z)/gc) { my $line = $1; $line=~s/o/i/g; push @lines, $line; } if (TEST) { die unless pos($str)==length($str); die pp(@lines) unless $expect eq join "\0", @lines; } }, index => sub { my @lines; for ( my ($pos,$nextpos) = (0); $pos<length($str); $pos=$nextpos+1 ) { $nextpos = do { my $i=index($str,"\n",$pos); $i<0?length($str):$i }; my $line = substr $str, $pos, $nextpos-$pos; $line=~s/o/i/g; push @lines, $line } if (TEST) { die pp(@lines) unless $expect eq join "\0", @lines + } }, });

Replies are listed 'Best First'.
Re: Is foreach split Optimized?
by Laurent_R (Canon) on Jul 09, 2017 at 11:56 UTC
    Hi haukex,

    You might want to also try using a file handle opened on a reference to your string:

    filehandle => sub { my @lines; open my $str_fh, "<", \$str or die "cannot open fh $!"; while (<$str_fh>) { chomp; s/o/i/g; push @lines, $_; } },
    This appears to be faster than your other subs, including the one with split (note that I had to change the count value for the result to be meaningful):
    $ perl bench_split.pl s/iter regex index split filehandle regex 3.44 -- -9% -36% -48% index 3.13 10% -- -29% -43% split 2.21 56% 42% -- -19% filehandle 1.78 93% 76% 24% --

      I didn't think of that one, thanks! What version of Perl are you using? Unfortunately on my machine with v5.24.1, the filehandle method is still slightly slower than split:

      5.024001 Rate regex index filehandle split regex 9.65/s -- -7% -34% -39% index 10.4/s 8% -- -29% -34% filehandle 14.6/s 51% 40% -- -7% split 15.7/s 63% 51% 8% --

      I haven't yet gotten around to experimenting with other versions of Perl.

        This was with version 5.14, using Cygwin. I will try with a more recent version under Windows.

        Update This very strange: running the exact same program on the same computer but under a pure Windows setting and perl 5, version 24, subversion 1 (v5.24.1) and from a bash console, I obtain awfully bad results for the file handle solution:

        s/iter filehandle regex index split filehandle 100 -- -97% -97% -98% regex 3.50 2758% -- -14% -42% index 3.02 3217% 16% -- -33% split 2.02 4868% 74% 50% --
        I can only suspect that there might be something wrong with the management of Windows end-of-line pairs of characters.

        Still under Windows, same program, from a dos windows, with version v5.16.3:

        s/iter regex index split filehandle regex 3.69 -- -2% -29% -39% index 3.63 2% -- -27% -38% split 2.63 40% 38% -- -15% filehandle 2.25 64% 61% 17% --
        I also tried (again on Cygwin with Perl version v5.14.4) this additional sub:
        filehandle2 => sub { open my $str_fh, "<", \$str or die "cannot open fh $!"; my @lines = <$str_fh>; for (@lines) { chomp; s/o/i/g; ; } },
        but the result is not so good as the first filehandle solution (but still better than the others):
        $ perl bench_split.pl s/iter regex index split filehandle2 fi +lehandle regex 3.44 -- -10% -35% -45% + -48% index 3.11 11% -- -28% -39% + -42% split 2.24 53% 39% -- -16% + -20% filehandle2 1.89 82% 65% 19% -- + -5% filehandle 1.80 92% 73% 25% 5% + --
        Finally, I also tried this solution with a map:
        filehandle3 => sub { open my $str_fh, "<", \$str or die "cannot open fh $!"; my @lines = map { chomp; s/o/i/g } <$str_fh>; },
        and thought it might be faster, but this turns out to be slower than all the other solutions:
        $ perl bench_split.pl s/iter filehandle3 regex index split fi +lehandle filehandle3 3.64 -- -7% -16% -39% + -51% regex 3.38 8% -- -9% -34% + -47% index 3.07 18% 10% -- -28% + -42% split 2.22 64% 52% 38% -- + -20% filehandle 1.77 105% 90% 73% 25% + --

        Here's some output on many Perls :)

        perlbrew exec bench_script.pl perl-5.10.1 ========== Rate regex index split filehandle regex 3.95/s -- -7% -40% -53% index 4.27/s 8% -- -35% -50% split 6.54/s 66% 53% -- -23% filehandle 8.46/s 114% 98% 29% -- perl-5.12.5 ========== Rate index regex split filehandle index 3.96/s -- -10% -39% -49% regex 4.41/s 11% -- -32% -43% split 6.50/s 64% 47% -- -16% filehandle 7.69/s 94% 74% 18% -- perl-5.14.4 ========== Rate index regex split filehandle index 3.67/s -- -14% -47% -49% regex 4.25/s 16% -- -38% -41% split 6.86/s 87% 62% -- -5% filehandle 7.25/s 97% 71% 6% -- perl-5.16.3 ========== Rate index regex filehandle split index 3.23/s -- -19% -45% -50% regex 3.98/s 23% -- -32% -38% filehandle 5.83/s 81% 46% -- -9% split 6.44/s 100% 62% 10% -- perl-5.18.4 ========== Rate index regex split filehandle index 3.62/s -- -1% -45% -46% regex 3.65/s 1% -- -44% -46% split 6.57/s 82% 80% -- -3% filehandle 6.76/s 87% 85% 3% -- perl-5.20.3 ========== Rate index regex split filehandle index 3.50/s -- -10% -44% -45% regex 3.90/s 11% -- -38% -38% split 6.25/s 79% 60% -- -1% filehandle 6.31/s 80% 62% 1% -- perl-5.22.3 ========== Rate index regex split filehandle index 3.62/s -- -0% -38% -48% regex 3.64/s 0% -- -38% -47% split 5.88/s 63% 62% -- -15% filehandle 6.90/s 91% 90% 17% -- perl-5.24.1 ========== Rate regex index filehandle split regex 2.97/s -- -21% -42% -59% index 3.77/s 27% -- -26% -47% filehandle 5.08/s 71% 35% -- -29% split 7.18/s 142% 90% 41% -- perl-5.26.0 ========== Rate index regex split filehandle index 3.00/s -- -25% -49% -53% regex 3.98/s 33% -- -33% -37% split 5.91/s 97% 49% -- -6% filehandle 6.31/s 111% 59% 7% --
Re: Is foreach split Optimized?
by ikegami (Patriarch) on Jul 09, 2017 at 17:44 UTC

    I also found several references to @x = split ... being optimized

    That's just a normal call to `split`. It's actually the following that's optimized:

    ($x, $y, $z) = split ...;

    It only finds and returns the first three items instead of all of them. Obviously, that optimization can't be used for

    @a = split ...;

    This post lists the for optimizations.

      Thank you for the link! Taken together with the AM's post here, it does seem that I was just surprised by the speed of split, causing me to initially read too much into the commit messages regarding @x = split, like 692044df and 5012eebe.

Re: Is foreach split Optimized?
by Marshall (Canon) on Jul 09, 2017 at 11:32 UTC
    I am curious about your benchmark program.
    With my $str = "\nFoo\n\nBar Quz\nBaz\nx" x 500000;
    I get:
    5.020002 (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) s/iter index regex split index 26.5 -- -61% -66% regex 10.3 158% -- -12% split 9.05 193% 13% --
    which basically means that the results don't matter.

    Code as with my $str = "\nFoo\n\nBar Quz\nBaz\nx" x 50000;
    Your benchmark default.

    5.020002 (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) Rate index regex split index 1.16/s -- -21% -34% regex 1.47/s 27% -- -16% split 1.74/s 51% 19% --
    I am not sure what this means if anything.
      (warning: too few iterations for a reliable count)

      You can get rid of those warnings by increasing the runtime by saying e.g. cmpthese(-5, .... I did fiddle with it quite a bit, adjusting the string length, runtime, etc. and the results were always consistent that split was fastest by a significant margin.

        Interesting:
        5.020002 Rate index regex split split2 index 1.17/s -- -23% -35% -43% regex 1.51/s 29% -- -16% -26% split 1.79/s 53% 18% -- -12% split2 2.03/s 74% 35% 14% --
        split2 uses tr/// instead of regex s///...
Re: Is foreach split Optimized?
by Anonymous Monk on Jul 09, 2017 at 20:05 UTC
    The question seems to be, "does perl split the entire string before beginning the for loop?" We can answer that by looking at how much memory it is using.
    use strict; use warnings; my $n = 10_000_000; system "ps -orss $$"; my $x = 'foo' x $n; system "ps -orss $$"; for my $y (split /f/, $x) { system "ps -orss $$"; last; }
    The answer seems to be, "yes."
    RSS 3712 RSS 33028 RSS 674900

      Thank you, this is probably the best way of looking at it! I took your idea and expanded on it, below. It seems that the answer to my original question is No, there isn't any magic going on with for (split ...), it's just that while it may be a memory hog, split is still pretty fast - that's what threw me off initially.

      #!/usr/bin/env perl use warnings; use 5.014; my $cnt = 5000000; system($^X,'-sE',<<'_END_','--',"-cnt=$cnt"); say `ps -orss $$`=~s/\s+/ /gr; # RSS 4340 $x = "abc\n" x $cnt; say `ps -orss $$`=~s/\s+/ /gr; # RSS 23936 @y = ("abc") x $cnt; say `ps -orss $$`=~s/\s+/ /gr; # RSS 455604 _END_ say "---"; system($^X,'-sE',<<'_END_','--',"-cnt=$cnt"); say `ps -orss $$`=~s/\s+/ /gr; # RSS 4544 $x = "abc\n" x $cnt; say `ps -orss $$`=~s/\s+/ /gr; # RSS 23972 @y = split /\n/, $x; say `ps -orss $$`=~s/\s+/ /gr; # RSS 423544 _END_ say "---"; system($^X,'-sE',<<'_END_','--',"-cnt=$cnt"); say `ps -orss $$`=~s/\s+/ /gr; # RSS 4540 $x = "abc\n" x $cnt; say `ps -orss $$`=~s/\s+/ /gr; # RSS 23964 for (split /\n/, $x) { say `ps -orss $$`=~s/\s+/ /gr; # RSS 457512 last } _END_ say "---"; system($^X,'-sE',<<'_END_','--',"-cnt=$cnt"); say `ps -orss $$`=~s/\s+/ /gr; # RSS 4336 $x = "abc\n" x $cnt; say `ps -orss $$`=~s/\s+/ /gr; # RSS 23932 open my $fh, '<', \$x or die $!; while (<$fh>) { say `ps -orss $$`=~s/\s+/ /gr; # RSS 24000 last } _END_
A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1194593]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others imbibing at the Monastery: (4)
As of 2024-04-16 11:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found