rdj has asked for the wisdom of the Perl Monks concerning the following question:

My day job involved a lot of Java and C# for a long time. One of the things drilled into me was to avoid simple string concatenation (+=) inside a loop -- always use some kind of stream or buffer. So I went looking for a fast string builder on CPAN, but I didn't find much. IO::String led me to the fact that I can pass a scalar ref instead of a filename to open. That sounded a lot like a familiar MemoryStream. :-) I ran some quick benchmarks, but lo and behold, the .= operator appears to be magically quick.
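For reference, the scalar-ref form of open (see perldoc -f open) looks like this -- the variable names here are just illustrative:

# An in-memory filehandle: writes land in $buffer instead of a file.
open my $fh, '>', \my $buffer or die "open: $!";
print {$fh} "hello, ";
print {$fh} "world";
close $fh;
print $buffer;    # prints "hello, world"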

Google and friends have failed me in finding any discussion of this (to me) surprising speed. There's plenty of discussion of concatenation versus interpolation, but not much about potentially faster alternatives to concatenation. I guess that's because it's already so fast!

I've included my test at the bottom. If my methodology is incorrect, please let me know. When the string to be concatenated was a visible constant, I think the compiler was having a little fun with me. Once I wrapped it in a function call, I came up with these results:

                 Rate
filehandle_OO   484/s
pushing         531/s
filehandle      785/s
concatenation   929/s

Given the emphasis on stream-based processing in the languages of my prior experience, my intuition trembles to discover that .= outperforms the filehandle (and, of course, destroys the OO filehandle).

What's going on? Why is string concatenation the fastest thing I've found? Is it somehow avoiding the making-repeated-copies-of-the-same-data trap that Java and C#'s String += operators have?

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark ':all';
use English '-no_match_vars';
use IO::Handle;

my $long_string = '.' x 200;

# Wrapped in a sub so the compiler can't constant-fold the string away.
sub get_long_string { return $long_string; }

sub concatenation {
    my $output = q{};
    $output .= get_long_string() for ( 1 .. 1000 );
    return $output;
}

sub filehandle {
    my $output = q{};
    open my $fh, '>', \$output or die "Huh?! $OS_ERROR";
    print $fh get_long_string() for ( 1 .. 1000 );
    close $fh;
    return $output;
}

sub filehandle_OO {
    my $output = q{};
    open my $fh, '>', \$output or die "Huh?! $OS_ERROR";
    $fh->print( get_long_string() ) for ( 1 .. 1000 );
    $fh->close();
    return $output;
}

sub pushing {
    my @output;
    push @output, get_long_string() for ( 1 .. 1000 );
    return join( '', @output );
}

cmpthese(
    10_000,
    {
        'concatenation' => \&concatenation,
        'filehandle'    => \&filehandle,
        'filehandle_OO' => \&filehandle_OO,
        'pushing'       => \&pushing,
    }
);

Re: Is there something faster than string concatenation?
by Joost (Canon) on Dec 03, 2007 at 02:03 UTC
    The main reason Java string concatenation is slow is that Java strings are immutable: every concatenation operation creates a brand-new string and copies the text of both input strings into it.

    Perl's strings aren't immutable so if you do:

    $some_long_string .= $some_shorter_string;
    only the contents of $some_shorter_string need to be copied.
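    One way to actually watch this is Devel::Peek (a core module). A growing Perl string carries an allocated buffer (LEN) that is usually larger than the string itself (CUR), so most appends are a plain copy into spare room. A minimal sketch -- the dump format is an implementation detail, so treat the exact numbers as illustrative:

    use Devel::Peek 'Dump';

    my $s = '';
    $s .= 'x' x 100 for 1 .. 3;

    # Dump prints the scalar's internals to STDERR; compare CUR
    # (current length, 300 here) with LEN (bytes actually allocated).
    Dump($s);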

    String filehandles have all kinds of overhead; they're more of a convenience -- I'm surprised the non-OO version is that quick. Object-oriented method calls are also slow in Perl relative to plain function calls.

    The join version is pretty slow though. I wouldn't have expected that.

      Ok, that much makes sense. But I'd expect to run into a contiguous-memory problem. I mean, if this were C, I could realloc the buffer holding the first string; with luck it wouldn't have to move, and then I could just copy the second string in after it.

      Over the course of doing this, every once in a while your realloc is going to need to move/copy. It seems like that should bite you over the long run.

      But perhaps the open-a-scalar-ref trick basically just fronts for string concatenation? The usual approach for a memory-based buffer in other languages is to grow the buffer larger than the current write needs, so that future writes won't risk moving it.

      Maybe my test is a little too perfect: nothing else is competing for memory, so the string grows without ever bumping into anything. Of course, the only time this matters is a tight loop iterating over a large dataset, where I shouldn't be doing extraneous allocations anyway.
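      If realloc moves ever did start to hurt, one workaround would be to preallocate the scalar's buffer up front. This sketch leans on the long-standing (but undocumented, so consider it an assumption) behavior that assigning '' to a scalar keeps its existing allocation:

      # Force one big allocation (~200 bytes x 1000 appends), then empty
      # the string; Perl keeps the buffer around, so later appends
      # shouldn't need to move it.
      my $output = "\0" x 200_000;
      $output = '';
      $output .= get_long_string() for 1 .. 1000;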

        Over the course of doing this, every once in a while your realloc is going to need to move/copy. It seems like that should bite you over the long run.
        Probably. If that happens, the array push/join mechanism may be the fastest. I don't know exactly how string filehandles work, but I'd bet they do indeed just front for concatenation.

        As you can see, though, string concat is usually the fastest solution. I wouldn't worry about it too much unless you're dealing with *really* huge strings, and by that point you may want to move to on-disk files anyway, since you're probably taking up a fairly significant chunk of system memory.

        The size of the string buffer is doubled whenever it needs to grow. So even if the contents have to be copied at every resize, the total work averages out to roughly one extra copy of the final string (see the sketch below). It isn't a big penalty.

        Update: What is true of arrays need not be true of strings, it appears. Thanks, ikegami.
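        For the curious, the doubling claim is easy to sanity-check with back-of-envelope arithmetic. This sketch just tallies what a doubling allocator would copy while growing to the benchmark's final size (per the update above, Perl's real string growth policy may differ):

        my $final = 200 * 1000;    # final string size in the benchmark
        my ( $cap, $copied ) = ( 1, 0 );
        while ( $cap < $final ) {
            $copied += $cap;       # each grow copies the current contents
            $cap    *= 2;
        }
        # Total copying is bounded by the final capacity -- i.e. roughly
        # one extra copy of the final string, never proportional to the
        # number of appends.
        printf "final=%d copied=%d ratio=%.2f\n",
            $final, $copied, $copied / $final;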

        - tye        

Re: Is there something faster than string concatenation?
by hossman (Prior) on Dec 10, 2007 at 02:45 UTC
    If it helps to wrap your mind around it, think of Perl "strings" as Java StringBuffers. For this Perl code...
    my $output = ""; for (1 .. 1000) { $output .= get_long_string(); } print $output;
    ...the analogous Java code is...

    StringBuffer output = new StringBuffer();
    for (int i = 1; i <= 1000; i++) {
        output.append(get_long_string());
    }
    System.out.print(output.toString());
    I would not be surprised, however, if this Perl code behaved as inefficiently as you might expect it to...
    my $output = ""; for (1 .. 1000) { $output = "" . get_long_string() . $output; } print $output;
    (but I haven't tested it ... Perl might optimize it)
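    One way to test it (a sketch; the counts are arbitrary) is a Benchmark run pitting append against prepend -- prepend has to copy the ever-growing $output on every pass:

    use strict;
    use warnings;
    use Benchmark 'cmpthese';

    my $chunk = '.' x 200;

    cmpthese( -2, {    # run each variant for ~2 CPU seconds
        append  => sub {
            my $out = '';
            $out .= $chunk for 1 .. 1000;               # extends in place
            return $out;
        },
        prepend => sub {
            my $out = '';
            $out = '' . $chunk . $out for 1 .. 1000;    # rebuilds $out each pass
            return $out;
        },
    } );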