Re: Performance problems on splitting long strings

The following is the post I prepared yesterday evening (after 2 a.m. this morning in fact), but I guess I was too tired: I previewed it, but stupidly forgot to hit the create button.

OK, I have now run a benchmark of the various possibilities. The results might be useful to others.

I tried 8 different solutions: a "C_style" solution (converting the string into an array of individual characters), two regexes (one with /\w{5}/ and one with /.{5}/), one that opens a filehandle on a reference to the string, two variations on a loop with the substr function, the split solution offered by Kenosis (I probably cannot use it with the old version of Perl that we have on our servers, but I could test it at home on my more recent version), and unpack.

The following is the code (borrowed in part from Kenosis):

use strict;
use warnings;
use Benchmark qw/cmpthese/;

my $string = (join '', 'a'..'z') x 10;

my $unpack = sub {
    my @sub_fields = unpack '(A5)*', $string;
};

my $regex1 = sub {
    my @sub_fields = $string =~ /\w{5}/g;
};

my $regex2 = sub {
    my @sub_fields = $string =~ /.{5}/g;
};

my $split = sub { # suggested by Kenosis
    my @sub_fields = split /.{5}\K/, $string;
};

my $substr1 = sub {
    my @sub_fields;
    for ( my $i = 0 ; $i < length $string ; $i += 5 ) {
        push @sub_fields, substr $string, $i, 5;
    }
};

my $substr2 = sub {
    my @sub_fields;
    my $max = (length $string)/5 -1;
    push @sub_fields, substr $string, $_*5, 5  for (0..$max);
};

my $filehandle = sub {
    my (@sub_fields, $var);
    open my $FH, "<", \$string or die "cannot open in-memory filehandle: $!";
    push @sub_fields, $var while read $FH, $var, 5;
};

my $c_style_string = sub { # the idea suggested by boftx
    my @sub_fields;
    my @chars = split //, $string;
    while (@chars) { push @sub_fields, join '', splice (@chars, 0, 5) };
};

cmpthese( -1,
    {
        regex1  => sub { $regex1->() },
        regex2  => sub { $regex2->() },
        unpack  => sub { $unpack->() },
        split   => sub { $split->() },
        substr1 => sub { $substr1->() },
        substr2 => sub { $substr2->() },
        FH      => sub { $filehandle->() },
        C_Style => sub { $c_style_string->() },
    }
);

And these are the results:

           Rate C_Style  regex1  regex2      FH   split substr1 substr2  unpack
C_Style  5598/s      --    -72%    -72%    -78%    -79%    -80%    -80%    -83%
regex1  19690/s    252%      --     -1%    -24%    -28%    -28%    -31%    -39%
regex2  19968/s    257%      1%      --    -23%    -26%    -27%    -30%    -38%
FH      25984/s    364%     32%     30%      --     -4%     -5%     -9%    -20%
split   27161/s    385%     38%     36%      5%      --     -1%     -5%    -16%
substr1 27355/s    389%     39%     37%      5%      1%      --     -4%    -16%
substr2 28523/s    409%     45%     43%     10%      5%      4%      --    -12%
unpack  32428/s    479%     65%     62%     25%     19%     19%     14%      --

So unpack clearly wins the race, but I was surprised to see that substr is not that far behind.
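As a side note, the candidates are only interchangeable when the string length is a multiple of 5 and no field ends in whitespace, which holds for the test string above. The following is a minimal sketch of how one might check this before trusting the timings; the variable names and the second sample string are made up for illustration and are not part of the benchmark.

```perl
use strict;
use warnings;

# Same test data as the benchmark: 26 letters x 10 = 260 characters.
my $string = ( join '', 'a' .. 'z' ) x 10;

my @by_unpack = unpack '(A5)*', $string;
my @by_regex  = $string =~ /.{5}/g;
my @by_substr = map { substr $string, $_ * 5, 5 } 0 .. length($string) / 5 - 1;

# On this input all three produce the same 52 chunks.
print scalar(@by_unpack), " chunks\n";
print "identical\n"
    if "@by_unpack" eq "@by_regex" and "@by_unpack" eq "@by_substr";

# Hypothetical awkward input: 12 characters, first field ends in spaces.
my $awkward = 'abc  defghij';

my @upper_a = unpack '(A5)*', $awkward;   # 'A' strips trailing spaces
my @lower_a = unpack '(a5)*', $awkward;   # 'a' keeps the data verbatim
my @regex   = $awkward =~ /.{5}/g;        # drops the incomplete last chunk

print join( '|', @upper_a ), "\n";
print join( '|', @lower_a ), "\n";
print join( '|', @regex ),   "\n";
```

If the real data can contain trailing spaces that matter, the 'a' template is probably the safer choice than 'A'.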

Update this evening (Jan 31, 2014 at 18:45): I incorporated the unpack solution in my program at work today, and the speed gain I obtained on my real data is significantly better than what the benchmark figures above would suggest. To my surprise, the profiling shows that the modified line of code runs almost twice as fast as the original one.


Re^2: Performance problems on splitting long strings by Jim (Curate) on Feb 01, 2014 at 08:27 UTC
You can simplify the calls to the anonymous subroutines.

    cmpthese( -1, {
        regex1  => $regex1,
        regex2  => $regex2,
        unpack  => $unpack,
        split   => $split,
        substr1 => $substr1,
        substr2 => $substr2,
        FH      => $filehandle,
        C_Style => $c_style_string,
    } );

IMHO, the superior performance of `unpack()` is perfectly predictable. This is what it's for.

Jim
Re^3: Performance problems on splitting long strings by Laurent_R (Canon) on Feb 01, 2014 at 11:19 UTC
"You can simplify the calls to the anonymous subroutines. ..."

Thank you for your comment, Jim. And, yes, I wanted to write something like that, and that is the reason why I built references to anonymous subs in the first place, rather than simple named functions. But for some reason, I got something wrong in the syntax for calling the subs in `cmpthese`. I am not sure I remember exactly, but I think I first did something like:

    regex1 => $regex1->(),

which gave compile errors. The first quick way I found to make it work was to wrap the function call in a sub block like this:

    regex1 => sub {$regex1->()},

I realize that this is not the most elegant construct, but once it worked, I was happy enough to get my results, and at around 2 a.m. I was too tired to spend more time investigating how to simplify the calls.

And yes, I was sort of expecting `unpack()` to be faster, but it is still better to try it to be sure.
Re^4: Performance problems on splitting long strings by AnomalousMonk (Archbishop) on Feb 01, 2014 at 19:55 UTC
"... I think I first did something like: `regex1 => $regex1->(),` which gave compile errors."

You may know this already, but `$regex1->()` is a function invocation. The `{ ... }` anonymous hash reference constructor evaluates it in list context and tries to treat the first item returned by the call as the value to be paired with, in this case, the key `'regex1'`; any further items returned are then taken as more keys and values. If the grand total of all such keys and returned items is odd, Perl (with warnings enabled) emits an "Odd number of elements in anonymous hash ..." warning and pairs the last key with undef. If it is even, the hash constructed will be a meaningless mish-mosh unless the referenced functions are designed to return valid hash elements, which in this case they are not.

The expression `sub {$regex1->()}` produces a single code reference, which pairs as a value quite happily with any key string. It is redundant in that it simply wraps the invocation of another code reference, but this point has already been covered.
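To make the difference concrete, here is a small sketch, not taken from the benchmark: `$chunker` is a hypothetical stand-in for one of the subs, and a plain hash is used instead of `cmpthese` so that the resulting keys can be inspected.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one of the benchmark subs: returns three chunks.
my $chunker = sub { return unpack '(A5)*', 'abcdefghijklmno' };

# Passing the code reference itself: one key paired with one value.
my %by_ref = ( regex1 => $chunker );
print ref( $by_ref{regex1} ), "\n";    # CODE

# Invoking it inside the list: the returned chunks are flattened in, so the
# list becomes ('regex1', 'abcde', 'fghij', 'klmno'), i.e. two unrelated pairs.
my %by_call = ( regex1 => $chunker->() );
print join( ',', sort keys %by_call ), "\n";    # fghij,regex1
print $by_call{regex1}, "\n";                   # abcde
```

In the second hash the value stored under `regex1` is a plain string rather than code, so there is nothing left for a benchmark to call.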
Re^5: Performance problems on splitting long strings by Laurent_R (Canon) on Feb 01, 2014 at 23:24 UTC
Re^6: Performance problems on splitting long strings by AnomalousMonk (Archbishop) on Feb 02, 2014 at 08:29 UTC
Re^4: Performance problems on splitting long strings by Jim (Curate) on Feb 01, 2014 at 22:25 UTC
"And yes, I was sort of expecting `unpack()` to be faster, but it is still better to try it to be sure."

Indeed. I certainly didn't mean to suggest your benchmark test wasn't a splendid idea. I only meant that, in this case, the outcome of the benchmark test was consistent with one's rational expectation based on one's understanding of `unpack()` and its raison d'être. After all, the Perl FAQ states:

    C:\>perldoc -q fixed | head -5
    Found in C:\strawberry\perl\lib\perlfaq5.pod
      How can I manipulate fixed-record-length files?
        The most efficient way is using pack() and unpack(). This is faster
        than using substr() when taking many, many strings. It is slower for
        just a few.

    C:\>

Jim
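For readers who have not used `unpack()` on fixed-length records, here is a rough sketch of the kind of use the FAQ entry has in mind. The record layout (a 10-character name, a 3-character age and an 8-character date) is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical fixed-width record: name (10), age (3), date (8).
my $record = sprintf '%-10s%3d%-8s', 'Alice', 42, '20140131';

# One template pulls out every field in a single call.
# 'A' strips trailing spaces, so the padded name comes back clean.
my ( $name, $age, $date ) = unpack 'A10 A3 A8', $record;

print "name=$name age=", $age + 0, " date=$date\n";
```

The same three fields could be taken with three `substr` calls; the FAQ's point is that the single `unpack` tends to win when many fields are extracted.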