Re: Performance problems on splitting long strings
by Kenosis (Priest) on Jan 30, 2014 at 20:02 UTC
use strict;
use warnings;
use Tie::CharArray;
use Benchmark qw/cmpthese/;

my $string = join '', 'A' .. 'Y';

sub _unpack {
    my @arr = unpack '(A5)*', $string;
}

sub _regex {
    my @arr = $string =~ /.{5}/g;
}

sub _split {
    my @arr = split /.{5}\K/, $string;
}

sub _substr {
    my @arr;
    for ( my $i = 0; $i < length $string; $i += 5 ) {
        push @arr, substr $string, $i, 5;
    }
}

sub _open {
    my @arr;
    open my $sh, '<', \$string;
    while ( read $sh, my $chars, 5 ) {
        push @arr, $chars;
    }
}

cmpthese(
    -5,
    {
        _unpack => sub { _unpack() },
        _regex  => sub { _regex() },
        _split  => sub { _split() },
        _substr => sub { _substr() },
        _open   => sub { _open() }
    }
);
Output:
Rate _open _regex _substr _split _unpack
_open 265986/s -- -53% -55% -57% -70%
_regex 563780/s 112% -- -5% -8% -36%
_substr 593788/s 123% 5% -- -3% -33%
_split 612001/s 130% 9% 3% -- -31%
_unpack 881949/s 232% 56% 49% 44% --
Rate _substr _unpack _regex _split
_substr 2187335/s -- -11% -16% -20%
_unpack 2457294/s 12% -- -6% -10%
_regex 2612321/s 19% 6% -- -4%
_split 2726283/s 25% 11% 4% --
The Perl code that produced these results:
use strict;
use warnings;
use Benchmark qw/cmpthese/;

my $string = join '', 'A' .. 'Y';

sub _unpack {
    my @arr = unpack '(A5)*', $string;
}

sub _regex {
    my @arr;
    while ( length $string ) { $string =~ s/^(.{5})//; push @arr, $1; }
}

sub _split {
    my @arr = split /.{5}\K/, $string;
}

sub _substr {
    my @arr;
    for ( my $i = 0; $i < length $string; $i += 5 ) {
        push @arr, substr $string, $i, 5;
    }
}

cmpthese(
    -5,
    {
        _unpack => sub { _unpack() },
        _split  => sub { _split() },
        _substr => sub { _substr() },
        _regex  => sub { _regex() }
    }
);
Your benchmark is totally broken.
When your _regex() function runs the first time, it completely destroys $string; every time after that, the regex operates on an empty string and thus runs very quickly. Ditto for every other test that runs after the first run of _regex().
Re: Performance problems on splitting long strings
by Not_a_Number (Prior) on Jan 30, 2014 at 19:26 UTC
Is there any way of saying something like "A4 repeated as many times as possible" (until the end of the string)?
Yes:
my @array = unpack '(A4)*', $string;
Thank you very much, I had overlooked this option of unpack. Your answer is really useful for me.
Re: Performance problems on splitting long strings
by davido (Cardinal) on Jan 31, 2014 at 00:45 UTC
Why does the long string have to be split into individual containers? Can you work with the large string, just in segments at offset multiples of five?
unpack is already lightning fast, but even less work is done if you can avoid making copies. Of course this is Perl, and sometimes the more idiomatic approaches (unpack, for example) can be faster than algorithmic approaches (walking through the existing string without making copies); but if unpack is still too slow, and if whatever you are using those little substrings for could be done in place, it might be worth attempting a solution that works on the original string.
>perl -wMstrict -le
"use constant LEN => 5;
;;
my $s = '1234598765555553456733333';
print qq{'$s'};
;;
my %dispatch = (
    '55555' => sub { return 'x' x length $_[0]; },
);
;;
for (my $offset = 0; $offset < length $s; $offset += LEN) {
    for (substr $s, $offset, LEN) {
        $_ = exists $dispatch{$_} ? $dispatch{$_}->($_) : $_ + 2;
    }
}
print qq{'$s'};
"
'1234598765555553456733333'
'1234798767xxxxx3456933335'
Update: Changed example code to also exemplify topicalization of sub-string segment via for-structure (given no longer being quite kosher).
Hi Dave,
thank you for your input. The truth of the matter is that I am the victim of earlier poor design. The long string that I am talking about is a list of billing services for a client. All billing service codes are 5 characters long. What I need to do with this list (with a bit of simplification) is to filter in those services that are of use for our purpose, and to sort them. The ultimate goal is to compare the two large files after having pre-processed them.
Re: Performance problems on splitting long strings
by Laurent_R (Canon) on Jan 31, 2014 at 18:33 UTC
The following is the post I prepared yesterday evening (after 2 a.m. this morning, in fact), but I guess I was too tired: I previewed it, but stupidly forgot to hit the create button.
OK, now I ran a benchmark with the various possibilities.
The results might be useful to others.
I tried 8 different solutions: a "C_style" solution (converting the string into an array of individual characters), two regexes (one with /\w{5}/ and one with /.{5}/), the opening of a filehandle on a reference to the string, two variations on a loop with the substr function, the split solution offered by Kenosis (although I probably would not be able to use it with the old version of Perl that we have on our servers, I could test it at home on my more recent version), and unpack.
The following is the code (borrowed in part from Kenosis):
And these are the results:
            Rate C_Style regex1 regex2    FH split substr1 substr2 unpack
C_Style   5598/s      --   -72%   -72%  -78%  -79%    -80%    -80%   -83%
regex1   19690/s    252%     --    -1%  -24%  -28%    -28%    -31%   -39%
regex2   19968/s    257%     1%     --  -23%  -26%    -27%    -30%   -38%
FH       25984/s    364%    32%    30%    --   -4%     -5%     -9%   -20%
split    27161/s    385%    38%    36%    5%    --     -1%     -5%   -16%
substr1  27355/s    389%    39%    37%    5%    1%      --     -4%   -16%
substr2  28523/s    409%    45%    43%   10%    5%      4%      --   -12%
unpack   32428/s    479%    65%    62%   25%   19%     19%     14%     --
So unpack clearly wins the race, but I was surprised to see that substr is not that far behind.
Update this evening (Jan 31, 2014 at 18:45): I incorporated the unpack solution into my program at work today, and the speed gain I obtained on my real data is significantly better than what the figures of the benchmark above would suggest. Profiling shows that the modified code line runs, surprisingly, almost twice as fast as the original one.
cmpthese( -1,
    {
        regex1  => $regex1,
        regex2  => $regex2,
        unpack  => $unpack,
        split   => $split,
        substr1 => $substr1,
        substr2 => $substr2,
        FH      => $filehandle,
        C_Style => $c_style_string,
    }
);
IMHO, the superior performance of unpack() is perfectly predictable. This is what it's for.
Jim
You can simplify the calls to the anonymous subroutines. ...
Thank you for your comment, Jim.
And, yes, I wanted to write something like that, and that's the reason why I built references to anonymous subs in the first place, rather than simple named functions. But for some reason I got the syntax for calling the subs in cmpthese wrong. I am not sure I remember exactly, but I think I first did something like:
regex1 => $regex1->(),
which gave compile errors. The first quick way I found to make it work was to wrap the function call in a sub block like this:
regex1 => sub {$regex1->()},
I realize that this is not the most elegant construct, but once it worked I was happy enough to get my results, and I was too tired, at around 2 a.m., to spend more time investigating how to simplify the calls.
And yes, I was sort of expecting unpack() to be faster, but it is still better to try it to be sure.
Re: Performance problems on splitting long strings
by hdb (Monsignor) on Jan 30, 2014 at 19:22 UTC
Thank you. I know how to use the Benchmark module, no problem with that; what I am looking for is other ideas on how to split my data more efficiently, so that I can benchmark them.
use strict;
use warnings;
use Benchmark 'cmpthese';

# join is needed here: a bare "my $string = map ..." would call map in
# scalar context and assign the element count, not the generated string
my $string = join '', map { ('a'..'z')[rand 26] } 1..30;
my @sub_fields;

cmpthese( -1, {
    regex1 => sub { @sub_fields = $string =~ /\w{5}/g },
    regex2 => sub { @sub_fields = $string =~ /.{5}/g },
    unpack => sub { @sub_fields = unpack '(A5)*', $string },
    substr => sub { @sub_fields = map { substr $string, 5*$_, 5 } 0..length( $string )/5-1 },
});
which already gives a first comparison of the approaches:
Rate substr unpack regex1 regex2
substr 696486/s -- -57% -94% -94%
unpack 1603093/s 130% -- -85% -86%
regex1 10731041/s 1441% 569% -- -4%
regex2 11165392/s 1503% 596% 4% --
Re: Performance problems on splitting long strings
by boftx (Deacon) on Jan 30, 2014 at 19:49 UTC
You would have to benchmark this to be sure, but you might try splitting on '' to get a char array, then shift or splice off multiples of 5 chars at a time. (Though I suspect unpack will be faster.)
Re: Performance problems on splitting long strings
by Laurent_R (Canon) on Jan 31, 2014 at 07:45 UTC
I made the benchmark yesterday evening (well, finished it around 2 a.m. this morning) and thought I had posted it. It seems I must have previewed it but omitted to create the post. I am no longer at home, so posting it again will have to wait until this evening. To make things short, unpack was the fastest of the options I tested.