"Capturing the nth word in a string" Algorithm Analysis

Roger has asked for the wisdom of the Perl Monks concerning the following question:

Firstly, thanks everyone who responded to my previous post on Capturing the nth word in a string!

I have benchmarked the algorithms everyone has provided to gain some insight into how each algorithm behave in Perl.

use strict;
use Benchmark qw/timethese/;
my $str;
my $index = 82;

# construct a small string
foreach (1..100) { $str .= "Element$_" . ' ' }

timethese(100000, {
  'FindNth_Roger' => '&FindNth_Roger',
  'FindNth_Zaxo_Array' => '&FindNth_Zaxo_Array',
  'FindNth_Zaxo_Split' => '&FindNth_Zaxo_Split',
  'FindNth_Ysth' => '&FindNth_Ysth',
  'FindNth_Pg' => '&FindNth_Pg',
  'FindNth_Grantm' => '&FindNth_Grantm',
  'FindNth_Jasper' => '&FindNth_Jasper',
});

sub FindNth_Roger() {
    my $nth = @{[$str =~ m/\w+/g]}[$index-1];
}

sub FindNth_Zaxo_Array() {
    my $nth = ($str =~ m/\w+/g)[$index-1];
}

sub FindNth_Zaxo_Split() {
    my $nth = (split ' ', $str)[$index-1];
}

sub FindNth_Ysth() {
    my $nth = [$str =~ m/\w+/g]->[$index-1];
}

sub FindNth_Pg() {
    my ($nth) = ($str =~ m/(\w+\s*){$index}/);
}

sub FindNth_Grantm() {
    my $idx = $index - 1;
    my ($nth) = $str =~ /(?:\w+\W+){$idx}(\w+)/;
}

sub FindNth_Jasper() {
    my $nth;
    my %h;
    for ($str=~/.?/g) {
        $h{space} += !/\S/;
        $nth .= $_ if $h{space} == $index-1 && /\S/ .. !/\S/
    }
}
[download]

And the following is the result of the benchmark -

Benchmark: timing 100000 iterations of FindNth_Grantm, FindNth_Jasper,
FindNth_Pg, FindNth_Roger, FindNth_Ysth, FindNth_Zaxo_Array,
FindNth_Zaxo_Split...

FindNth_Grantm:  3 wallclock secs (3.50 usr +  0.00 sys =  3.50 CPU)
@ 28571.43/s
FindNth_Jasper: 379 wallclock secs (376.66 usr +  0.06 sys = 376.72 CP
+U)
@ 265.45/s
FindNth_Pg:  4 wallclock secs (3.91 usr +  0.00 sys =  3.91 CPU)
@ 25595.09/s
FindNth_Roger: 21 wallclock secs (21.03 usr +  0.00 sys = 21.03 CPU)
@ 4754.66/s
FindNth_Ysth: 21 wallclock secs (20.95 usr +  0.00 sys = 20.95 CPU)
@ 4772.36/s
FindNth_Zaxo_Array: 19 wallclock secs (18.53 usr +  0.00 sys = 18.53 C
+PU)
@ 5396.36/s
FindNth_Zaxo_Split: 14 wallclock secs (14.51 usr +  0.02 sys = 14.53 C
+PU)
@ 6882.31/s
[download]

It probably doesn't do justice to Jasper's solution, because it uses a hash table with name look up to store space position, which is a considerable overhead. I would probably change the $h{space} with a simpler $space scalar. But still the overhead of "pulling one character at a time and check if it is a space" is big.

In Roger's (mine) solution, I constructed an anonymous array on top of the array that perl regular expression created. The overhead is about 15% greater than using the array returned by the regular expression directly as suggested by Zaxo.

Zaxo's split approach is about 30 percent faster than the regexp array reference approach. Which suggests that the (split /\s+/, $str) operation is about 30 percent faster than ($str =~ /\w+/g). The conclusion is that split is more efficient on breaking up a string than using regular expression.

The solution suggested by Ysth is equivalent to Roger's solution with roughly the same performance. This suggests that the @{ ... } operator in Perl is quite efficient.

The solution provided by Grantm and Pg are both very efficient, both at least 5 times faster than other algorithms.

Grantm's algorithm is a touch faster than Pg's algorithm. I took a closer look at the differences -

# Pg's approach
my ($nth) = ($str =~ m/(\w+\s*){$index}/);

# Grantm's approach
my $idx = $index - 1;
my ($nth) = $str =~ /(?:\w+\W+){$idx}(\w+)/;
[download]

But I failed to understand why Grantm's approach is slightly faster than Pg's approach. I think it's either because of the \s* operator, or the (?:..) operator. Perhaps other monks can enlighten me on this.

Nevertheless, I came up with the following conclusions:

Perl internal split function is the fastest way to split a simple string;

Perl memory operations are relatively expensive, eg., building an anonymous array / slice;

The algorithm of scanning through the entire array before dereferencing has the efficiency of O, where grabbing the word without directly going through the array is O/2, however Perl regular expression engine optimized it to log(O)?

The approach to solve similar problems in Perl in the future

see if can get a singular result directly from regular expression, this takes advantage of the Perl regular expression optimization;

use Perl internal function whenever possible;

look for ways to avoid using temporary hashes or arrays if possible;

What are your thoughts? Please correct my analysis if wrong or incomplete (which I somewhat suspect).

Thanks and regards.

Edited by castaway - changed 'a' tag to a pm id:// link, added a readmore tag.

Comment on "Capturing the nth word in a string" Algorithm Analysis Select or Download Code

Replies are listed 'Best First'.
Re: "Capturing the nth word in a string" Algorithm Analysis by pg (Canon) on Oct 15, 2003 at 02:17 UTC
Here is the reason why my approach was a 'touch' slower than Grantm's, but both are much faster than others: Say both of us are trying to match the 3rd word, his regexp only matches the 3rd word, but for the first word and the second word, there is no matching done (to be more precise, no extra effort to make sure backref can happen); But in my case, I am matching all 3 words up to the 3rd one. Although I end up with having the third word stored in $1 just as his does, extra effort are taken to handle the 1st and 2nd words, to make sure the backref for the immediate next one works. But in both cases (mine and Grantm's), the effort is only up to the 3rd word, not like in some other cases, the matching continues to the end of the string. Performance (timing) is just one benefit gained here, the other benefit is we used less space, as we don't create an array to hold all words. Now let's play a little bit math. Remember you mentioned the O and O/2 observation? Assume we have a string formed up with m words, and the effort to matching one word is n, in the testing you designed: If the approach is trying to match the entire string, then it will take nmm. If the approach only matches up to the given ith word, the cost is (1 + 2 + 3 + ....+ m) * n = m(m+1)n/2. So they are close to O : O/2.	[reply]
Re: Re: "Capturing the nth word in a string" Algorithm Analysis by Roger (Parson) on Oct 15, 2003 at 02:37 UTC
Hi Pg, thanks for your detailed analysis. I understand a bit better now.	[reply]
Re: "Capturing the nth word in a string" Algorithm Analysis by Abigail-II (Bishop) on Oct 15, 2003 at 09:55 UTC
Not so fast. All you've done is compare the results when applying it on a particular string. If you vary the string a bit (removing whitespace, 50 words instead of 100, replacing the 'word' chars with 'non-word' chars), you get different results. Certainly, those strings don't match, but it's important to take the 'time to failure' into account as well. Below some reworked code. It's still using the same approaches, with the exception of 'Pg's approach. It scored well on the matching case, but if you take out the whitespace in the string, it's too slow to wait for an answer. use strict; use Benchmark qw/cmpthese/; my @str; my $index = 82; # construct a small string foreach (1 .. 100) { $str [0] .= "Element$_" . ' '; $str [1] .= "Element$_"; } foreach (1 .. 50) { $str [2] .= "Element$_" . ' '; $str [3] .= "Element$_"; } $str [4] = "&()-=-_<> " x 100; $str [5] = "&()-=-_<>" x 100; $str [6] = "&()-=-_<> " x 50; $str [7] = "&()-=-_<>" x 50; my @info = ( "100 words separated by whitespace", "Same, without the whitespace", "50 words separated by whitespace", "Same, without the whitespace", "100 non-words separated by whitespace", "Same, without the whitespace", "50 non-words separated by whitespace", "Same, without the whitespace", ); my $str; for (my $i = 0; $i < @str && $i < @info; $i ++) { print $info [$i], "\n"; $str = $str [$i]; cmpthese (-1, { Roger => \&FindNth_Roger, Zaxo_Array => \&FindNth_Zaxo_Array, Zaxo_Split => \&FindNth_Zaxo_Split, Ysth => \&FindNth_Ysth, # Pg => \&FindNth_Pg, Grantm => \&FindNth_Grantm, Jasper => \&FindNth_Jasper, }); } sub FindNth_Roger() { my $nth = @{[$str =~ m/\w+/g]}[$index-1]; } sub FindNth_Zaxo_Array() { my $nth = ($str =~ m/\w+/g)[$index-1]; } sub FindNth_Zaxo_Split() { my $nth = (split ' ', $str)[$index-1]; } sub FindNth_Ysth() { my $nth = [$str =~ m/\w+/g]->[$index-1]; } sub FindNth_Pg() { my ($nth) = ($str =~ m/(\w+\s*){$index}/); } sub FindNth_Grantm() { my $idx = $index - 1; my ($nth) = $str =~ /(?:\w+\W+){$idx}(\w+)/; } sub FindNth_Jasper() { my $nth; my %h; for ($str=~/.?/g) { $h{space} += !/\S/; $nth .= $_ if $h{space} == $index-1 && /\S/ .. !/\S/ } } __END__ 100 words separated by whitespace (warning: too few iterations for a reliable count) Rate Jasper Roger Ysth Zaxo_Array Zaxo_Split + Grantm Jasper 429/s -- -95% -95% -96% -98% + -98% Roger 8614/s 1910% -- -0% -19% -54% + -69% Ysth 8650/s 1918% 0% -- -19% -54% + -69% Zaxo_Array 10676/s 2391% 24% 23% -- -43% + -62% Zaxo_Split 18797/s 4286% 118% 117% 76% -- + -33% Grantm 27926/s 6416% 224% 223% 162% 49% + -- Same, without the whitespace Rate Jasper Grantm Roger Ysth Zaxo_Array +Zaxo_Split Jasper 491/s -- -93% -100% -100% -100% + -100% Grantm 6636/s 1250% -- -94% -94% -95% + -95% Roger 110277/s 22340% 1562% -- -4% -22% + -23% Ysth 114840/s 23269% 1631% 4% -- -19% + -20% Zaxo_Array 141940/s 28783% 2039% 29% 24% -- + -1% Zaxo_Split 143360/s 29072% 2060% 30% 25% 1% + -- 50 words separated by whitespace Rate Grantm Jasper Roger Ysth Zaxo_Array +Zaxo_Split Grantm 610/s -- -32% -96% -96% -97% + -98% Jasper 896/s 47% -- -95% -95% -96% + -97% Roger 16440/s 2595% 1734% -- -1% -20% + -54% Ysth 16593/s 2620% 1751% 1% -- -19% + -53% Zaxo_Array 20479/s 3257% 2185% 25% 23% -- + -42% Zaxo_Split 35544/s 5727% 3866% 116% 114% 74% + -- Same, without the whitespace Rate Jasper Grantm Roger Ysth Zaxo_Array +Zaxo_Split Jasper 984/s -- -93% -99% -99% -100% + -100% Grantm 13398/s 1262% -- -92% -92% -94% + -94% Roger 165414/s 16714% 1135% -- -6% -29% + -31% Ysth 175812/s 17770% 1212% 6% -- -25% + -26% Zaxo_Array 234057/s 23691% 1647% 41% 33% -- + -2% Zaxo_Split 238602/s 24153% 1681% 44% 36% 2% + -- 100 non-words separated by whitespace Rate Jasper Roger Ysth Zaxo_Array Zaxo_Split + Grantm Jasper 393/s -- -96% -96% -97% -98% + -99% Roger 9050/s 2206% -- 0% -20% -51% + -67% Ysth 9050/s 2206% 0% -- -20% -51% + -67% Zaxo_Array 11270/s 2771% 25% 25% -- -39% + -60% Zaxo_Split 18442/s 4598% 104% 104% 64% -- + -34% Grantm 27836/s 6992% 208% 208% 147% 51% + -- Same, without the whitespace Rate Jasper Ysth Roger Zaxo_Array Grantm +Zaxo_Split Jasper 432/s -- -95% -95% -96% -98% + -100% Ysth 9050/s 1993% -- -0% -20% -67% + -93% Roger 9050/s 1993% 0% -- -20% -67% + -93% Zaxo_Array 11270/s 2506% 25% 25% -- -60% + -91% Grantm 27836/s 6337% 208% 208% 147% -- + -79% Zaxo_Split 130327/s 30038% 1340% 1340% 1056% 368% + -- 50 non-words separated by whitespace Rate Grantm Jasper Roger Ysth Zaxo_Array +Zaxo_Split Grantm 369/s -- -53% -98% -98% -98% + -99% Jasper 785/s 113% -- -95% -95% -96% + -98% Roger 17231/s 4568% 2095% -- -0% -22% + -51% Ysth 17300/s 4586% 2104% 0% -- -21% + -51% Zaxo_Array 21976/s 5853% 2699% 28% 27% -- + -38% Zaxo_Split 35484/s 9512% 4420% 106% 105% 61% + -- Same, without the whitespace Rate Grantm Jasper Roger Ysth Zaxo_Array +Zaxo_Split Grantm 410/s -- -52% -98% -98% -98% + -100% Jasper 861/s 110% -- -95% -95% -96% + -100% Roger 17230/s 4101% 1902% -- -1% -22% + -92% Ysth 17397/s 4142% 1922% 1% -- -21% + -92% Zaxo_Array 21976/s 5259% 2454% 28% 26% -- + -90% Zaxo_Split 219429/s 53407% 25398% 1174% 1161% 899% + -- [download] Note how 'Grantm' sometimes is the fastest, sometimes is the slowest. Abigail	[reply] [d/l]
Re: Re: "Capturing the nth word in a string" Algorithm Analysis by Roger (Parson) on Oct 15, 2003 at 10:26 UTC
Thanks Abigail for the exellent analysis you have provided. Indeed I was a bit short-sighted when I made the conclusion earlier, where I failed to explore the case of un-matches. Interesting to see that the split and direct array reference methods are not bad after all. :)	[reply]
Re: "Capturing the nth word in a string" Algorithm Analysis by diotalevi (Canon) on Oct 15, 2003 at 14:12 UTC
The conclusion is that split is more efficient on breaking up a string than using regular expression. This is a quibble on perl internals. Split's first argument is always, one hundred percent of the time time a regular expression. Ninety-nine percent of the time the expression is given literally to split whether as a string literal (`split "\\s+"`) or a plain expression (`split /\s+/`). When given the string literal `' '` split really, truely does use /\s+/. (it also does a quick skip of leading whitespace). In that sense split can't be said to be that different from a regex since it internally uses the regex engine all the time. Quibble off.	[reply] [d/l] [select]
Re: "Capturing the nth word in a string" Algorithm Analysis by cbraga (Pilgrim) on Oct 15, 2003 at 02:01 UTC
Interesting results, but nothing terribly new either. I'd have found them easier to read if you condensed the text a bit though. `ESC[78;89;13p ESC[110;121;13p`	[reply]