
Firstly, thanks to everyone who responded to my previous post on Capturing the nth word in a string!

I have benchmarked the algorithms everyone provided to gain some insight into how each one behaves in Perl.

use strict;
use Benchmark qw/timethese/;
my $str;
my $index = 82;

# construct a small string
foreach (1..100) { $str .= "Element$_" . ' ' }

timethese(100000, {
  'FindNth_Roger' => '&FindNth_Roger',
  'FindNth_Zaxo_Array' => '&FindNth_Zaxo_Array',
  'FindNth_Zaxo_Split' => '&FindNth_Zaxo_Split',
  'FindNth_Ysth' => '&FindNth_Ysth',
  'FindNth_Pg' => '&FindNth_Pg',
  'FindNth_Grantm' => '&FindNth_Grantm',
  'FindNth_Jasper' => '&FindNth_Jasper',
});

sub FindNth_Roger() {
    my $nth = @{[$str =~ m/\w+/g]}[$index-1];
}

sub FindNth_Zaxo_Array() {
    my $nth = ($str =~ m/\w+/g)[$index-1];
}

sub FindNth_Zaxo_Split() {
    my $nth = (split ' ', $str)[$index-1];
}

sub FindNth_Ysth() {
    my $nth = [$str =~ m/\w+/g]->[$index-1];
}

sub FindNth_Pg() {
    my ($nth) = ($str =~ m/(\w+\s*){$index}/);
}

sub FindNth_Grantm() {
    my $idx = $index - 1;
    my ($nth) = $str =~ /(?:\w+\W+){$idx}(\w+)/;
}

sub FindNth_Jasper() {
    my $nth;
    my %h;
    for ($str=~/.?/g) {
        $h{space} += !/\S/;
        $nth .= $_ if $h{space} == $index-1 && /\S/ .. !/\S/
    }
}

The benchmark produced the following result:

Benchmark: timing 100000 iterations of FindNth_Grantm, FindNth_Jasper, FindNth_Pg, FindNth_Roger, FindNth_Ysth, FindNth_Zaxo_Array, FindNth_Zaxo_Split...

FindNth_Grantm:  3 wallclock secs (3.50 usr +  0.00 sys =  3.50 CPU) @ 28571.43/s
FindNth_Jasper: 379 wallclock secs (376.66 usr +  0.06 sys = 376.72 CPU) @ 265.45/s
FindNth_Pg:  4 wallclock secs (3.91 usr +  0.00 sys =  3.91 CPU) @ 25595.09/s
FindNth_Roger: 21 wallclock secs (21.03 usr +  0.00 sys = 21.03 CPU) @ 4754.66/s
FindNth_Ysth: 21 wallclock secs (20.95 usr +  0.00 sys = 20.95 CPU) @ 4772.36/s
FindNth_Zaxo_Array: 19 wallclock secs (18.53 usr +  0.00 sys = 18.53 CPU) @ 5396.36/s
FindNth_Zaxo_Split: 14 wallclock secs (14.51 usr +  0.02 sys = 14.53 CPU) @ 6882.31/s

This benchmark probably doesn't do justice to Jasper's solution: it stores the space count in a hash, and the name lookup on every character is a considerable overhead. I would replace $h{space} with a plain $space scalar. Even so, the cost of pulling one character at a time and checking whether it is a space remains large.
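As a rough illustration of the scalar-counter idea, here is a minimal sketch. It is not Jasper's code: the sub name find_nth_charscan is hypothetical, the flip-flop operator is replaced by a plain comparison, and it assumes words are separated by exactly one whitespace character, as in the benchmark string.

```perl
use strict;
use warnings;

# Hypothetical variant of the character-scan idea: the whitespace count
# lives in a plain scalar instead of a hash slot.
# Assumes exactly one whitespace character between words.
sub find_nth_charscan {
    my ($str, $index) = @_;
    my $space = 0;     # whitespace characters seen so far
    my $nth   = '';
    for my $ch (split //, $str) {
        if ($ch =~ /\s/) {
            $space++;
            last if $space >= $index;   # we are past the nth word
            next;
        }
        # Characters of the nth word follow ($index - 1) separators.
        $nth .= $ch if $space == $index - 1;
    }
    return $nth;
}

my $sample = join '', map { "Element$_ " } 1 .. 100;
print find_nth_charscan($sample, 82), "\n";
```

This still touches every character up to the target word, so it would be expected to stay far behind the regex-based versions; it only removes the hash lookup.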

In Roger's (my) solution, I constructed an anonymous array from the list that the regular expression returned and then sliced it. That takes about 13% more CPU time than slicing the returned list directly, as Zaxo suggested (21.03 vs. 18.53 CPU seconds).

Zaxo's split approach is about 28 percent faster than his regexp list slice (6882/s vs. 5396/s). This suggests that (split ' ', $str) is about 28 percent faster than ($str =~ /\w+/g) on this input, so split is the more efficient way to break up a simple whitespace-separated string.
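The timing comparison only holds when the two forms mean the same thing, and they do only for input like the benchmark string. A small sketch of where they diverge (the sample text is made up for illustration):

```perl
use strict;
use warnings;

# Made-up sample with leading whitespace and an apostrophe.
my $text = "  don't stop  now ";

my @by_space = split ' ', $text;     # awk-style: leading whitespace is skipped
my @by_regex = split /\s+/, $text;   # leading whitespace yields an empty first field
my @by_word  = $text =~ /\w+/g;      # \w+ also breaks words at the apostrophe

print scalar(@by_space), " fields from split ' '\n";     # 3
print scalar(@by_regex), " fields from split /\\s+/\n";  # 4
print scalar(@by_word),  " matches from /\\w+/g\n";      # 4
```

For the benchmark string (plain word characters, single spaces, no leading whitespace) all three agree on which word is number 82.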

The solution suggested by Ysth is equivalent to Roger's and performs about the same (4772/s vs. 4755/s). Both build an anonymous array first, which suggests that the choice between @{ ... }[...] and [...]->[...] dereferencing makes no measurable difference.
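For reference, the three list-based forms side by side. This sketch only checks that they return the same word; the variable names are mine, and the string is rebuilt here so the snippet runs on its own.

```perl
use strict;
use warnings;

my $str   = join '', map { "Element$_ " } 1 .. 100;
my $index = 82;

# Roger: copy the match list into an anonymous array, then slice it.
my $roger = @{[ $str =~ /\w+/g ]}[$index - 1];

# Ysth: same anonymous array, indexed with the arrow operator.
my $ysth = [ $str =~ /\w+/g ]->[$index - 1];

# Zaxo: slice the list returned by the match directly, with no array copy.
my $zaxo = ($str =~ /\w+/g)[$index - 1];

print "$roger $ysth $zaxo\n";
```

Only the third form avoids allocating the temporary array, which is consistent with it being the fastest of the three in the table above.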

The solutions provided by Grantm and Pg are both very efficient: each is between about 3.7 and 6 times faster than the split and list-slice approaches.

Grantm's algorithm is a touch faster than Pg's (about 12% in this run). I took a closer look at the differences:

# Pg's approach
my ($nth) = ($str =~ m/(\w+\s*){$index}/);

# Grantm's approach
my $idx = $index - 1;
my ($nth) = $str =~ /(?:\w+\W+){$idx}(\w+)/;

But I failed to understand why Grantm's approach is slightly faster than Pg's. I suspect it is either the \s* quantifier or the difference between the capturing group and the non-capturing (?:...) group. Perhaps other monks can enlighten me on this.
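One observable difference, which may or may not account for the timing gap, is that the two patterns do not capture the same text. In Pg's pattern the capturing group sits inside the quantifier, so it is filled again on every repetition and ends up holding the last repetition, trailing whitespace included. A small sketch (string rebuilt here so it runs standalone):

```perl
use strict;
use warnings;

my $str   = join '', map { "Element$_ " } 1 .. 100;
my $index = 82;

# Pg: quantified capturing group; $1 holds the final repetition,
# and the greedy \s* pulls the trailing space into it.
my ($pg) = $str =~ /(\w+\s*){$index}/;

# Grantm: skip $idx words without capturing, then capture one word.
my $idx = $index - 1;
my ($grantm) = $str =~ /(?:\w+\W+){$idx}(\w+)/;

print "Pg:     [$pg]\n";       # [Element82 ]
print "Grantm: [$grantm]\n";   # [Element82]
```

So Pg's result would need a trailing-space trim before the two could be treated as interchangeable.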

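Nevertheless, I came to the following conclusions:

Perl's built-in split function is the fastest way to split a simple string;

Perl memory operations are relatively expensive, e.g. building an anonymous array or a slice;

Extracting every word into a list before indexing scans the whole string, which is O(n) in its length. Capturing the nth word directly stops at that word, so on average it scans about n/2, which is still O(n) with a smaller constant. I wondered whether the Perl regular expression engine optimizes this to something like O(log n), but any match that has to count words must still read everything before the target, so the gain more likely comes from not building the list.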

My approach to similar problems in Perl in the future:

see whether a single result can be captured directly by a regular expression, which takes advantage of the regular expression engine's optimizations;

use Perl's built-in functions whenever possible;

look for ways to avoid temporary hashes or arrays.
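Anyone who wants to re-check these guidelines on their own machine could start from a sketch like the one below. It uses cmpthese from the same Benchmark module with code references instead of strings. The sub names are placeholders of mine, and the negative count asks Benchmark to run each entry for at least that many CPU seconds rather than a fixed number of iterations.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $str   = join '', map { "Element$_ " } 1 .. 100;
my $index = 82;

# One representative per guideline (placeholder names).
sub nth_by_split { return (split ' ', $str)[$index - 1] }   # built-in function
sub nth_by_list  { return ($str =~ /\w+/g)[$index - 1] }    # temporary list
sub nth_by_match {                                           # direct capture
    my $idx = $index - 1;
    my ($nth) = $str =~ /(?:\w+\W+){$idx}(\w+)/;
    return $nth;
}

# Run each entry for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    'split' => \&nth_by_split,
    'list'  => \&nth_by_list,
    'match' => \&nth_by_match,
});
```

The absolute rates will differ from the table above depending on Perl version and hardware; only the relative ordering is worth comparing.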

What are your thoughts? Please correct my analysis if wrong or incomplete (which I somewhat suspect).

Thanks and regards.

Edited by castaway - changed 'a' tag to a pm id:// link, added a readmore tag.

In reply to "Capturing the nth word in a string" Algorithm Analysis by Roger
