G'day Marshall,

Thanks for the positive feedback. I have some comments on your first three points.

Re "... fastest way to remove leading and trailing white space ...". I've also seen the documentation about anchors, although I can't remember where (I have an inkling it may have been in a book). The regex I used was anchored at both ends (/^\s*(.*?)\s*$/). Whether two easy regexes beat one complex regex is going to depend on their relative complexity and on the string being operated on. I wrote this benchmark:

#!/usr/bin/env perl -l

use strict;
use warnings;

use constant STRING => " \t aaa bbb ccc \t \n";

use Benchmark 'cmpthese';

print 'Sanity Tests:';
print 'shoura: >', shoura_code(), '<';
print 'kcott: >', kcott_code(), '<';
print 'marshall: >', marshall_code(), '<';

cmpthese 0 => {
    S => \&shoura_code,
    K => \&kcott_code,
    M => \&marshall_code,
};

sub shoura_code {
    local $_ = STRING;
    chomp;
    s/^\s+|\s+$//g;
    return $_;
}

sub kcott_code {
    local $_ = STRING;
    ($_) = /^\s*(.*?)\s*$/;
    return $_;
}

sub marshall_code {
    local $_ = STRING;
    s/^\s+//;
    s/\s+$//;
    return $_;
}

I ran it five times (that's usual for me); here's the result that was closest to an average:

Sanity Tests:
shoura: >aaa bbb ccc<
kcott: >aaa bbb ccc<
marshall: >aaa bbb ccc<
       Rate    S    M    K
S 292306/s   -- -32% -37%
M 432626/s  48%   --  -7%
K 464863/s  59%   7%   --

There was quite a lot of variance, although 'K' was always faster than 'M'. The five K-M percentages were: 9, 7, 2, 14, 7. Both 'K' and 'M' were always substantially faster than 'S'.

Re "... split your $re statement into two parts ...". I often use the '@{[...]}' construct when interpolating the results of some processing into a string. My main intent was to create the regex once, instead of the (presumably) millions of times in the inner loop of the OP's code. I also benchmarked this (see the spoiler): it looks like your total saving would be measured in nanoseconds.
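As a rough sketch of the "create the regex once" point: the search terms and input lines below are made up for illustration, and build_regex is a hypothetical helper, not code from the thread. The '@{[...]}' construct interpolates the result of an expression into the pattern, so the alternation is built and compiled before the loop rather than inside it:

```perl
use strict;
use warnings;

# Hypothetical helper: build one capturing alternation from a list of terms.
# '@{[ ... ]}' interpolates the value of an expression into the pattern.
sub build_regex {
    my @terms = @_;
    return qr/(@{[ join '|', map { quotemeta($_) } @terms ]})/;
}

# Illustrative data only (assumed, not from the original thread).
my @terms = ('W X Y', 'X Y');

# Compiled once, outside the loop ...
my $re = build_regex(@terms);

# ... and reused for every line inside it.
for my $line ('a X Y b', 'no match here', 'W X Y') {
    print "$line => $1\n" if $line =~ $re;
}
```

Building the alternation in a separate statement first and then compiling it should behave the same; either way the work happens once, which is where the real saving lies.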

Re "I see no need at all to sort the search terms, ... The OP's requirement "for a sorted order" makes no sense to me at all.". I can understand that from the minimal test data supplied by the OP; however, the reason is probably to handle sequences with common sections. Consider the test data I used in the second benchmark:

my %seq = (
    'W X Y' => 'WbbbXbbbY',
    'X Y'   => 'XbbbY',
    'X Y Z' => 'XbbbYbbbZ',
);

If the target string was "W X Y Z", the result could be one of these three:

W XbbbY Z
WbbbXbbbY Z
W XbbbYbbbZ

Sorting by length would reduce that to two results. There may well be a requirement to also sort lexically. Perhaps like this:

sort { length $b <=> length $a || $a cmp $b }
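To make the effect of the ordering concrete, here is a minimal sketch. It assumes the replacements are applied one key at a time, in whatever order the keys are supplied; that is my assumption for illustration, and apply_in_order is a hypothetical helper, not the OP's code:

```perl
use strict;
use warnings;

my %seq = (
    'W X Y' => 'WbbbXbbbY',
    'X Y'   => 'XbbbY',
    'X Y Z' => 'XbbbYbbbZ',
);

# Hypothetical helper: apply each replacement in the order given.
sub apply_in_order {
    my ($target, $seq, @order) = @_;
    for my $key (@order) {
        $target =~ s/\Q$key\E/$seq->{$key}/g;
    }
    return $target;
}

# Longest keys first; equal lengths are ordered lexically,
# so the order (and therefore the result) is deterministic.
my @sorted = sort { length $b <=> length $a || $a cmp $b } keys %seq;

print "Order:  ", join(', ', map { "'$_'" } @sorted), "\n";
print "Result: ", apply_in_order('W X Y Z', \%seq, @sorted), "\n";
```

With that sort, 'W X Y' is tried before 'X Y Z' (same length, lexically earlier) and both are tried before 'X Y', so the target becomes "WbbbXbbbY Z" on every run rather than depending on hash order.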

But the OP has not given sufficient information. In fact, as I write this, it's been almost two days since the original posting and all requests for additional information have been ignored.

— Ken