Re: performance enhancement

The "normal" way to do this is with two substitutions: `$str =~ s/^\s+//; $str =~ s/\s+\z//;`. Is that not fast enough? I see a few problems with String::Strip:

Doesn't handle Unicode whitespace when given UTF-8 input.
Truncates strings that contain null characters. (Also, will violate bounds if fed strings that perl wasn't able to put a "safety" null terminator on.)
When stripping leading spaces, ends up copying the whole string - something that the substitutions optimize away - which will be a disadvantage for large strings.
When copying the string, relies on overlapping strcpy working - something about which the C standard says "the behavior is undefined."
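
For comparison, here is a minimal sketch of the two-substitution approach wrapped in a sub. The name `trim` and the sample strings are just assumptions for illustration; nothing here comes from String::Strip. It should cope with embedded NUL characters and with Unicode whitespace in character strings:

```perl
use strict;
use warnings;

# Hypothetical helper: trims a copy of the string with the two substitutions.
sub trim {
    my ($str) = @_;
    $str =~ s/^\s+//;     # leading whitespace
    $str =~ s/\s+\z//;    # trailing whitespace, anchored at the very end
    return $str;
}

my $plain    = trim("  hello world \n");
my $with_nul = trim("  a\0b  ");                      # embedded NUL survives
my $unicode  = trim("\x{2003}caf\x{e9}\x{2003}");     # EM SPACE on both sides

printf "%d %d %d\n", length $plain, length $with_nul, length $unicode;
```

Since the regex engine works on Perl's own string length rather than looking for a terminator, the NUL in the middle is treated like any other character.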

Re^2: performance enhancement by demerphq (Chancellor) on Jul 19, 2006 at 22:37 UTC
> The "normal" way to do this is with two substitutions

I've often pondered an optimisation of `$s=~s/^\s+|\s+$//g` so that this is no longer true. So far it's been over my head, in the sense of requiring too much research time to implement compared to other useful tasks that I can do, but maybe one day...

And for people wondering why this isn't the recommended way: it's because this pattern will be tried at every position in the string. The regex engine isn't currently smart enough to optimise this to only try the pattern twice.

--- $world=~s/war/peace/g
Re^3: performance enhancement by GrandFather (Saint) on Jul 19, 2006 at 23:46 UTC
Why `s/^\s+|\s*$//g` rather than `s/^\s+|\s+$//g`, `s/^\s*|\s*$//g` or `s/^\s*|\s+$//g`?

A benchmark suggests the two-substitution approach is faster than any of the single-substitution approaches, and that there are interesting variations between the different single-substitution options:

               Rate starstar plusstar plusplus starplus   twosub
    starstar 47.0/s       --      -8%     -25%     -28%     -42%
    plusstar 51.2/s       9%       --     -18%     -21%     -37%
    plusplus 62.5/s      33%      22%       --      -4%     -23%
    starplus 65.1/s      39%      27%       4%       --     -20%
    twosub   81.6/s      74%      59%      31%      25%       --

The benchmark uses a single large string (100_000 characters) with a fairly large run of spaces (1000) at the start and end.

DWIM is Perl's answer to Gödel
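
The benchmark code itself isn't reproduced in this thread. The following is only a rough sketch of how a comparable test could be set up with the core Benchmark module, not the code that produced the numbers above. The filler text in the middle of the string and the use of `$` in the twosub variant are assumptions:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: 1000 spaces, then filler words, then 1000 spaces,
# for roughly 100_000 characters in total.
my $pad  = ' ' x 1000;
my $body = join ' ', ('word') x 19_600;
my $text = $pad . $body . $pad;

# Each variant trims a fresh copy and returns the result.
my %variant = (
    starstar => sub { my $s = $text; $s =~ s/^\s*|\s*$//g; $s },
    plusstar => sub { my $s = $text; $s =~ s/^\s+|\s*$//g; $s },
    plusplus => sub { my $s = $text; $s =~ s/^\s+|\s+$//g; $s },
    starplus => sub { my $s = $text; $s =~ s/^\s*|\s+$//g; $s },
    twosub   => sub { my $s = $text; $s =~ s/^\s+//; $s =~ s/\s+$//; $s },
);

# Sanity check: every variant must produce the same trimmed string.
for my $name (sort keys %variant) {
    die "$name did not trim correctly" unless $variant{$name}->() eq $body;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, \%variant);
```

Expect the absolute rates, and quite possibly the ordering, to differ between Perl versions and with different filler text.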
Re^4: performance enhancement by demerphq (Chancellor) on Jul 20, 2006 at 12:30 UTC
> Why `s/^\s+|\s*$//g` rather than `s/^\s+|\s+$//g`, `s/^\s*|\s*$//g` or `s/^\s*|\s+$//g`?

Er, the quantifier mismatch was a typo. I have corrected the original node. But it's good, as you can see the speed advantage of the twosub method. Although I suspect you would see a radically different result if the string were more "normal", for instance the content of a node, with the intention of trimming each line.

Also, you have to be very careful when benchmarking regexes: really subtle differences in the input string and the pattern can result in wildly different run times due to how the optimiser handles them. For instance, if your string/pattern facilitates a single FBM (Fast Boyer-Moore) search followed by a match followed by a failing FBM search, then it's going to be massively faster than a pattern where an FBM search matches many times, with each match rejected by the regex engine itself afterwards. FBM is really fast; the regex engine is not.

In fact, despite the common perception that the regex engine itself is fast, I'd say it's not. Rather, the reputation comes from using a lot of really tricky optimisations to cut down as much as possible how much the regex engine proper is involved. In other words, the Perl regex engine is perceived as fast mostly because we do our damnedest not to use it when we don't need to.

--- $world=~s/war/peace/g
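
As an illustration of the "trim each line" case mentioned above, here is a small sketch under some assumptions: the sample text is made up, and `\h` (horizontal whitespace, available from Perl 5.10) is used instead of `\s` so that the newlines themselves are left in place.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical sample text standing in for "the content of a node".
my $text = "  first line  \n\tsecond line\t\n  third  ";

# /m lets ^ and $ match at every line boundary; /g repeats for each line.
(my $trimmed = $text) =~ s/^\h+//mg;   # leading blanks on each line
$trimmed =~ s/\h+$//mg;                # trailing blanks on each line

print "$trimmed\n";
```

With input like this the pattern has to succeed many times across the string, which is a rather different workload from the single large string used in the benchmark.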