in reply to Re^2: stripped punctuation
in thread stripped punctuation

Except you want to strip punctuation from the beginning or end. The above regex only works if there is punctuation at both beginning and end.

If removing any trailing/leading punctuation is in fact your goal, what about something like:

    use strict;
    use warnings;

    my $word = 'Wilmer",';
    $word =~ s/^ \W*?         # ignore any leading punc
               ( \w .*? )     # swallow everything lazily
               (?: \W+ )? $   # ignore any trailing punc
              /$1/x;
    print $word;

Update: Mind you, at that point, a much simpler regex will likely serve you better in terms of speed and readability:

$word =~ s/(?:^\W+)|(?:\W+$)//g;
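As a quick check of the one-liner, here is a minimal sketch that runs it over a few sample words. The inputs and the strip_punct helper name are made up for illustration. It covers punctuation at the front only, at the back only, at both ends, and at neither, plus one word with punctuation in the middle:

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing non-word characters.
sub strip_punct {
    my ($word) = @_;
    $word =~ s/(?:^\W+)|(?:\W+$)//g;
    return $word;
}

# Made-up sample inputs, one per case.
for my $raw ( 'Wilmer",', '"Wilmer', '("Wilmer")', 'Wilmer', q{don't!} ) {
    printf "%-12s => %s\n", $raw, strip_punct($raw);
}
```

Punctuation inside the word (the apostrophe in the last sample) is left alone, since both alternatives are anchored to an end of the string.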

Final update - benchmark:

                   Rate     capture non_capture
    capture     16561/s          --        -28%
    non_capture 22861/s         38%          --

The second suggestion is about 30% faster on average (38% in the run shown above).
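For anyone who wants to reproduce the comparison, a sketch along these lines should do it. It assumes the core Benchmark module's cmpthese function; the sub names mirror the labels in the table, and the sample input is an assumption. This is a reconstruction, not the exact script behind the numbers above, and the rates will differ from machine to machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $input = 'Wilmer",';    # assumed sample input

# The capturing version from the first suggestion.
sub capture {
    my $word = $input;
    $word =~ s/^ \W*?         # ignore any leading punc
               ( \w .*? )     # swallow everything lazily
               (?: \W+ )? $   # ignore any trailing punc
              /$1/x;
    return $word;
}

# The simpler alternation from the update.
sub non_capture {
    my $word = $input;
    $word =~ s/(?:^\W+)|(?:\W+$)//g;
    return $word;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese( -1, { capture => \&capture, non_capture => \&non_capture } );
```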

Additionally, \w doesn't mean what you think it means.

Re^4: stripped punctuation
by thealienz1 (Pilgrim) on Oct 06, 2005 at 21:37 UTC

    I did basically your second regexp there, but in two steps. I will try yours, though. I am curious about the difference in speed between them. Of course, I am wondering what you mean by "\w doesn't mean what you think it means."

      from perlre:

      A "\w" matches a single alphanumeric character (an alphabetic character, or a decimal digit) or "_"...

      Thus your earlier use of [^\w\d] had the set of digits in it twice, which suggested to me that you thought that \w means [A-Za-z].

      [^\w\d] works, but is redundant and equivalent to \W
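      If you want to convince yourself of the equivalence, a minimal sketch like this compares the two classes character by character. I'm only checking ASCII here (code points 0 to 127), which is an assumption about the data you care about:

      ```perl
      use strict;
      use warnings;

      # Collect every ASCII code point on which the two classes disagree.
      my @differ = grep {
          my $c = chr;
          ( $c =~ /[^\w\d]/ ) xor ( $c =~ /\W/ );
      } 0 .. 127;

      print scalar(@differ), " ASCII characters treated differently\n";
      ```

      The list comes back empty, because every digit is already a member of \w.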

      Update: You asked about the speed difference between two passes and one pass:

          s/(?:^\W+)|(?:\W+$)//g;
          # versus
          s/\W+$//g; s/^\W+//g;

          # my unscientific benchmark
                         Rate single_pass    two_pass
          single_pass 15829/s          --        -11%
          two_pass    17737/s         12%          --

      Doing it in two passes seems to be about 10-15% faster.

      \w means "any alphanumeric character or underscore." So in your regex, where you have [^\w\d], it's a bit redundant. \w can be replaced by [a-zA-Z0-9_], so you're effectively writing [^a-zA-Z0-9_0-9] in the regexen above.

      Also note that since \w includes the underscore you're matching more than what you say you want.
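      A small sketch of the underscore point, using a made-up input. The [\W_] class is one possible way to treat underscores as punctuation too, not something from the regexen above:

      ```perl
      use strict;
      use warnings;

      my $word = '__Wilmer_",';    # made-up input with underscores at both ends

      # \W does not match "_", so the underscores survive.
      ( my $keeps = $word ) =~ s/^\W+|\W+$//g;

      # Adding "_" to the class strips the underscores as well.
      ( my $strips = $word ) =~ s/^[\W_]+|[\W_]+$//g;

      print "$keeps\n";     # prints __Wilmer_
      print "$strips\n";    # prints Wilmer
      ```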