Re^2: stripped punctuation

After looking at your regexp I took to simplifying my needs with:

$word =~ s/^[^\w\d]+(.*?)[^\w\d]+$/$1/;

My intention is to remove everything that is not a letter or number up to the first letter, then capture everything up to the trailing run of characters that are not letters or digits. When I look at it, it makes sense, but in my testing it does not work.

Update

It works on the simple example I gave for 'Wilmer!'. I was running the word count with a script as the input, and the odd results I was seeing came from the syntax in that script. I apologize.

Re^3: stripped punctuation by fishbot_v2 (Chaplain) on Oct 06, 2005 at 20:50 UTC
Except you want to strip punctuation from the beginning or end. The above regex only works if there is punctuation at both beginning and end. If removing any trailing/leading punctuation is in fact your goal, what about something like:

    use strict;
    use warnings;

    my $word = 'Wilmer",';

    $word =~ s/^ \W*          # ignore any leading punc
               ( \w .*? )     # swallow everything lazily
               (?: \W+ )? $   # ignore any trailing punc
              /$1/x;

    print $word;

Update: Mind you, at that point, a much simpler regex will likely serve you better in terms of speed and readability:

    $word =~ s/(?:^\W+)|(?:\W+$)//g;

Final update - benchmark:

                   Rate     capture non_capture
    capture     16561/s          --        -28%
    non_capture 22861/s         38%          --

The second suggestion is about 30% faster, on average. Additionally, `\w` doesn't mean what you think it means.
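To make the "both ends" point concrete, here is a minimal sketch comparing the parent post's pattern with the alternation form. The sub names `strip_both_required` and `strip_either` and the sample words are made up for illustration; they are not from the thread.

```perl
use strict;
use warnings;

# Pattern from the parent post: both character classes use '+',
# so it only matches when punctuation is present at BOTH ends.
sub strip_both_required {
    my ($word) = @_;
    $word =~ s/^[^\w\d]+(.*?)[^\w\d]+$/$1/;
    return $word;
}

# Alternation form: a leading run and a trailing run are each
# removed independently, so either one may be absent.
sub strip_either {
    my ($word) = @_;
    $word =~ s/(?:^\W+)|(?:\W+$)//g;
    return $word;
}

# Hypothetical sample words: punctuation on both ends, then on one end only.
for my $sample ('"Wilmer",', 'Wilmer",') {
    printf "%-10s both_required=%-10s either=%s\n",
        $sample, strip_both_required($sample), strip_either($sample);
}
```

With punctuation on only one side, the first sub leaves the word untouched because the substitution never matches, while the second still trims the trailing run.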
Re^4: stripped punctuation by thealienz1 (Pilgrim) on Oct 06, 2005 at 21:37 UTC
I did basically your second regexp there in two steps. I will try yours, though. I am curious about the difference in speed between them. Of course, I am also wondering what you mean by "\w doesn't mean what you think it means".
Re^5: stripped punctuation by fishbot_v2 (Chaplain) on Oct 06, 2005 at 21:45 UTC
From perlre:

    A "\w" matches a single alphanumeric character (an alphabetic character, or a decimal digit) or "_"...

Thus your earlier use of `[^\w\d]` had the set of digits in it twice, which suggested to me that you thought that `\w` means `[A-Za-z]`. `[^\w\d]` works, but is redundant and equivalent to `\W`.

Update: You asked about the speed difference between two passes and one pass:

    s/(?:^\W+)|(?:\W+$)//g;
    # versus
    s/\W+$//g;
    s/^\W+//g;

    # my unscientific benchmark
                   Rate single_pass    two_pass
    single_pass 15829/s          --        -11%
    two_pass    17737/s         12%          --

Doing it in two passes seems to be about 10-15% faster.
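For anyone who wants to repeat the comparison, a rough sketch using the core `Benchmark` module might look like the following. The sub names, the test word, and the iteration count are assumptions for illustration, not the setup behind the numbers quoted above, and the rates will differ from machine to machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test word with punctuation on both ends.
my $input = '"Wilmer",';

# One substitution with alternation.
sub single_pass {
    my $word = shift;
    $word =~ s/(?:^\W+)|(?:\W+$)//g;
    return $word;
}

# Two anchored substitutions, one per end.
sub two_pass {
    my $word = shift;
    $word =~ s/\W+$//;
    $word =~ s/^\W+//;
    return $word;
}

# Run each sub a fixed number of times and print a comparison table.
cmpthese(200_000, {
    single_pass => sub { single_pass($input) },
    two_pass    => sub { two_pass($input) },
});
```

The `/g` flag is dropped in `two_pass` because an anchored pattern can match at most once per string here; that is a small departure from the snippet quoted above.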
Re^5: stripped punctuation by Nkuvu (Priest) on Oct 06, 2005 at 21:50 UTC
The \w means "any alphanumeric character or underscore." So in your regex, where you have `[^\w\d]`, it's a bit redundant. \w can be replaced by `[a-zA-Z0-9_]`, so you're effectively writing `[^a-zA-Z0-9_0-9]` in the regexen above. Also note that since \w includes the underscore, you're matching more than what you say you want.
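A quick way to check both claims is sketched below, restricted to ASCII input (with Unicode strings `\w` matches considerably more than `[a-zA-Z0-9_]`). The sample word `__init__!` and the stricter character class are illustrative assumptions, not something proposed in the thread.

```perl
use strict;
use warnings;

# Compare \W with the redundant class [^\w\d] over every ASCII character.
my @differ = grep {
    my $c = chr($_);
    ($c =~ /\W/ ? 1 : 0) != ($c =~ /[^\w\d]/ ? 1 : 0)
} 0 .. 127;
print "ASCII characters where the two classes differ: ", scalar(@differ), "\n";

# The underscore counts as a word character, so \W-based stripping keeps it.
my $word = '__init__!';
(my $stripped = $word) =~ s/(?:^\W+)|(?:\W+$)//g;
print "$stripped\n";

# Stricter variant: strip anything that is not an ASCII letter or digit.
(my $strict = $word) =~ s/(?:^[^A-Za-z0-9]+)|(?:[^A-Za-z0-9]+$)//g;
print "$strict\n";
```

If underscores should count as punctuation for the word count, the explicit class in the last substitution is one option.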


There's more than one way to do things
	PerlMonks