Re: stripped punctuation

You said:

the user define what punctuation they want removed

Presumably, you would have a "default" set (e.g. whatever Perl defines as matching "\W", and maybe "_" as well), which would be suitable in most cases. Actually, a better default might be \P{L} which refers to "all non-letters" (see 'perldoc perlunicode').
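To make the difference between those two defaults concrete, here is a small sketch. The sample token is made up for illustration; the point is only that \W treats digits and underscore as "word" characters, while \P{L} does not.

```perl
use strict;
use warnings;

# Made-up token with an underscore and digits at the edges
my $token = "_route66_";

# \W leaves "_" and digits alone, so nothing is stripped here
( my $by_w = $token ) =~ s/^\W+|\W+$//g;

# \P{L} strips every non-letter at either edge
( my $by_pl = $token ) =~ s/^\P{L}+|\P{L}+$//g;

print "$by_w\n";   # _route66_
print "$by_pl\n";  # route
```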

If it's important for your application to allow the user to specify a "customized" set for some particular case, you face a variety of tricky issues:

Is it easier for the user to specify the particular non-alphanumerics that should be kept, rather than specify all the ones that should be removed? (I think maybe so.)
Many of the characters involved have special meanings in regexes (period, question-mark, plus, dollar-sign and some others). This isn't a killer, but you need to be mindful of it.
If some "non-letter" characters are to be kept when they occur at a word boundary, you might have problems when other "non-letters" (that are to be discarded) co-occur with the ones being kept.
For example, suppose that hyphen is to be kept at word boundaries, but parens at word boundaries should be removed; in a string like " word)- " it will be hard to remove the paren, because it lies "inside" the hyphen, which is being retained; that paren is "word-internal". Maybe the user just needs to specify which non-letter characters to keep when they occur next to a letter (or maybe it's more complicated than that).
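Here is a rough sketch of the "user specifies what to keep" approach, mostly to show the two pitfalls above in action. The strip_edges() sub and its "keep" argument are hypothetical names I made up for this example, not anything from your code; quotemeta takes care of characters that are special in regexes.

```perl
use strict;
use warnings;

# Hypothetical helper: strip edge characters that are neither letters
# nor members of the user-supplied "keep" set.
sub strip_edges {
    my ( $token, $keep ) = @_;
    my $safe  = quotemeta $keep;          # neutralize regex metacharacters
    my $strip = qr/[^\p{L}$safe]+/;       # non-letters that are not being kept
    ( my $word = $token ) =~ s/^$strip//; # trim the front
    $word =~ s/$strip\z//;                # trim the back
    return $word;
}

print strip_edges( "(word)-", "" ),  "\n";  # word
print strip_edges( "(word)-", "-" ), "\n";  # word)-
```

With an empty keep set the token is cleaned completely. Once hyphen is kept, the closing paren survives, because the retained hyphen shields it from the end-of-string trim. That is the co-occurrence problem described above, and a real solution would need a more careful rule than this.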

Anyway, in the default case, it really can be very simple (and this might even be the quickest):

use strict;

# make up some data
my $line = "('The text.')-- 5 o'clock! What's cookin' with the text data?";

# split on whitespace, keep only tokens that contain at least one letter
my @words = grep /\p{L}/, split ' ', $line;

my %wcount;
for ( @words ) {  # using $_ will modify @words "in-place"
    s/^\P{L}+//;  # remove initial non-letters
    s/\P{L}+$//;  # remove final non-letters
    $wcount{lc()}++;  # normalize to lower-case-only
}

print "$wcount{$_}\t$_\n" for ( sort keys %wcount );

__OUTPUT__
1       cookin
1       data
1       o'clock
2       text
2       the
1       what's
1       with

Note that even though I was using a regex symbol (\P{L}) that is documented as a "unicode" tool for regexes, I can use it on plain old ASCII data. (If you have non-ASCII data, make sure it has been decoded as utf8 before processing it -- see Encode if you have non-utf8, non-ASCII text data.)
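For the non-ASCII case, a minimal sketch of the decode step might look like this. I'm assuming, purely for illustration, that the input bytes are ISO-8859-1; substitute whatever encoding your data really uses.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: raw input bytes are ISO-8859-1 ("caf" + e-acute + "!")
my $bytes = "caf\xE9!";

# Turn the bytes into a Perl character string before using \p{L} / \P{L}
my $text = decode( 'ISO-8859-1', $bytes );

( my $word = $text ) =~ s/\P{L}+$//;   # remove final non-letters

# Encode again on the way out, here as UTF-8
print encode( 'UTF-8', $word ), "\n";
```

After decoding, the accented letter counts as a letter, so only the trailing "!" is removed.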
