stripped punctuation

thealienz1 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: stripped punctuation by Skeeve (Parson) on Oct 06, 2005 at 19:45 UTC
My basic approach would be to remove all punctuation, or better: All non-word characters and start counting then: `s/[^a-z]/ /g; @words= split ' ';` [download] Of course this does not take into account: words that are broken at a line- end ;-) foreign language characters your definition of a word. Maybe you render ab4711xya word. With this it will be 2. `$\=~s;s.;q^\|D9JYJ^^qq^\//\\\///^;ex;print`	[reply] [d/l] [select]
Re^2: stripped punctuation by Nkuvu (Priest) on Oct 06, 2005 at 21:09 UTC
This also breaks for simple, commonly used words. Like "don't." That's two words for you right there. Or "co-worker." Or... I'd suggest that the simplistic approach would be better off by splitting on white space first, then simply removing non-word characters (and I'd use the `[:alpha:]` designation as opposed to a-z).	[reply] [d/l]
Re: stripped punctuation by ambrus (Abbot) on Oct 06, 2005 at 20:58 UTC
Basically, you have to take whitespace-separated words, removing non-alphanumeric (i.e. punctation) characters from the beginning or the end. Punctation is allowed inside a word. What makes this a bit more complicated is that in some languages, the space os omitted around em dashes, so you have to consider em-dashes as whitespace too. Thus, it's easier to define what separates words than what counts as a word. Here's a simple solution. #!perl use warnings; use strict; my # I hate to say this, as it's not mine $text = q{ "'My Dear Mr. Sherlock Holmes:--"Lord Backwater tells me that I may place implicit reliance upon your judgement and discretion. I have determined, therefore, to call upon you and to consult you in referenc +e to the very painful event which has occurred in connection with my wedding. Mr. Lestrade, of Scotland Yard, is acting already in the matt +er, but he assures me that he sees no objection to your co-operation, and that he even thinks that it might be of some assistance. I will call at four o'clock in the afternoon, and, should you have any other engagement at that time, I hope that you will postpone it, as this matter is of paramount importance. Yours faithfully, ST. SIMON.' -- Sir Arthur Conan Doyle, The Adventures of Sherlock Holmes. 1892. }; use locale; my $stuff_between_words = qr{ [^[:alnum:][:space:]]* (?: [[:space:]]+ \| --+ \| ^ ) [^[:alnum:]]* }x; my @words = split $stuff_between_words, $text; shift @words; # remove the empty word from the beginning print join(" ", map "<<$_>>", @words), "\n"; __END__ [download] However, if I'd really need a fast solution, I'd use flex. Update: on second thought, it'd be better to use just `--` instead of `--+`.	[reply] [d/l] [select]
Re^2: stripped punctuation by thealienz1 (Pilgrim) on Oct 06, 2005 at 21:32 UTC
Thanks I will take a look at flex.	[reply]
Re: stripped punctuation by GrandFather (Saint) on Oct 06, 2005 at 19:42 UTC
`use strict; use warnings; my $stripped = qr/["!]/; my $word = '"Wilmer!"'; $word =~ s/^(?:$stripped)+(.*?)(?:$stripped)+$/$1/; print $word;` [download] Perl is Huffman encoded by design.	[reply] [d/l]
Re^2: stripped punctuation by thealienz1 (Pilgrim) on Oct 06, 2005 at 20:20 UTC
After looking at your regexp I took to simplifying my needs with: `$word =~ s/^[^\w\d]+(.?)[^\w\d]+$/$1/;` My intention is remove everything that is not a letter or number up to the first letter, pull everything up till the last non letter or digit. When I look at it it makes sense, but my testing it does not work. Update* It works on the simple example I gave for 'Wilmer!'. I was running word count with a script as the input and the odd results I was seeing were the syntax in the script. I apologize.	[reply] [d/l]
Re^3: stripped punctuation by fishbot_v2 (Chaplain) on Oct 06, 2005 at 20:50 UTC
Except you want to strip punctuation from the beginning or end. The above regex only works if there is punctuation at both beginning and end. If removing any trailing/leading punctuation is in fact your goal, what about something like: `use strict; use warnings; my $word = 'Wilmer",'; $word =~ s/^ \W? # ignore any leading punc ( \w .? ) # swallow everything lazily (?: \W+ )? $ # ignore any trailing punc /$1/x; print $word;` [download] Update: Mind you, at that point, a much simpler regex will likely serve you better in terms of speed and readability: `$word =~ s/(?:^\W+)\|(?:\W+$)//g;` [download] Final update - benchmark: `Rate capture non_capture capture 16561/s -- -28% non_capture 22861/s 38% --` [download] The second suggestion is about 30% faster, on average. Additionally, `\w` doesn't mean what you think it means.	[reply] [d/l] [select]
Re^4: stripped punctuation by thealienz1 (Pilgrim) on Oct 06, 2005 at 21:37 UTC
Re^5: stripped punctuation by fishbot_v2 (Chaplain) on Oct 06, 2005 at 21:45 UTC
Re^5: stripped punctuation by Nkuvu (Priest) on Oct 06, 2005 at 21:50 UTC
Re: stripped punctuation by SamCG (Hermit) on Oct 06, 2005 at 20:00 UTC
Update: Sorry, I realized after posting this is not entirely clear. I'd also strip the punctuation and then do the split on spaces, I just would probably think of the transliteration operator before the substitute (because I wouldn't necessarily be thinking regex), and I'd probably stick with it, thinking it might be faster. Does anyone know if it is? One possibility: `#! perl use strict; $_="Wilmer'?"; tr/a-zA-Z//csd; print;` [download]	[reply] [d/l]
Re: stripped punctuation by radiantmatrix (Parson) on Oct 06, 2005 at 21:17 UTC
GNU's `wc` command exists for any major platform. Simply call it from inside your Perl application and let it do the heavy lifting. <-radiant.matrix-> A collection of thoughts and links from the minds of geeks The Code that can be seen is not the true Code "In any sufficiently large group of people, most are idiots" - Kaa's Law	[reply]
Re^2: stripped punctuation by thealienz1 (Pilgrim) on Oct 06, 2005 at 21:21 UTC
This does a total word count to my understanding. I am looking for individual word counts. I did not mention that though, but thanks anyways.	[reply]
Re: stripped punctuation by graff (Chancellor) on Oct 07, 2005 at 00:52 UTC
You said: the user define what punctuation they want removed Presumably, you would have a "default" set (e.g. whatever Perl defines as matching "\W", and maybe "_" as well), which would be suitable in most cases. Actually, a better default might be `\P{L}` which refers to "all non-letters" (see 'perldoc perlunicode'). If it's important for your application to allow the user to specify a "cusomized" set for some particular case, you face a variety of tricky issues: Is it easier for the user to specify the particular non-alphanumerics that should be kept, rather than specify all the ones that should be removed? (I think maybe so.) Many of the characters involved have special meanings in regexes (period, question-mark, plus, dollar-sign and some others). This isn't a killer, but you need to be mindful of it. If some "non-letter" characters are to be kept when they occur at a word boundary, you might have problems when other "non-letters" (that are to be discarded) co-occur with the ones being kept. For example, suppose that hyphen is to be kept at word boundaries, but parens at word boundaries should be removed; in a string like " word)- " it will be hard to remove the paren, because it lies "inside" the hyphen, which is being retained; that paren is "word-internal". Maybe the user just needs to specify which non-letter characters to keep when they occur next to a letter (or maybe it's more complicated than that). Anyway, in the default case, it really can be very simple (and this might even be the quickest): use strict; # make up some data my $line = "('The text.')-- 5 o'clock! What's cookin' with the text da +ta?"; # split on whitespace, keep only tokens that contain at least on lette +r my @words = grep /\p{L}/i, split ' ', $line; my %wcount; for ( @words ) { # using $_ will modify @words "in-place" s/^\P{L}+//; # remove initial non-letters s/\P{L}+$//; # remove final non-letters $wcount{lc()}++; # normalize to lower-case-only } print "$wcount{$_}\t$_\n" for ( sort keys %wcount ); __OUTPUT__ 1 cookin 1 data 1 o'clock 2 text 2 the 1 what's 1 with [download] Note that even though I was using a regex symbol (\P{L}) that is documented as a "unicode" tool for regexes, I can use it on plain old ASCII data. (If you have non-ASCII data, make sure it's in utf8 before processing it -- see Encode if you have non-utf8, non-ASCII text data.)	[reply] [d/l] [select]
Re: stripped punctuation by thundergnat (Deacon) on Oct 07, 2005 at 01:31 UTC
Some things which often fail to get taken into account: Words with internal apostrophes. (don't won't, can't shouldn't, you'll, it's, etc.) Words with non ASCII characters. �, �, �, etc. This does: ########################################################## #! /usr/bin/perl use warnings; use strict; my $word = qr/(?<!\p{Alnum})\p{Alnum}+(?!\p{Alnum})/; my %count; while (<DATA>) { my $line = lc $_; while ($line =~ /($word('$word)?)/g){ $count{$1}++; } } printf "%15s %5d\n", $_, $count{$_} for sort keys %count; __DATA__ "Hello World!" "Oh poor Yorick, his world I knew well yes I did" "Words with internal apostrophes. (don't won't, can't shouldn't, you'l +l, it's, etc.)" "Se�or Mont�ya's res�me isn't �propos." [download]	[reply] [d/l]
Re^2: stripped punctuation by thealienz1 (Pilgrim) on Oct 07, 2005 at 05:34 UTC
The ultimate case would be to handle a word count in multiple languages, btu I believe that there are too many factors to take into account. Thanks for the suggestion though.	[reply]
Re: stripped punctuation by Moron (Curate) on Oct 07, 2005 at 16:54 UTC
Is it me or is it really as simple as: `$word =~ s/_//g; $word =~ /(\w+)/ and $word = $1;` [download] Friday night (=rapidly evaporating) minds vaguely want to know! -M Free your mind	[reply] [d/l]


"be consistent"
	PerlMonks