thealienz1 has asked for the wisdom of the Perl Monks concerning the following question:
Greetings,
I have written a basic word count script. It incoporates many options that can be used to determing a word, divied a word, remove punctuation, etc... It works, but I am afraid that its not efficient in one aspect.
As of right now, when a word is found in a document punctuation that appears at the beginning and end of the word can be removed. With the user define what punctuation they want removed in a string divided by a space.
#!/usr/bin/perl
my $stripped = '" !';
my $word = '"Wilmer!"';
foreach my $punc (split(/ / , $stripped)) {
#print "punc: $punc\n";
$word =~ s/^$punc//;
$word =~ s/$punc$//;
}
print $word;
I want to know what other monks would do. What is the most efficient method that you can think of? Or perhaps lead me into the right direction, so that I can right the code myself because it is my project and not yours.
Re: stripped punctuation
by Skeeve (Parson) on Oct 06, 2005 at 19:45 UTC
|
My basic approach would be to remove all punctuation, or better: All non-word characters and start counting then:
s/[^a-z]/ /g;
@words= split ' ';
Of course this does not take into account:
- words that are broken at a line-
end ;-)
- foreign language characters
- your definition of a word. Maybe you render ab4711xya word. With this it will be 2.
$\=~s;s*.*;q^|D9JYJ^^qq^\//\\\///^;ex;print
| [reply] [d/l] [select] |
|
| [reply] [d/l] |
Re: stripped punctuation
by ambrus (Abbot) on Oct 06, 2005 at 20:58 UTC
|
Basically, you have to take whitespace-separated words, removing non-alphanumeric (i.e. punctation) characters from the beginning or the end. Punctation is allowed inside a word. What makes this a bit more complicated is that in some languages, the space os omitted around em dashes, so you have to consider em-dashes as whitespace too. Thus, it's easier to define what separates words than what counts as a word. Here's a simple solution.
#!perl
use warnings;
use strict;
my # I hate to say this, as it's not mine
$text = q{
"'My Dear Mr. Sherlock Holmes:--"Lord Backwater tells me that I may
place implicit reliance upon your judgement and discretion. I have
determined, therefore, to call upon you and to consult you in referenc
+e
to the very painful event which has occurred in connection with my
wedding. Mr. Lestrade, of Scotland Yard, is acting already in the matt
+er,
but he assures me that he sees no objection to your co-operation,
and that he even thinks that it might be of some assistance. I will
call at four o'clock in the afternoon, and, should you have any other
engagement at that time, I hope that you will postpone it, as this
matter is of paramount importance. Yours faithfully, ST. SIMON.'
-- Sir Arthur Conan Doyle, The Adventures of Sherlock Holmes. 1892.
};
use locale;
my $stuff_between_words = qr{
[^[:alnum:][:space:]]*
(?: [[:space:]]+ | --+ | ^ )
[^[:alnum:]]*
}x;
my @words = split $stuff_between_words, $text;
shift @words; # remove the empty word from the beginning
print join(" ", map "<<$_>>", @words), "\n";
__END__
However, if I'd really need a fast solution, I'd use flex.
Update: on second thought, it'd be better to use
just -- instead of --+.
| [reply] [d/l] [select] |
|
Thanks I will take a look at flex.
| [reply] |
Re: stripped punctuation
by GrandFather (Saint) on Oct 06, 2005 at 19:42 UTC
|
use strict;
use warnings;
my $stripped = qr/["!]/;
my $word = '"Wilmer!"';
$word =~ s/^(?:$stripped)+(.*?)(?:$stripped)+$/$1/;
print $word;
Perl is Huffman encoded by design.
| [reply] [d/l] |
|
After looking at your regexp I took to simplifying my needs with:
$word =~ s/^[^\w\d]+(.*?)[^\w\d]+$/$1/;
My intention is remove everything that is not a letter or number up to the first letter, pull everything up till the last non letter or digit. When I look at it it makes sense, but my testing it does not work.
Update
It works on the simple example I gave for 'Wilmer!'. I was running word count with a script as the input and the odd results I was seeing were the syntax in the script. I apologize.
| [reply] [d/l] |
|
Except you want to strip punctuation from the beginning or end. The above regex only works if there is punctuation at both beginning and end.
If removing any trailing/leading punctuation is in fact your goal, what about something like:
use strict;
use warnings;
my $word = 'Wilmer",';
$word =~ s/^ \W*? # ignore any leading punc
( \w .*? ) # swallow everything lazily
(?: \W+ )? $ # ignore any trailing punc
/$1/x;
print $word;
Update: Mind you, at that point, a much simpler regex will likely serve you better in terms of speed and readability:
$word =~ s/(?:^\W+)|(?:\W+$)//g;
Final update - benchmark:
Rate capture non_capture
capture 16561/s -- -28%
non_capture 22861/s 38% --
The second suggestion is about 30% faster, on average.
Additionally, \w doesn't mean what you think it means.
| [reply] [d/l] [select] |
|
|
|
Re: stripped punctuation
by SamCG (Hermit) on Oct 06, 2005 at 20:00 UTC
|
Update: Sorry, I realized after posting this is not entirely clear. I'd also strip the punctuation and then do the split on spaces, I just would probably think of the transliteration operator before the substitute (because I wouldn't necessarily be thinking regex), and I'd probably stick with it, thinking it might be faster. Does anyone know if it is?
One possibility:
#! perl
use strict;
$_="Wilmer'?";
tr/a-zA-Z//csd;
print;
| [reply] [d/l] |
Re: stripped punctuation
by radiantmatrix (Parson) on Oct 06, 2005 at 21:17 UTC
|
GNU's wc command exists for any major platform. Simply call it from inside your Perl application and let it do the heavy lifting.
| [reply] |
|
This does a total word count to my understanding. I am looking for individual word counts. I did not mention that though, but thanks anyways.
| [reply] |
Re: stripped punctuation
by graff (Chancellor) on Oct 07, 2005 at 00:52 UTC
|
You said:
the user define what punctuation they want removed
Presumably, you would have a "default" set (e.g. whatever Perl defines as matching "\W", and maybe "_" as well), which would be suitable in most cases. Actually, a better default might be \P{L} which refers to "all non-letters" (see 'perldoc perlunicode').
If it's important for your application to allow the user to specify a "cusomized" set for some particular case, you face a variety of tricky issues:
- Is it easier for the user to specify the particular non-alphanumerics that should be kept, rather than specify all the ones that should be removed? (I think maybe so.)
- Many of the characters involved have special meanings in regexes (period, question-mark, plus, dollar-sign and some others). This isn't a killer, but you need to be mindful of it.
- If some "non-letter" characters are to be kept when they occur at a word boundary, you might have problems when other "non-letters" (that are to be discarded) co-occur with the ones being kept.
For example, suppose that hyphen is to be kept at word boundaries, but parens at word boundaries should be removed; in a string like " word)- " it will be hard to remove the paren, because it lies "inside" the hyphen, which is being retained; that paren is "word-internal". Maybe the user just needs to specify which non-letter characters to keep when they occur next to a letter (or maybe it's more complicated than that).
Anyway, in the default case, it really can be very simple (and this might even be the quickest):
use strict;
# make up some data
my $line = "('The text.')-- 5 o'clock! What's cookin' with the text da
+ta?";
# split on whitespace, keep only tokens that contain at least on lette
+r
my @words = grep /\p{L}/i, split ' ', $line;
my %wcount;
for ( @words ) { # using $_ will modify @words "in-place"
s/^\P{L}+//; # remove initial non-letters
s/\P{L}+$//; # remove final non-letters
$wcount{lc()}++; # normalize to lower-case-only
}
print "$wcount{$_}\t$_\n" for ( sort keys %wcount );
__OUTPUT__
1 cookin
1 data
1 o'clock
2 text
2 the
1 what's
1 with
Note that even though I was using a regex symbol (\P{L}) that is documented as a "unicode" tool for regexes, I can use it on plain old ASCII data. (If you have non-ASCII data, make sure it's in utf8 before processing it -- see Encode if you have non-utf8, non-ASCII text data.) | [reply] [d/l] [select] |
Re: stripped punctuation
by thundergnat (Deacon) on Oct 07, 2005 at 01:31 UTC
|
Some things which often fail to get taken into account:
Words with internal apostrophes. (don't won't, can't shouldn't, you'll, it's, etc.)
Words with non ASCII characters. á, ñ, ÿ, etc.
This does:
##########################################################
#! /usr/bin/perl
use warnings;
use strict;
my $word = qr/(?<!\p{Alnum})\p{Alnum}+(?!\p{Alnum})/;
my %count;
while (<DATA>) {
my $line = lc $_;
while ($line =~ /($word('$word)?)/g){
$count{$1}++;
}
}
printf "%15s %5d\n", $_, $count{$_} for sort keys %count;
__DATA__
"Hello World!"
"Oh poor Yorick, his world I knew well yes I did"
"Words with internal apostrophes. (don't won't, can't shouldn't, you'l
+l, it's, etc.)"
"Señor Montóya's resüme isn't ápropos."
| [reply] [d/l] |
|
| [reply] |
Re: stripped punctuation
by Moron (Curate) on Oct 07, 2005 at 16:54 UTC
|
Is it me or is it really as simple as: $word =~ s/_//g;
$word =~ /(\w+)/ and $word = $1;
Friday night (=rapidly evaporating) minds vaguely want to know!
| [reply] [d/l] |
|
|