True, but if I'm trying to index English text, the oddities of punctuation may make a split impractical. Unless you've got a suggestion for an iterative split?
If you just want words, why bother with punctuation at all?
Just do a massive s/\W+/ /g on the string
beforehand and you'll get a big list of words, separated by
spaces. I suppose those damn apostrophes will cause you
pain, and you want "it's" to differ from "its". It's unclear
whether capitalization matters: is "BASIC" a different
word from "basic"? What about "Smith" versus "smith"?
Anyway, I'd probably write something like this:
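local $/ = undef;    # slurp mode: read the whole file in one go
$_ = <MY_FILE>;
my %hash = ();
# Split on runs of anything that is not a word character or an apostrophe.
# grep drops the empty leading field you get when the text starts with punctuation.
# Use $hash{lc $_}++ instead if case should be ignored.
$hash{$_}++ foreach grep { length } split /[^\w']+/;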
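
For what it's worth, here is the same idea as a rough, self-contained sketch that works on a string rather than a filehandle. The name count_words and the $fold_case flag are made up for illustration; they are not from any module. It keeps apostrophes inside words so "it's" and "its" stay distinct, and lowercases only when asked to.

```perl
use strict;
use warnings;

# Hypothetical helper: count the words in $text.
# Apostrophes count as part of a word; pass a true $fold_case
# to treat "BASIC" and "basic" as the same word.
sub count_words {
    my ($text, $fold_case) = @_;
    my %count;
    for my $word (split /[^\w']+/, $text) {
        next unless length $word;        # skip the empty leading field, if any
        $word = lc $word if $fold_case;  # optional case folding
        $count{$word}++;
    }
    return \%count;
}

# Usage: print each word with its count, sorted alphabetically.
my $counts = count_words(qq{"It's odd," Smith said; its smith is BASIC, not basic.}, 1);
printf "%-6s %d\n", $_, $counts->{$_} for sort keys %$counts;
```

One caveat with this approach: single-quoted phrases like 'hello' keep their quote marks, since an apostrophe is treated as a word character wherever it appears.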
-Ted