in reply to Simple Text Indexing
Couple of tiny comments on your code:
my @words = split /\s/, $line;
Might have problems if there are multiple spaces, or a mix of spaces and tabs, so
my @words = split /\s+/, $line;
Would be better, surely? Is that what your removeNullEntries is about?
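To make the difference concrete, here is a small sketch. The sample line is made up for illustration and is not from the original code:

```perl
use strict;
use warnings;

# Hypothetical input: a tab followed by a space, then a double space.
my $line = "the quick\t brown  fox";

my @single = split /\s/,  $line;   # splits at every single whitespace character
my @multi  = split /\s+/, $line;   # treats a run of whitespace as one separator

print scalar(@single), " fields with /\\s/\n";    # 6 (two of them empty strings)
print scalar(@multi),  " fields with /\\s+/\n";   # 4
```

Note that /\s+/ still yields an empty first field if the line starts with whitespace. The special form split ' ', $line skips leading whitespace as well, if that matters for your input.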
I ended up coming up with a complex regex to get what I thought were "words" out of text, something like
/\w[\w'-]*\w|\w+/
rather than just grabbing strings separated by whitespace and trying to figure out if they're really valid words later.
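Roughly how that regex could be used to pull the words out directly, as a sketch (the sample text is invented for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample text with an apostrophe, a hyphenated word,
# stray punctuation and single-letter words.
my $line = "Don't over-think it -- 'quoted' words, a b.";

# In list context with /g, the match returns every word found.
# First alternative: starts and ends with a word character, with
# apostrophes or hyphens allowed inside. Second alternative: catches
# single-character words, which the first one cannot match.
my @words = $line =~ /\w[\w'-]*\w|\w+/g;

print join("|", @words), "\n";   # Don't|over-think|it|quoted|words|a|b
```

The surrounding quotes and the bare "--" are dropped because a match has to begin and end on a word character.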
And
my @stopList = ("the", "a", "an", "of", "and", "on", "in", "by", "with", "at", "he", "after", "into", "their", "is", "that", "they", "for", "to", "it", "them", "which");
Seems like it would be better off as a hash so you can just go if(defined($stopList{$word})).
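Something along these lines, as a sketch only. The stop list is shortened here, and the @words and @kept names are just placeholders for illustration:

```perl
use strict;
use warnings;

# Shortened stop list for the example; the real one would be the full list.
my @stopList = ("the", "a", "an", "of", "and");

# Build a lookup hash: each stop word becomes a key with a true value.
my %stopList = map { $_ => 1 } @stopList;

# Hypothetical input words.
my @words = qw(the index of a simple text);

# One hash lookup per word instead of scanning the whole array each time.
my @kept = grep { !exists $stopList{$_} } @words;

print "@kept\n";   # index simple text
```

Since every key maps to 1, defined($stopList{$word}) and exists $stopList{$word} give the same answer here; exists just makes it clearer that only the key matters.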
($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print