PerlMonks
Re: Extracting keywords from HTML
by fizbin (Chaplain) on Aug 21, 2005 at 18:11 UTC ( [id://485560]=note )
How do you intend to handle accented letters? Should "resumé" be equivalent to "resume"? Right now, as your code stands, those words are not equivalent. If they should be equivalent, you'll want to look at this node I just wrote today that squishes accented letters into their non-accented equivalents.

Also, I'd suggest some tweaks in your existing code. For example, I'd change get_stop and get_punc as follows:
Not only does this form make it easier to add new entries, it also makes the result easier to use in the rest of your code: you don't need all those calls to exists any more.
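The snippets that originally accompanied this point are not reproduced here, so what follows is only a minimal sketch of the kind of change being described. It assumes get_stop and get_punc each return a hash reference whose values are all true; the word and punctuation lists, the @tokens sample, and the variable names are placeholders, not the code from the original thread.

```perl
use strict;
use warnings;

# Hypothetical stop-word list; substitute whatever entries your code uses.
sub get_stop {
    my %stop = map { $_ => 1 } qw(a an and the of to in);
    return \%stop;
}

# Hypothetical punctuation list, built the same way.
sub get_punc {
    my %punc = map { $_ => 1 } ('.', ',', ';', ':', '!', '?');
    return \%punc;
}

my $stop = get_stop();
my $punc = get_punc();

# Made-up sample tokens for illustration.
my @tokens = ('the', 'quick', 'brown', 'fox', ',', 'and', 'the', 'dog', '.');

# Every stored value is true, so a plain lookup is enough; no exists() needed.
my @kept = grep { !$stop->{$_} && !$punc->{$_} } @tokens;

print "@kept\n";
```

Adding an entry then means adding one word to a list, and a membership test is just `$stop->{$word}`.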
Finally, your code as it stands doesn't do quite what you described. As a test, give it the data:
The fix of course is to change the regular expressions used to normalize the data:
Notice that above I also changed the structure of words_all - any given word is likely to appear several times in a file if it appears there once, and there's no need to keep a huge array with many elements repeated. You can just use keys(%{$words_all->{$word}}) to get the list of files a word appears in, and if you need to know the count, you have that too.
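To make that structure concrete, here is a small sketch of a words_all keyed first by word and then by file name, with the occurrence count as the value. The file names and word lists are invented for illustration, and the words are assumed to be normalized already; only the shape of the nested hash is taken from the description above.

```perl
use strict;
use warnings;

my $words_all = {};

# Hypothetical input: file name => list of already-normalized words.
my %files = (
    'a.html' => [qw(perl hash perl)],
    'b.html' => [qw(perl regex)],
);

for my $file (sort keys %files) {
    # Count occurrences per word per file instead of pushing onto an array.
    $words_all->{$_}{$file}++ for @{ $files{$file} };
}

my $word  = 'perl';

# The files a word appears in are just the keys of its inner hash.
my @where = sort keys %{ $words_all->{$word} };

# The count for one file is the stored value.
my $count = $words_all->{$word}{'a.html'};

print "$word appears in: @where\n";
print "count in a.html: $count\n";
```

Each word/file pair is stored once however often the word repeats, so memory grows with the number of distinct pairs rather than with the total word count.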