Noise words, important words, and so on tend to be domain-specific. On my current project, I log three things for every search:
- what was typed (e.g. "show me all the foo and bar")
- what I actually searched on (e.g. "foo bar") # we use Lingua::Stem and other tricks
- how many "hits"
This is written to a log file, and a cron job loads the results into a MySQL database for easy reporting.
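To make the logging step concrete, here is a minimal sketch in Python rather than the Perl we actually use. The names `NOISE_WORDS`, `normalize`, and `log_search` are made up for illustration, the tab-separated layout is an assumption, and the plain word filter stands in for Lingua::Stem and the other tricks, which are not shown.

```python
import csv
import time

# Hypothetical starting list; the real one is domain-specific.
NOISE_WORDS = {"show", "me", "all", "the", "and"}

def normalize(typed):
    """Lowercase, split on whitespace, and drop known noise words.

    This is a stand-in for real stemming and cleanup.
    """
    return " ".join(w for w in typed.lower().split() if w not in NOISE_WORDS)

def log_search(out, typed, hits, now=None):
    """Write one tab-separated record: timestamp, typed, searched, hits."""
    searched = normalize(typed)
    stamp = int(now if now is not None else time.time())
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([stamp, typed, searched, hits])
    return searched
```

You would call something like `log_search(logfile, typed, hit_count)` from the search handler, with `logfile` opened in append mode. A flat file keeps the search path cheap, and the database load can happen later in a batch.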
So to finally answer your question, you determine noise words by looking at what your users do. HTH
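
One way to turn those logs into a candidate list, sketched in Python under my own assumptions: count how many of the typed queries each word appears in, and flag the words that show up in a large share of them. The function name `noise_candidates` and the `min_share` threshold are hypothetical, and this is only one possible heuristic. You still need to eyeball the output, since a word that is common in your domain may also be an important one.

```python
from collections import Counter

def noise_candidates(typed_queries, min_share=0.5):
    """Return the words that appear in at least min_share of the queries."""
    total = len(typed_queries)
    if total == 0:
        return []
    doc_freq = Counter()
    for query in typed_queries:
        # set() so a word repeated within one query is counted once
        doc_freq.update(set(query.lower().split()))
    return sorted(w for w, n in doc_freq.items() if n / total >= min_share)

queries = ["show me the foo", "show me the bar", "find the baz"]
print(noise_candidates(queries, min_share=0.6))  # ['me', 'show', 'the']
```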