You could try the approach taken by an online recruitment agency that I heard
about at a trade fair. They allowed their clients to upload their résumés
to the web site and used a search engine to index the incoming documents. Since
the web site itself was accessed via the search engine, it was relatively easy
to append 'and not rude words' to the end of every query.
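
As a rough illustration of that query-rewriting trick, here is a minimal Python sketch. The `AND NOT` / `OR` syntax, the function name `add_exclusion` and the placeholder word list are all assumptions made for the example; a real search engine will have its own query language, so check its documentation before borrowing this.

```python
def add_exclusion(user_query, blocked_terms):
    """Append an 'and not <blocked terms>' clause to a user's query.

    The boolean syntax here is hypothetical; adapt it to your engine.
    """
    if not blocked_terms:
        # Nothing to exclude, so pass the query through unchanged.
        return user_query
    # Quote each term so multi-word phrases stay together,
    # and sort them so the generated query is reproducible.
    excluded = " OR ".join('"{}"'.format(t) for t in sorted(blocked_terms))
    return "({}) AND NOT ({})".format(user_query, excluded)


if __name__ == "__main__":
    # Placeholder terms standing in for a real rude word list.
    print(add_exclusion("perl developer", {"darn", "heck"}))
```

Because the rewrite happens in one place, the word list can change without touching any of the pages that issue searches.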
The search engine used in this example was big and expensive but the technique
of using a search engine in this way has a couple of interesting features that
may be useful to you:
- Search engines ship with filters to flatten numerous document formats into
a text stream. This removes the need to do the work yourself. You can concentrate
on maintaining the rude word lists.
- A good search engine should have a full search language that allows you
to search for words within documents where things like word order and frequency
matter.
- Most search engines use a weighting system that allows you to work out how
well your search fitted the resulting documents. In your case, documents that
score highly could be taken off-line until they can be moderated.
- Some search engines allow you to build stored queries that are run against
each document as it is indexed. This allows you to build up large and complex
sets of search queries that can be maintained and updated off-line.
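
To make the weighting and stored-query ideas above a little more concrete, here is a small Python sketch that does by hand what a search engine would do for you. It assumes the document has already been flattened to plain text. The names `score_document` and `needs_moderation`, the weights and the threshold are all invented for illustration, and real engines use far more sophisticated relevance scoring than a weighted count.

```python
import re


def score_document(text, weighted_terms):
    """Sum weight * occurrences for every listed term found in the text."""
    # Crude tokeniser: lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return sum(weight * counts.get(term, 0)
               for term, weight in weighted_terms.items())


def needs_moderation(text, weighted_terms, threshold=3):
    """True if the document scores at or above the (assumed) threshold."""
    return score_document(text, weighted_terms) >= threshold


if __name__ == "__main__":
    # Placeholder list: term -> weight, maintained separately from the code.
    stored_terms = {"darn": 1, "blast": 2}
    for doc in ("Darn it, blast! Darn.", "A perfectly clean resume."):
        print(doc, "->", needs_moderation(doc, stored_terms))
```

A document that trips the threshold would be held back for a human to look at rather than rejected outright, which keeps the odd false positive from doing any harm.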
Implementing a full commercial search engine that can deal with numerous data
formats may be beyond the scope of your current project, but similar techniques
can be used with some of the less expensive search engines. The following
is a useful search engine resource: http://www.searchtools.com/index.html
A surprising resource, in terms of getting a quite comprehensive rude word
list, is Viz Magazine, a funny, rude and generally irreverent UK publication
aimed at students and monks of an open mind. They publish a 'profanisaurus'
which contains over 4000 offensive words and phrases which you could probably
buy off them. I haven't put a link to Viz so that I can't be accused of peddling
smut. Fellow monks who want to read the rude words will have to make their
own choice and type Viz in to Google!
inman