Re: Content "Censorshop" : Kid friendliness
by theorbtwo (Prior) on Sep 20, 2003 at 15:09 UTC
|
As to the first and last items on your list: punt. Get somebody else to make a list, then follow it religiously, and you have somebody else to point the finger of blame at. There are lists; the most important is probably the "seven dirty words" Supreme Court ruling in the US, and there are probably similar things in other countries. See also Regexp::Common, which includes a regex for testing for them.
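Regexp::Common packages this kind of test for you; a minimal hand-rolled sketch of the same idea, using a tame placeholder word list (the real list is exactly what you punt on), might look like:

```perl
use strict;
use warnings;

# A tiny stand-in word list; a real deployment would load a vetted list
# (or just use Regexp::Common's profanity pattern) instead.
my @banned = qw(darn heck);

# Build one alternation anchored on word boundaries, so that e.g.
# "spell checker" does not trip a substring match on "heck".
my $bad_re = do {
    my $alt = join '|', map { quotemeta } @banned;
    qr/\b(?:$alt)\b/i;
};

sub is_clean {
    my ($text) = @_;
    return $text !~ $bad_re;
}

print is_clean("What the heck?") ? "clean\n" : "flagged\n";
```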
Non-plaintext, text-like things can for the most part be scanned by semi-automatic means: there are PDF parsers on CPAN, and MS Word files can be scanned normally (strings(1) will show you the text). Sometimes you will even catch cases where "bad" language was later deleted, because Word tends to append rather than overwrite, at least in "fast save" mode.
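The strings trick mentioned above is easy to approximate in Perl: pull out runs of printable characters from the raw bytes and feed those to the word filter. A sketch on an in-memory blob standing in for a binary file:

```perl
use strict;
use warnings;

# Approximate strings(1): keep runs of 4+ printable ASCII characters.
# In real use, $blob would be the raw bytes of a .doc or other binary file
# read with open my $fh, '<:raw', $file.
my $blob = "\x00\x01Hello, filter me\x00\xFFxx\x00";

my @runs;
push @runs, $1 while $blob =~ /([\x20-\x7e]{4,})/g;

# These extracted strings are what you hand to the word filter.
print "$_\n" for @runs;
```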
Unfortunately, doing images requires hardcore AI. Beyond that, it's completely impossible, because the standards themselves vary. For example, a picture of a woman's breasts is acceptable in some countries; in some it's acceptable only in a medical context; in some it's simply unacceptable. For that matter, in some places pictures of women's /faces/ are obscene. Other places, pictures of people at all are considered graven images. The last sort probably don't have computers, though, because they most likely consider them evil too.
If you want to do this generally, you have to moderate with trusted moderators. Sorry.
Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).
Absolutely right: you punt. The problem is that real content filtering is demonstrably AI-complete, which in layman's terms means computers aren't smart enough to do it. Keyword filtering falls flat on its face: you end up filtering out stuff you don't want to filter out and leaving in stuff that's obviously obscene. Nevertheless, this is the kind of filtering you want to do, because all the other kinds are worse. If possible, involve a human in the process, if only by requiring that any post containing medium-class "gray" words be approved by a moderator before it becomes publicly viewable. Some words you can get away with banning entirely, but others (e.g., nipple) are very problematic: ban them and you end up blocking conversation about baby bottles and engine mechanics. These you probably want to greylist and pass through a moderator. Also be aware that no matter what words you block, people who want to make sexual innuendo will do so; the only way to fix that is to pass *everything* through a human moderator.
If you can't punt to human moderators, then wheedle, cajole, trick, or coerce someone else into giving you a list of words to block. It's an impossible task to get the list right, and you DO NOT want the responsibility for the list to rest on you. But if you can punt in realtime to human moderator(s), that's better. Blacklist the big bad four-letter words whose only meaning involves excrement or intercourse, and put any other nasty words on a greylist that flags the post for a moderator to examine and approve or disapprove. That way you don't end up blocking conversation about cancer, Mr. Gephardt, and so on. And if you can get the mods to also look at other posts, even if only to spot-check them from time to time, do so, because you WILL have a few idiots who think it's their job to find any dirty words or phrases that your filters ignore and use them in every post.
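The blacklist/greylist split above can be sketched in a few lines. The word lists here are hypothetical placeholders; the real ones come from whoever you punted the list-making to:

```perl
use strict;
use warnings;

my @blacklist = qw(darn heck);          # reject outright
my @greylist  = qw(nipple breast);      # hold for a moderator

# Classify a post: blacklist hit => reject, greylist hit => queue
# for a human, otherwise publish immediately.
sub classify_post {
    my ($text) = @_;
    for my $w (@blacklist) {
        return 'reject' if $text =~ /\b\Q$w\E\b/i;
    }
    for my $w (@greylist) {
        return 'moderate' if $text =~ /\b\Q$w\E\b/i;
    }
    return 'publish';
}

print classify_post("a post about baby bottle nipple flow rates"), "\n";
```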
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
No doubt a computer could filter out a list of certain words, but I can't imagine how any site with public postings can be made "family friendly" without a human monitoring each photo or post. No doubt whatever standard you uphold, you will offend someone. I would wager that if you check back in a couple of decades, a computer will be able to filter some photos.
Also, the most offensive posts often contain no foul language: one can easily imagine very sexually explicit writing that uses only ordinary, non-sexual words. Likewise, to take an example from Mark Twain, I believe many people would find the following poem extremely offensive, while others would consider it educational for youngsters. Yet how could a computer determine it was offensive when humans can't agree? Perhaps only by filtering any reference to God (since any content mentioning God will offend some sect?)
http://www.lone-star.net/mall/literature/warpray.htm
Re: Content "Censorshop" : Kid friendliness
by tachyon (Chancellor) on Sep 20, 2003 at 16:22 UTC
You could do a lot worse than to consider some variation on the XP approach used here, i.e. encourage the userbase to self-moderate and admonish the unbelievers who dare stray from the one true way. Perhaps a click-and-rate system where anything that goes heavily negative gets a quick review from you. Alternatively, a user who gets lots of -- on their stuff could be disabled/deleted/spanked vigorously.
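The click-and-rate idea reduces to a couple of thresholds. A sketch, with hypothetical cutoff values you would tune to taste:

```perl
use strict;
use warnings;

# Hypothetical thresholds for an XP-style self-moderation scheme.
my $REVIEW_AT  = -5;     # a post this negative gets flagged for admin review
my $DISABLE_AT = -20;    # a user whose total sinks this low gets disabled

# Decide what to do with a single post given its net rating.
sub post_action {
    my ($score) = @_;
    return $score <= $REVIEW_AT ? 'review' : 'leave';
}

# Decide what to do with a user given all their post ratings.
sub user_action {
    my (@post_scores) = @_;
    my $total = 0;
    $total += $_ for @post_scores;
    return $total <= $DISABLE_AT ? 'disable' : 'ok';
}

print post_action(-7), "\n";            # a heavily downvoted post
print user_action(-9, -8, -6), "\n";    # a serially downvoted user
```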
We do web filtration and have plenty of lists of offensive words. Drop me a line privately and I will email you a zip. The porn word list is too offensive to post publicly, as well as being quite long. cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: Content "Censorshop" : Kid friendliness
by herveus (Prior) on Sep 20, 2003 at 17:14 UTC
Howdy!
Man, you could be sooo screwed.
1. What *exactly* qualifies as "family and kid friendly"? This one is tougher to answer than it at first appears.
That's an understatement. My take is that the people who are most adamant about "family and kid friendly" have a particular axe to grind that is far more restrictive than one might expect from a broad survey. Who is defining "family and kid friendly"? For your own sake, you need a usable definition to cover your butt.
3. How can non-text items like .jpg, .doc, .pdf, etc. be checked, if they can in fact be checked in some automated fashion. This just seems doubtful to me in my current state of ignorance.
Automated checking is a "hard problem". I like the suggestion of community evaluation. It then becomes a direct application of "community standards" in the most useful sense.
4. Are there online "bad-word" lists? If so, what about other languages and localization? what about slang? innuendo?
Community policing will address this using the local standards instead of some arbitrary, external "standard" (and I'm using scare quotes there).
Good luck with this. Perhaps you could do something with the Everything Engine.
yours,
Michael
Being sooo screwed is what I am trying to avoid ;-) and it does seem to be one of those confounded issues where, no matter what you do, someone will find fault with it.
Also, it seems I may not have been clear on the source of the requirement. It's a requirement that I am placing on the project myself, not one imposed by someone else, at least not explicitly; it's based more on what I believe my users would expect. Ultimately, I want to be able to legitimately and in good conscience market the product/service as "family and kid friendly".
...you need a usable definition to cover your butt.
Indeed, and being new to publishing something like this, that is exactly what I am trying to do, but I'm not sure exactly where to start. I'm certainly no prude, and as a parent I know what I do and do not want my kids exposed to at any given age... but that's just me. I have friends on either end of the spectrum from my position.
So, my ultimate goal would be to satisfy those within 1-2 standard deviations of some norm, or possibly just not worry too much about the outlying points on some distribution of values. Of course this raises the question: what is the NORM? In short, I don't know, and I don't think I can know for all (probably not even most) "communities".
So, all this really does lend a lot of credence to using the community policing/XP model. Oddly enough this had not occurred to me prior to reading the responses, so many thanks. However, it would seem that I could be leaving myself open if there were not some "minimal" standard imposed first, with the "community" then taking it further as it deems necessary. Perhaps filtering on explicit foul/obscene language and then letting the users flag other things for removal via community policing.
My main concern with the pure community policing model is latency. How long will something be there and available to kids before someone tags it and gets it removed?
I also like the suggestion of a trusted moderator. I think that would be a must for the community model. Otherwise someone on an extreme end could overly impose their view by "burning every book in the library" in a manner of speaking.
Thanks for all the help so far; I definitely have my work cut out for me.
BTW: "Everything Engine", I had not heard of it. I found the "Everything Development Engine" at everydevel.com, which looks like it could be quite useful down the road; thanks. Right now, I just have my hands full defining the requirements and definitions for the project.
Re: Content "Censorshop" : Kid friendliness
by adrianh (Chancellor) on Sep 20, 2003 at 19:48 UTC
My advice would be to forget about getting a computer to do the monitoring. You need to hire somebody to act as an editor / monitor / moderator. Seriously.
I was involved with running a game-based web site for a large-cereal-producing-company for a couple of years. While some basic filters can help you, there is no avoiding having a human involved. We had somebody spend about a quarter of their working week monitoring this site, and it had few areas where generic text could be input.
Languages can be a real problem. We had a user running high in our site league for a couple of weeks whose user name was, if I recall correctly, the Swedish slang for... well... something not suitable for the under 14s the site was aimed at :-)
Rather than spend a lot of effort trying to prevent stuff from getting up in the first place, it may be more effective to have mechanisms to remove unsuitable content quickly.
The users are your best friend. Have a prominent area for feedback that goes to a real human who deals with problems quickly. Do not have a mailbox that somebody scans once a week; somebody should be looking in there every few hours. Do this religiously. Nothing will annoy a parent more than their complaint going ignored for a couple of days.
Have a clear statement on the site of what you consider suitable behaviour, and that action will be taken to remove unsuitable content.
Oh yes. Find a lawyer. Talk to them about what you have to legally do if you're dealing with minors. Data protection laws, etc. are often different. Talk to them about the legal responsibilities if you're moderating content. Allowing moderation can, in some circumstances, make you legally responsible for the things that are said on the site (as ever IANAL).
Thanks, this all sounds like excellent advice. The moderator seems like a must; however, it will probably have to be myself and my partners in crime who deal with it at first. Just a $ issue.
The users are your best friend
Excellent point. I need to get out of the mode of ONLY thinking about how some user can abuse the program/system.
Allowing moderation can, in some circumstances, make you legally responsible for the things that are said on the site
Hmmm, I could see that being the case, and it's somewhat unnerving. However, does this imply the opposite if you do not moderate? Yeah, I know: get a lawyer. ;-)
Re: Content "Censorshop" : Kid friendliness
by artist (Parson) on Sep 20, 2003 at 16:10 UTC
With current technology and the state of human minds, you cannot filter out 100%. So I would suggest involving your users in the process as well:
- Automatic (via programs, filters)
- Manual
- User voting (via programs)
- User comments (not to be displayed on site)
{artist}
Re: Content "Censorshop" : Kid friendliness
by naChoZ (Curate) on Sep 20, 2003 at 17:46 UTC
Re: Content "Censorshop" : Kid friendliness
by parasew (Beadle) on Sep 21, 2003 at 04:00 UTC
Maybe you want to take a look at GIFT; it could assist you with some keyword retrieval from images.
The GIFT (the GNU Image-Finding Tool) is a Content Based Image Retrieval System (CBIRS). It enables you to do Query By Example on images, giving you the opportunity to improve query results by relevance feedback. For processing your queries the program relies entirely on the content of the images, freeing you from the need to annotate all images before querying the collection.
Re: Content "Censorshop" : Kid friendliness
by inman (Curate) on Sep 22, 2003 at 16:41 UTC
You could try the approach taken by an online recruitment agency that I heard about at a trade fair. They allowed their clients to upload their résumés to the web site and used a search engine to index the incoming documents. Since the web site itself was accessed via the search engine, it was relatively easy to append 'and not rude words' to the end of every query.
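That query-rewriting trick is a one-liner in spirit. A sketch, with a hypothetical two-word rude list and a generic boolean query syntax standing in for whatever the real engine accepts:

```perl
use strict;
use warnings;

# Hypothetical rude-word list; the real one is long.
my @rude = qw(darn heck);

# Rewrite an incoming query so the engine itself suppresses any
# document containing a rude word, before results reach the user.
sub sanitize_query {
    my ($query) = @_;
    my $not_clause = join ' ', map { "AND NOT $_" } @rude;
    return "($query) $not_clause";
}

print sanitize_query("perl programmer resume"), "\n";
```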
The search engine used in this example was big and expensive, but the technique of using a search engine in this way has a couple of interesting features that may be useful to you:
- Search engines ship with filters to flatten numerous document formats into a text stream. This removes the need to do the work yourself; you can concentrate on maintaining the rude-word lists.
- A good search engine should have a full search language that allows you to search for words within documents where things like word order and frequency matter.
- Most search engines use a weighting system that lets you work out how well your search fitted the resulting documents. In your case, documents that score highly could be taken offline until they can be moderated.
- Some search engines allow you to build stored queries that are compared against documents as they are indexed. This lets you build and maintain large, complex sets of search queries that can be updated offline.
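The weighting idea in the list above can be mimicked crudely without a search engine at all: score each document by rude-word hits per hundred words and quarantine anything over a threshold. A sketch with a placeholder word list:

```perl
use strict;
use warnings;

# Hypothetical rude-word list; the real one is long.
my @rude = qw(darn heck);

# A crude relevance-style score: rude-word hits per hundred words.
sub rudeness_score {
    my ($doc) = @_;
    my @words = split /\s+/, $doc;
    return 0 unless @words;
    my $hits = 0;
    for my $w (@rude) {
        $hits++ while $doc =~ /\b\Q$w\E\b/gi;
    }
    return 100 * $hits / @words;
}

# Documents scoring above a (hypothetical) threshold would go offline
# until a moderator has looked at them.
printf "%.1f\n", rudeness_score("oh heck oh darn what a day");
```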
Implementing a full commercial search engine that can deal with numerous data formats may be beyond the scope of your current project, but similar techniques can be employed with some of the less expensive search engines. The following is a useful search-engine resource: http://www.searchtools.com/index.html
A surprising resource for a quite comprehensive rude-word list is Viz magazine, a funny, rude and generally irreverent UK publication aimed at students and monks of an open mind. They publish a 'Profanisaurus' containing over 4000 offensive words and phrases, which you could probably buy off them. I haven't put in a link to Viz so that I can't be accused of peddling smut; if my fellow monks want to read the rude words, you'll have to make your own choice and type Viz into Google!
inman
Re: Content "Censorshop" : Kid friendliness
by johndageek (Hermit) on Sep 22, 2003 at 14:28 UTC
Wow! Have you got your work cut out for you.
Just my 2 cents worth:
1) Gather a broad sample of pages, from basic text similar to dictionary definitions (ranging from words like apple through basic obscenities) to pages with pop-ups and standard graphical web pages.
2) Have potential member/audience parents rate the pages from 1 (suitable for all ages) through 18 (must be at least this old to look at this trash), and comment on why a page is not fit for a younger age group.
This will give you a basis (LOL) to see what might be suitable for each age group. Now comes the fun part: enforcement. Many words have an acceptable use and meaning in the correct context (ignore graphics for a moment); in another context they are totally unacceptable.
If you are rating sites, how do you prevent or react to changes in content? Once you override your automated systems to allow a site because certain words are used in an acceptable context, the page is allowed regardless of subsequent content changes.
An interesting thing I have run across is that filters stopping "bad word sites" will catch a bad word, then block the whole site. Here is where ads/pop-ups can be interesting. Say the word sex is banned for our discussion here. We go to the site and do our work (today's ad is for beer, not a forbidden word). The next time we go to the site, the ad has a check box, sex: M/F (a banned word), and the site does not come up. If the site is added to a blocked list, it will never be usable again until it is cleared by a moderator; if it is scanned each time, processing cost goes up, but the user is frustrated because they can use the site only sometimes, and are left wondering why.
Good luck
dageek
Re: Content "Censorshop" : Kid friendliness
by nimdokk (Vicar) on Sep 24, 2003 at 13:21 UTC
My suggestion would be to use a combination of scripts that can check for possibly offensive language and people who can review things as well. What is offensive to one might not be offensive to another. You might take a look at how the userfriendly.org message boards are handled: there are moderators who are regular users but have been around long enough to have a good feel for the particular community. They use a standard along the lines of "would you want your (hypothetical) 14-year-old sister to read/hear this?" It seems to work reasonably well (or did when I was regularly posting there). You might want to start with a very restrictive policy and then ease up on it as the community develops. Also, I'd say let the community that develops decide how best to police itself.
Just my .02 cents
"Ex libris un peut de tout" | [reply] |