joealba has asked for the wisdom of the Perl Monks concerning the following question:

Hi all! Recently I found some older pages on my site which had outdated links. But, these links weren't going to 404 pages -- they were going to porn pages! Their domain names had lapsed, and some naughty sites snapped them up. It is a Bad Thing (tm) when your state newspaper's site has links to porn.

So, I'm writing a link checker which will scan all the pages on the site for external links, download each page with LWP, and attempt to scan for inappropriate stuff. I can then make it output a list of questionable pages which can be checked by a human.

To start, I'll probably use TheDamian's Regexp::Common $RE{profanity}. Then, I'll try to scan for other words/phrases commonly associated with naughty sites - qr{\b(xxx|porn|warez|sheep)\b}.

Does anyone else have some good ideas on how to make this program a little more robust in its search, without returning too many misses?
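
A minimal sketch of that plan, assuming the external links have already been gathered into @urls (something like HTML::LinkExtor could handle that part); the word list and output format here are only placeholders:

    #!/usr/bin/perl -w
    use strict;

    use LWP::UserAgent;
    use HTTP::Request;
    use Regexp::Common qw(profanity);

    # illustrative word list only
    my $naughty = qr{\b(xxx|porn|warez|sheep)\b}i;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(15);

    my @urls = @ARGV;    # external links gathered from the site's pages

    for my $url (@urls) {
        my $res = $ua->request(HTTP::Request->new(GET => $url));

        unless ($res->is_success) {
            print "BROKEN\t$url\t", $res->status_line, "\n";
            next;
        }

        my $content = $res->content;
        if ($content =~ /$RE{profanity}/ or $content =~ /$naughty/) {
            print "SUSPECT\t$url\n";
        }
    }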

Replies are listed 'Best First'.
(Ovid) Re: Checking external links for inappropriate content
by Ovid (Cardinal) on Feb 14, 2002 at 22:41 UTC

    joealba asked:

    Does anyone else have some good ideas on how to make this program a little more robust in its search, without returning too many misses?

    Yes. Forward all of the images to me and I'll let you know if they are inappropriate.

    Now, surprising as it may seem, I don't know a lot about the online porn industry (yet another area of future research, I suppose), but I suspect they will probably redirect from the innocuous names to the suspect ones. Thus, you'll probably want to check for redirects. If you're using LWP::Simple, be careful. For example:

    perl -MLWP::Simple -e "getprint(q|http://www.ovidinexile.com/|)"

    The above code will print out HTML for a frameset. However, if you use Rex Swain's HTTP viewer, you discover that you are redirected to my real home page. I think a redirect should definitely be something you want to flag, even if the Russian words on the new site don't trigger your regexes :)
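
    A minimal sketch of one way to flag these, assuming LWP::UserAgent: simple_request() issues a single request without following redirects, so the 3xx response itself can be inspected.

    use LWP::UserAgent;
    use HTTP::Request;

    my $ua  = LWP::UserAgent->new;
    my $req = HTTP::Request->new(GET => 'http://www.ovidinexile.com/');

    # simple_request() does not follow redirects, unlike request()
    my $res = $ua->simple_request($req);

    if ($res->is_redirect) {
        print "REDIRECT: ", $req->uri, " -> ", $res->header('Location'), "\n";
    }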

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

      NICE! I like the idea of returning every link that results in a redirect. Thanks, Ovid!

      So I should set up an HTTP::Request, pass it to an LWP::UserAgent, and check the $response->{_rc} response code? Or is there an easier way?
        Something like this may help...
        if ($res->is_success) {
            # Normal retrieval of content stuff
        }
        else {
            # check for redirects
            if ($res->code() =~ /30[12]/) {    # redirect codes (301 permanent, 302 temporary)
                # grab the location
                my $remote_cgi = $res->header('Location');
                {
                    # Some servers erroneously return a relative URL for redirects,
                    # so make it absolute if it is not already.
                    local $URI::ABS_ALLOW_RELATIVE_SCHEME = 1;
                    my $base = $res->base;
                    $remote_cgi = $HTTP::URI_CLASS->new($remote_cgi, $base)->abs($base);
                }
            }
            else {
                # Request failed normally, broken link
            }
        }
        Where $res is your HTTP::Response object

        ---If it doesn't fit use a bigger hammer
Re: Checking external links for inappropriate content
by Malach (Scribe) on Feb 14, 2002 at 22:51 UTC

    Note: I'm not saying that you're wrong. At all.

    I'd be inclined to take a different tack on this.

    Write a script to check the last modified date/time on the external links, and if a page has changed since the last check, add it to a list for a human to review.

    Of course, there are issues with getting the last-modified time accurately, but I imagine they're more solvable than parsing for content.

    Perhaps each page has a certain string you can check for to make sure it's unchanged?
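
    A rough sketch of that idea: use the Last-Modified header when the server provides one, and fall back to a checksum of the page body otherwise. The %seen hash of previous fingerprints and the @urls list are assumptions here; loading and saving them is left out.

    use LWP::UserAgent;
    use HTTP::Request;
    use Digest::MD5 qw(md5_hex);

    my $ua = LWP::UserAgent->new;

    # Returns the Last-Modified time if the server provides one,
    # otherwise an MD5 checksum of the page body.
    sub page_fingerprint {
        my $url = shift;
        my $res = $ua->request(HTTP::Request->new(GET => $url));
        return unless $res->is_success;
        return $res->last_modified || md5_hex($res->content);
    }

    my %seen;            # url => fingerprint from the previous run (load/save omitted)
    my @urls = @ARGV;    # external links to watch

    for my $url (@urls) {
        my $fp = page_fingerprint($url) or next;
        print "CHANGED\t$url\n"
            if !exists $seen{$url} or $seen{$url} ne $fp;
        $seen{$url} = $fp;
    }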

    Hope the different viewpoint helps.

    Malach
    So, this baby seal walks into a club.....

      That's another good idea, but most (if not all) of our external links will be updated quite often. We just don't have the manpower to check every link every time it is updated.

      Besides... why have people do the work that a few well-planned regexps can do? :) Thanks, though!
        Well... you could have your script check only those pages that have changed... and, as a safety net, check pages that haven't reported changes on a less frequent basis... so your script doesn't run forever.

                        - Ant
                        - Some of my best work - (1 2 3)

Re: Checking external links for inappropriate content
by rjray (Chaplain) on Feb 14, 2002 at 22:45 UTC

    It sounds like you have thought out the problem pretty well. You already make reference to using the modules and elements I'd normally recommend. And I'm glad to hear that this isn't a case of knee-jerk reactionism, that you plan on having the "hits" reviewed by human eyes before axing them.

    I'd say charge forward with the plan you have, and let us know if you run into any difficulty code-wise.

    Plus, I'm tempted to give you a ++ vote just for the sheep reference in your "naughty sites" RE :-).

    --rjray

Re: Checking external links for inappropriate content
by little (Curate) on Feb 14, 2002 at 22:49 UTC
    What about firing up a very simple script on a machine behind a squid proxy?
    The proxy can (and will, if set to do so) deny all access to such pages (hackers, porn, warez, etc.) with a nice big red "ACCESS DENIED" image page. But hey, it blocks doubleclick.com as well :-)
    Anyhow, could it be simpler?
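
    A small sketch of that setup, assuming a Squid proxy at proxy.example.com:3128 that answers blocked requests with a 403 Forbidden (the host, port, and denial status are assumptions about your own configuration):

    use LWP::UserAgent;
    use HTTP::Request;

    my $ua = LWP::UserAgent->new;
    $ua->proxy('http', 'http://proxy.example.com:3128/');

    # @urls: the external links gathered earlier
    for my $url (@urls) {
        my $res = $ua->request(HTTP::Request->new(GET => $url));
        print "BLOCKED BY PROXY\t$url\n" if $res->code == 403;
    }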

    Have a nice day
    All decision is left to your taste