mull has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a little function that reads a file and extracts quotes to add to a pool of quotes that might be randomly placed at the bottom of a web page that I maintain for a small club. I have discovered that some of the quotes are a little 'off-color', that is, they could be offensive, and this page is hosted on a student webserver. I can only imagine the uproar if the wrong person read the wrong quote, and took it out of context...

So anyway, I wrote a very simple little regex that skips the most inflamitory quotes by doing alternation:

foreach (<FH>){ if(/^[0000-9999](.+)/){ next if /smurf|fork/; push( @witty_quotes, substr( $_, 5) ); } else { next }; }
This works fine, but I am looking to learn a lesson here about a better way to do things... What I would like to be able to do is have a list of "bad words", and write a regex that would skip to the next if the string contained any one of the list members. It seems like I'm going about it the wrong way here, since I have to go in and add |foo|bar|baz... to the match every time I spot another word someone doesn't like.

So I am looking for a good way to test to see if a string contains a match for any member of a list.

I have been trying some different things on my own, but I would really like to see how others might go about this problem. Thanks!

Replies are listed 'Best First'.
Re: test if a string contains a list member
by DrManhattan (Chaplain) on Oct 20, 2001 at 22:13 UTC
    You can build your regex on the fly:
    my @bad = qw(fork smurf); my $regex = join('|', @bad); $regex = qr/$regex/; foreach (<FH>){ if(/^[0000-9999](.+)/){ next if /$regex/; push( @witty_quotes, substr( $_, 5) ); } else { next }; }
    Some other notes: <list>
  • Your quote is already stored in $1 after the first pattern match, so you don't need the substr().
  • The else { next }; after the if() is redundant.
  • /^[0000-9999](.+)/ doesn't do what you think it does. :) [0000-9999] is the same as [0-9].
  • </list> You can rewrite your code like so:
    use strict; my @witty_quotes; my @bad = qw(fork smurf); my $regex = join('|', @bad); $regex = qr/$regex/; # open() FH somewhere foreach (<FH>){ # Note: it's important to put this regex test # outside of the if() block to ensure that $1 # below comes from the correct pattern match next if /$regex/; if(/^\d{4}(.+)/){ push( @witty_quotes, $1); } }

    -Matt

      Thanks again for all the help!

      I went over my script in light of all the excellent comments, and now I have something I think is much nicer.

      my @offensive_words = qw( smurf fork rake ); my $bad = join('|', @offensive_words); $bad = qr($bad); ... open FH, $filename; while(<FH>){ /($bad)/oi and push( @rude_quotes, $1) and next; /^\d{4}-(.+)/os and push( @witty_quotes, $1 ); }

      So now I get what I wanted, and also nice array of rude quotes which is sure to come in handy...
      Thanks again!

        While not a point that is particularly relevent in your situation it should be kept in mind that this approach has limitations. It doesnt scale that well because of the way the regex engine works and the simple conversion of the banned list to a regex would have problems with various regex reserved characters, 'SH|T' would blow it for instance.

        A more sophisticated approach might be to keep a hash of banned words with associated hand written regexes to match them. On the fly you could either match against each in turn, maximizing the optimizations available to the regex engine. Or more simply cat them all together as you are doing here, but at least you would have the certainty of knowing the regex fragment used would be correct (as you can make it)

        Again I relise this might be too much for this particular situation, but its worth considering, you'd be suprised where bugs from this type of approach show up. The other day I was playing with HTML::TableExtract that uses a very similar mechanism to scan for table column headers. It failed very oddly when a parenthesis or | was in the header name. Oddly enough that it took me a while to track down... ;-)

        Yves
        --
        You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

Re: test if a string contains a list member
by Anonymous Monk on Oct 20, 2001 at 21:33 UTC
    my @badwords = qw(smurf fork larf); my $bwregex = join('|', @badwords); ... next if(/$bwregex/o);

      Depending on the needs, this method is a bit too strict, as it will reject even compound words that contain a offensive word as a subword.

      A way that could also be worked around, but which would create less false positives, would be to make sure each offensive word is used as a word alone :

      my $badwordsRE = "\\b" . join( "\\b|\\b", @badwords ) . "\\b"; reject if /$badwordsRE/o;

      My advice to the original poster is to get a list of offensive words from the web master administrator or some other person responsible for that thing and use that list as a ban list - so he is clean with them. If there is no such list, then I wouldn't censor at all.

      perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
        It might be a good idea to use the /i modifier as well - it's OK to block 'smurf' and 'fork', but what about 'SMURF'...

        Cheers,

        JJ

Re: test if a string contains a list member
by mull (Monk) on Oct 20, 2001 at 21:55 UTC
    Thanks; that helped point me in the right direction. :)