ilovechristy has asked for the wisdom of the Perl Monks concerning the following question:

Wise Perl People, I am trying to remove spam from a txt file which looks like the following:
    AUTHOR: bobMonk
    TITLE: Monk Software Chart
    STATUS: Publish
    CATEGORY: Monks
    DATE: 05/21/2006 11:36:16 AM
    -----
    COMMENT:
    AUTHOR: Joe Legitimate Man
    EMAIL: Joe@legitdomain.com
    This is a legitimate comment.
    -----
    COMMENT:
    AUTHOR: more spam casino stuff
    EMAIL: marengo@greenfield.com
    Blabla, more spam. poker blabla
    -----
    COMMENT:
    AUTHOR: Joe Legitimate Man
    EMAIL: Joe@legitdomain.com
    This is a legitimate comment.
    -----
My goal is to remove the spam comments using regex filters. Comments start from the term "COMMENT:" and end on "-----". I have tried various approaches to this, but none have been successful. Any assistance would be much appreciated. Thank you all.
Buddha bless you.

Re: Delete all "records" which contain a regex match
by ikegami (Patriarch) on Jun 05, 2006 at 16:14 UTC
    How can you tell which comment is spam and which one isn't? I made up a few filters in the following code.
    local $/ = '-----';
    while (<DATA>) {
        # Spam if author contains "casino".
        next if /^AUTHOR:.*casino/msi;
        # Spam if body contains "poker".
        next if /^(?!AUTHOR|TITLE|COMMENT|DATE|CATEGORY).*poker/msi;
        # Spam if anything contains "viagra".
        next if /viagra/msi;
        print;
    }
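A self-contained sketch of the same record-at-a-time idea, for anyone who wants to try it without a `__DATA__` section. The helper name `filter_records`, the sample text, and the choice of `"-----\n"` (rather than `'-----'`) as the record separator are assumptions made for illustration, not part of the reply above. It also leaves any record that does not start with `COMMENT:` alone, so the post header is never dropped by a filter.

```perl
use strict;
use warnings;

# Hypothetical helper: returns $text with every "-----"-terminated COMMENT
# record removed if it matches any of the given spam patterns.
sub filter_records {
    my ($text, @spam) = @_;

    # Read from the string through an in-memory filehandle so that $/ applies.
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = "-----\n";    # assumption: each record ends with a line of five dashes

    my $out = '';
    RECORD: while (my $rec = <$fh>) {
        if ($rec =~ /^COMMENT:/) {            # only comments are candidates
            for my $re (@spam) {
                next RECORD if $rec =~ $re;   # drop the whole record
            }
        }
        $out .= $rec;
    }
    close $fh;
    return $out;
}

# Made-up sample in the same layout as the question.
my $sample = join "\n",
    'AUTHOR: bobMonk', 'TITLE: Monk Software Chart', '-----',
    'COMMENT:', 'AUTHOR: Joe Legitimate Man', 'This is a legitimate comment.', '-----',
    'COMMENT:', 'AUTHOR: more spam casino stuff', 'Blabla, more spam. poker blabla', '-----',
    '';

print filter_records($sample, qr/^AUTHOR:.*casino/mi, qr/poker/i);
```

Because the patterns here use `/m` without `/s`, `.*` stays on the `AUTHOR:` line, so "casino" elsewhere in the comment does not trigger the author filter.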
Re: Delete all "records" which contain a regex match
by liverpole (Monsignor) on Jun 05, 2006 at 15:46 UTC
    Hi ilovechristy,

    You say "I have tried various approaches to this."  Can you show us the code you've tried, so we'll know where you're getting stuck?  That way, if it's a specific bug, it'll be easier to find and fix, and if it's just a matter of not knowing what to do at a certain point, we can help you get past that particular hurdle.


    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
      My apologies. Here are two methods that I have tried that read in a file from stdin:

      Attempt 1:
      #!/usr/bin/perl -w
      undef $/;
      while (<>) {
          s/^COMMENT:\s+AUTHOR: .*?poker.*?^-----$//gsm;
          print;
      }
      Attempt 2 (completely nonsensical):
      #!/usr/bin/perl -w
      #undef $/;
      while (<>) {
          next unless (@foo = /^COMMENT:$/ ... /^-----$/);
          print if ! /poker/;
          # $_ =~ /poker/;
          # print $_, "\n";
          print @foo;
      }
      Buddha bless you.

        Hi, ilovechristy, here is one way to do it. Also have a look at perlre.

        use strict;
        use warnings;
        local $/;
        my $data = <DATA>;
        $data =~ s/COMMENT:(?:(?!-----).)*-----//gs;
        print $data;

        output:
        -------
        AUTHOR: bobMonk
        TITLE: Monk Software Chart
        STATUS: Publish
        CATEGORY: Monks
        DATE: 05/21/2006 11:36:16 AM
        -----

        Prasad
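As its output shows, the substitution above removes every comment, spam or not. A possible variation, sketched here under assumptions, keeps the same "match up to the next `-----`" idea but deletes a comment only when it matches a spam pattern. The name `strip_spam_comments` and the sample text are hypothetical. Bounding each match at the next `-----` line also avoids the problem in Attempt 1, where `.*?poker` under `/s` can run past a `-----` line and swallow a legitimate comment that precedes a spam one.

```perl
use strict;
use warnings;

# Hypothetical helper: removes each COMMENT: ... ----- block that matches
# $spam_re and leaves every other block untouched.
sub strip_spam_comments {
    my ($text, $spam_re) = @_;

    # (?:(?!^-----$).)* consumes characters only while we are not standing
    # at the start of a "-----" line, so one match never spans two comments.
    $text =~ s{(^COMMENT:(?:(?!^-----$).)*^-----\n?)}
              { my $comment = $1; $comment =~ $spam_re ? '' : $comment }gmse;

    return $text;
}

# Made-up sample in the same layout as the question.
my $sample = join "\n",
    'AUTHOR: bobMonk', 'TITLE: Monk Software Chart', '-----',
    'COMMENT:', 'AUTHOR: Joe Legitimate Man', 'This is a legitimate comment.', '-----',
    'COMMENT:', 'AUTHOR: more spam casino stuff', 'Blabla, more spam. poker blabla', '-----',
    'COMMENT:', 'AUTHOR: Joe Legitimate Man', 'This is a legitimate comment.', '-----',
    '';

print strip_spam_comments($sample, qr/poker|casino/i);
```

The captured block is copied into `$comment` before the inner match so that the replacement does not depend on `$1` surviving a second regex.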