regex help or pointer to module needed

by Xxaxx (Monk)
on Jun 07, 2004 at 19:52 UTC ( [id://362089]=perlquestion )

Xxaxx has asked for the wisdom of the Perl Monks concerning the following question:

As part of an anti-spam filter I am using a content filter. There is one type of content that only occurs in spam that I would like to use as a trigger for a quick trip to the bit bucket.
asked me about a n<KRRAXH>ew home purcha</SZLNG>se.
I have been working with various regexes trying to detect this string. Unfortunately everything I've come up with also triggers on some legitimate HTML and legit XML.

Anyone know of an obvious regex I've missed or perhaps an existing module that can handle this type of expression without false positives?

Thanks.
p.s. I did not want to muddy the waters by posting my unsuccessful code. If anyone needs to know how not to do this right, then I'm your man. :-)

Replies are listed 'Best First'.
Re: regex help or pointer to module needed
by Corion (Patriarch) on Jun 07, 2004 at 20:17 UTC

    I'm using HTML::Tagset and HTML::PullParser to attack this problem. HTML::Tagset contains a list of valid tags for HTML 3.2 and, I believe, HTML 4 as well. HTML::PullParser is part of the HTML::Parser distribution; its API isn't callback-based, instead it hands us one token at a time.

    My code is as follows :

    sub check_bogus_html_tags {
        # now check for bogus tags:
        my ($body) = @_;
        my $reason = "";
        use HTML::Tagset;
        use HTML::PullParser;
        my $p = HTML::PullParser->new(
            doc   => \$body,
            start => '"S", tagname',
            end   => '"E", tagname',
        );
        my %seen;
        while (my $token = $p->get_token()) {
            my ($start, $tag) = @$token;
            # remember every tag name that HTML::Tagset does not know
            $seen{$tag}++ unless ($HTML::Tagset::isKnown{$tag});
        };
        $reason = "Bogus tags " . join(" ", sort keys %seen) . "\n"
            if (scalar keys %seen > 10);
        # return the reason string (empty if nothing looked suspicious)
        return $reason;
    };

    Use it as follows:

    # decode the possibly encoded body, either
    # from MIME-multipart message or from message body
    $body = unpack_mail_body($mail);

    # body is HTML
    # Check the HTML for bad dtds etc.
    # (count the matches in list context; a scalar-context match
    # only yields true/false and could never exceed 5)
    my $dtd_count = () = $body =~ m#<\s*!\s*[a-z]{1,5}\s*>#mg;
    $part_reason .= "wrong inline dtd\n" if $dtd_count > 5;
    $part_reason .= check_bogus_html_tags($body);

    That should be all of it :-)

      Thanks Corion,

      That is a set of new modules for me. I will look into using them on this problem. I'll let you know if it works out for me.

Re: regex help or pointer to module needed
by Abigail-II (Bishop) on Jun 07, 2004 at 20:52 UTC
    asked me about a n<KRRAXH>ew home purcha</SZLNG>se.
    I have been working with various regex trying to detect this string.

    Uhm, either the answer is trivial, m!asked me about a n<KRRAXH>ew home purcha</SZLNG>se\.! will do, or you want us to guess what it is you really want to match based on a single, trivial example, and no description. Such games don't amuse me.

    Abigail

      Yeah, wouldn't life be grand if they were using the exact same string each and every time.

      Unfortunately that is not the case.

      The length in characters of the "tags" is variable.
      The character base seems to be a..zA..Z0-9 for the most part.
      The order of characters is random. Sometimes they look to be random dictionary words.

      The location of the insert within words is random. The same spam from the same company on the same day is actually unique on each sending.

      One email will have:

      asked me about a ne<jkdwe>w home pur</mFKEWEk>chase
      another will have:
      aske<DFIkdjfd>d me about a new ho</Dklje>me purchase
      and as in the example given first:
      asked me about a n<KRRAXH>ew home purcha</SZLNG>se
      The spam generator seems to take a standard message then insert made-up tags at random.

      I am guessing they are using <string> ... </anotherstring> pairs to avoid an existing filter.

      In actual fact, anything that will match the sentence given without also including legit html and legit xml will probably work. That is anything other than matching the exact phrase as given.

      In all honesty it hadn't occurred to me that someone would think I meant the phrase as given. I'll be more explicit next time.

      If you think further and more complete examples would be helpful I can send some along.

      This is very elusive spam. Especially since it is coming from hacked computers -- hence the return smtp is legit, the ISPs are not on any blacklist so the email envelope is of no help. At least until the poor schmuck whose computer was owned is blacklisted or blocked.
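
Going by the description above (made-up tags without attributes, built from letters and digits, dropped into the middle of words), one possible check is to count such tags whose name is not a known HTML tag. This is only a sketch under those assumptions: `%known` is a deliberately tiny, hypothetical whitelist (HTML::Tagset's `%isKnown` would be a fuller source), and `count_midword_bogus_tags` is a name invented here, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical, deliberately small whitelist of tag names.
# A real filter would want a complete list.
my %known = map { $_ => 1 }
    qw(a b i u p br div span font img table tr td html head body);

# Count attribute-less tags with an unknown name that sit directly
# between two letters, i.e. that split a word in two.
sub count_midword_bogus_tags {
    my ($text) = @_;
    my $count = 0;
    while ($text =~ m{ (?<=[A-Za-z]) < /? ([A-Za-z0-9]+) > (?=[A-Za-z]) }xg) {
        $count++ unless $known{ lc $1 };
    }
    return $count;
}

print count_midword_bogus_tags(
    'asked me about a n<KRRAXH>ew home purcha</SZLNG>se.'), "\n";
```

Each of the three sample sentences quoted above contains two such tags, while a link like `<a href="page.html">Link Text</a>` contains none, because its opening tag carries an attribute and its closing tag does not sit inside a word. How high the count must be before a message is dropped is a tuning decision the samples alone do not settle.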

Re: regex help or pointer to module needed
by BrowserUk (Patriarch) on Jun 07, 2004 at 20:25 UTC

    One example is hardly enough to guess what it is about that sample that makes it definitively spam.

    If the determining pattern is that an open tag is followed by text that is followed by a non-matching close tag, then something like this might work.

    m[ < ( [^>]+ ) > .*? </ (?! \1> ) ]x

    There are probably many ways that this could be improved, but it would require more samples to decide how.
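
To see what the pattern above does, here is a small sketch that runs it over two of the spam samples from this thread and over one correctly paired tag. The helper name `has_mismatched_close` and the third sample string are made up for illustration.

```perl
use strict;
use warnings;

# The pattern from the post above: an opening tag, some text, then a
# closing tag whose name is not the captured opening tag.
my $mismatch = qr[ < ( [^>]+ ) > .*? </ (?! \1> ) ]x;

sub has_mismatched_close {
    my ($text) = @_;
    return $text =~ $mismatch ? 1 : 0;
}

my @samples = (
    'asked me about a n<KRRAXH>ew home purcha</SZLNG>se.',
    'aske<DFIkdjfd>d me about a new ho</Dklje>me purchase',
    'a <b>bold</b> word',
);

for my $sample (@samples) {
    printf "%-55s %s\n", $sample,
        has_mismatched_close($sample) ? 'match' : 'no match';
}
```

The two spam lines match and the single `<b>...</b>` pair does not. Note that the capture group holds everything between `<` and `>`, attributes included, which matters for tags that carry attributes.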


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      Good suggestion.

      Unfortunately I believe it will match on

      <a href="page.html">Link Text</a>
      I tried expanding this prior to seeking help here with something like:
      m[ < ( [^>\s]+ ) > .*? </ (?! \1> ) ]x
      I hoped the no-space condition would solve things. Unfortunately eBay and Amazon send emails that were caught.

      Still all in all I think this expression along with a white list may be the direction I go for speed.

      Good suggestion.
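
A rough sketch of the "expression along with a white list" idea might look like the following. The thread does not say what the whitelist would contain, so keying it on the sender's domain is purely an assumption, as are the names `%whitelisted_domain` and `looks_like_tag_spam` and the example domains.

```perl
use strict;
use warnings;

# Hypothetical whitelist keyed on sender domain (an assumption; the
# thread does not specify what would be whitelisted).
my %whitelisted_domain = map { $_ => 1 } qw(ebay.com amazon.com);

# The no-whitespace variant of the mismatched-tag pattern quoted above.
my $mismatched_tags = qr[ < ( [^>\s]+ ) > .*? </ (?! \1> ) ]x;

sub looks_like_tag_spam {
    my ($sender, $body) = @_;

    # Skip the regex entirely for whitelisted senders.
    my ($domain) = lc($sender) =~ /\@([a-z0-9.-]+)/;
    return 0 if defined $domain && $whitelisted_domain{$domain};

    return $body =~ $mismatched_tags ? 1 : 0;
}

print looks_like_tag_spam('someone@example.net',
    'asked me about a n<KRRAXH>ew home purcha</SZLNG>se.'), "\n";
```

Checking the whitelist first keeps the common case cheap, which fits the stated goal of speed. Since sender addresses are easy to forge, a real filter would probably need a sturdier whitelist key than this.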

Node Type: perlquestion [id://362089]
Approved by Corion