analyzing spam patterns

LanX has asked for the wisdom of the Perl Monks concerning the following question:

Let's say the following template is used to generate spam messages:

 {I have|I've} been {surfing|browsing} online more than {three|3|2|4} hours today, yet I never found any interesting article like yours. {It's|Itis} pretty worth enough for me. {In my opinion|Personally|In my view}, if all {webmasters|site owners|website owners|web owners} and bloggers made good content as you did, the {internet|net|web} will be {much more|a lot more}useful than ever before.|
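For illustration, here is a minimal sketch of how a template in this `{a|b|c}` notation could be expanded into one random variant. The helper names `pick` and `spin` are made up for this example, and the sketch assumes the braces are never nested; how the actual spam tool works is unknown.

```perl
use strict;
use warnings;

# Pick one alternative at random from an "a|b|c" string.
sub pick {
    my @alts = split /\|/, $_[0], -1;
    return '' unless @alts;
    return $alts[ int rand @alts ];
}

# Replace every {a|b|c} group with one randomly chosen alternative.
# Assumption: groups are not nested.
sub spin {
    my ($text) = @_;
    $text =~ s/\{([^{}]*)\}/pick($1)/ge;
    return $text;
}

# Shortened version of the first sentence of the template.
my $template = "{I have|I've} been {surfing|browsing} online "
             . "more than {three|3|2|4} hours today";
print spin($template), "\n" for 1 .. 3;
```

Each run prints three variants that differ only inside the braced groups, which is exactly the fixed-text-plus-holes structure the regex is supposed to recover.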

And you have already identified a set of generated alternatives (e.g. posts that were considered spam and reaped).

What would be the best approach to reverse-engineer the original pattern into a regex like this one, which matches the spam:

 .*? been .*? online more than .*? hours today, yet I never found any interesting article like yours. .*? pretty worth enough for me. Yadda yadda ...

My best guess so far is to use a Bayesian filter on words to identify text units which are close enough to belong to the same spam template.
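As a rough sketch of the grouping step, the snippet below scores how close two messages are. Note that it swaps in plain Jaccard word overlap instead of a Bayesian filter, purely because it is short. The names `word_set` and `jaccard` and any cut-off you apply to the score are assumptions of this example.

```perl
use strict;
use warnings;

# Set of lower-cased words in a message, as a hash reference.
sub word_set {
    my ($text) = @_;
    my %seen = map { ( lc($_) => 1 ) } $text =~ /[\w']+/g;
    return \%seen;
}

# Jaccard similarity: shared words divided by all distinct words (0 to 1).
sub jaccard {
    my ($x, $y) = @_;
    my ($p, $q) = ( word_set($x), word_set($y) );
    my $common = grep { $q->{$_} } keys %$p;
    my $union  = scalar(keys %$p) + scalar(keys %$q) - $common;
    return $union ? $common / $union : 0;
}

my $known = "I have been surfing online more than three hours today";
my $new   = "I've been browsing online more than 3 hours today";
printf "similarity: %.2f\n", jaccard($known, $new);
```

Messages whose score against a known spam post exceeds some threshold would be treated as candidates for the same template; the threshold has to be tuned on real data.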

In the next step I would try to analyze those from left to right to find common sequences of words and to fill in the wildcards.
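One way to sketch this step is shown below, under stated assumptions: samples are split on whitespace, the words common to all samples are found with a word-level longest common subsequence (a stand-in for the left-to-right scan described above, not necessarily the same thing), and `.*?` is inserted wherever some sample has extra words between two common words. The function names `lcs` and `template_regex` are hypothetical.

```perl
use strict;
use warnings;

# Longest common subsequence of two word lists (dynamic programming).
sub lcs {
    my ($x, $y) = @_;
    my @len = map { [ (0) x (scalar(@$y) + 1) ] } 0 .. scalar(@$x);
    for my $i (reverse 0 .. $#$x) {
        for my $j (reverse 0 .. $#$y) {
            my ($down, $right) = ( $len[$i + 1][$j], $len[$i][$j + 1] );
            $len[$i][$j] = $x->[$i] eq $y->[$j] ? $len[$i + 1][$j + 1] + 1
                         : $down > $right       ? $down
                         :                        $right;
        }
    }
    my ($i, $j, @out) = (0, 0);
    while ($i < @$x && $j < @$y) {
        if    ($x->[$i] eq $y->[$j])                 { push @out, $x->[$i]; $i++; $j++ }
        elsif ($len[$i + 1][$j] >= $len[$i][$j + 1]) { $i++ }
        else                                         { $j++ }
    }
    return \@out;
}

# Build an unanchored regex from samples assumed to share one template.
sub template_regex {
    my @samples = map { [ split ' ', $_ ] } @_;

    # Words present in every sample, in order.
    my $skeleton = $samples[0];
    $skeleton = lcs($skeleton, $_) for @samples[1 .. $#samples];
    return unless @$skeleton;

    # $gap[$k] is set when some sample has other words between
    # skeleton word $k and skeleton word $k + 1.
    my @gap = (0) x scalar(@$skeleton);
    for my $words (@samples) {
        my ($k, $prev) = (0, undef);
        for my $pos (0 .. $#$words) {
            last if $k > $#$skeleton;
            next unless $words->[$pos] eq $skeleton->[$k];
            $gap[$k - 1] = 1 if $k > 0 && $pos != $prev + 1;
            ($prev, $k) = ($pos, $k + 1);
        }
    }

    my $re = quotemeta $skeleton->[0];
    for my $k (1 .. $#$skeleton) {
        $re .= ($gap[$k - 1] ? '.*?' : '\s+') . quotemeta $skeleton->[$k];
    }
    return qr/$re/s;
}

my $re = template_regex(
    "I have been surfing online more than three hours today",
    "I've been browsing online more than 3 hours today",
    "I have been browsing online more than 2 hours today",
);
print "derived: $re\n";
print "matches unseen variant\n"
    if "I've been surfing online more than 4 hours today" =~ $re;
```

With few samples the result is too loose or too strict: an alternative that happens to be the same in every sample ends up as literal text, so the regex only gets closer to the real template as more variants are collected.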

Any better ideas?

Cheers rolf

(addicted to the Perl Programming Language and ☆☆☆☆ :)


Replies are listed 'Best First'.
Re: analyzing spam patterns by Corion (Patriarch) on Oct 04, 2014 at 11:54 UTC
There are Template::Extract and Template::Reverse, which employ some heuristics to find the common and the templated parts given a corpus of text. Personally, I haven't used either, so I can't say how good their results are.
Re: analyzing spam patterns by Anonymous Monk on Oct 04, 2014 at 11:05 UTC
There are at least two modules which spit out a regex given a list of strings ... w a i t ... see Generate a single "or regex" from given strings.
Re^2: analyzing spam patterns by Anonymous Monk on Oct 04, 2014 at 11:44 UTC
On second thought, LanX seems to have asked for something else: fill-in-the-regexp?