Re: Most of the email spam I get is:

Let's see...

Character sets I can't read count as undecipherable, right? 306MB and counting. 166MB of that is GB2312 alone. (This is since August 27, 2002.) The various ks_c_ charsets between them account for another 60MB.
Stuff that either doesn't specify a charset or is theoretically in a character set I can read (mainly UTF-8, which I unfortunately can't filter because some people in the open-source community write English messages in it in preference to ASCII or Latin-1, for no discernible reason), but whose subject line contains either long strings of non-alphanumeric characters or nothing but alphanumeric characters, probably also counts as undecipherable. Another 141MB since September 2003, when I wrote the rule. A handful of these have long strings of punctuation in the subject, but most of them are Unicode messages written in a non-Latin writing system.
That virus from a while back, "See the attached file for details", 235MB.
Assorted miscellany my filters didn't catch, 166MB (between 2004 April 23 and December 6; I start a new bin for this periodically so I can calculate the impact per-day and see how much it's increasing).
I did get one CPAN bug report once... for some reason I filed that under nnml:perl.* rather than under nnml:spam.*, go figure.
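
For anyone who wants to try the two rules described above (filter on the declared charset, then on the subject line), here is a minimal sketch in Python. It is not the actual filter configuration from this post. The bin names, the run-length threshold of 8, and the function name `classify` are all assumptions made for illustration, and only the "long strings of non-alphanumeric characters" half of the subject rule is modelled.

```python
import re
from email import message_from_string

# Charsets named in the post; matching by prefix is an assumption.
UNREADABLE_PREFIXES = ("gb2312", "ks_c_")

# Hypothetical threshold: 8 or more consecutive characters that are
# neither ASCII letters, ASCII digits, nor whitespace.
MIN_RUN = 8
NON_ALNUM_RUN = re.compile(r"[^A-Za-z0-9\s]{%d,}" % MIN_RUN)


def classify(raw_message):
    """Return a (hypothetical) bin name for one raw message."""
    msg = message_from_string(raw_message)

    # get_content_charset() returns the lower-cased charset parameter
    # of Content-Type, or None when no charset is declared.
    charset = msg.get_content_charset()
    if charset and charset.startswith(UNREADABLE_PREFIXES):
        return "spam.charset"

    # No charset, or a readable one such as UTF-8: look at the subject.
    subject = msg.get("Subject", "")
    if NON_ALNUM_RUN.search(subject):
        return "spam.subject"

    return "inbox"


if __name__ == "__main__":
    sample = (
        "Content-Type: text/plain; charset=GB2312\n"
        "Subject: hello\n"
        "\n"
        "body\n"
    )
    print(classify(sample))
```

The charset test runs first because it is cheap and cannot misfire on an English message; the subject test only sees messages that passed it, which matches the order the two rules are described in.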

The unfiltered stuff (which lands in my inbox and gets shifted manually) is what annoys me most, and I'm continually looking for ways to reduce it, without getting false positives. (My experiments with Bayesian filtering were a wash; after training ifile on my entire very large corpus of mail, I found that I had to continually go through the whole spam bin for false positives. With the system I use now, I don't go through the filtered ones, only the unfiltered ones that land in my inbox.)

Some of the kinds of spam that land in my inbox include the following:

Messages with an enigmatic or vague subject line (that looks like a Markov chain or random dictionary words) and no content -- absolutely nothing in the body at all, no HTML part, no attachment, no nothing. I seem to get a fair amount of this, and I'm confused as to what possible reason the spammers could have for sending it.
419s. I haven't found a solid way to detect them (without false positives) yet.
Phony giveaways
Adverts for warez
Pornography
Adverts for medical products that do not, in fact, exist: ways to reverse the aging process, cures for cancer, and the like
Spam written in Latin characters, but in a language I don't read. Spanish predominates in this category, but I've seen German, French, and I think Italian. If I get any Portuguese, I probably mistake it for Spanish.
Spam written using non-Latin characters (but without specifying the charset as such, either because it's not specified at all or because it's unicode) that slips past the filter rule for non-alphanumeric subject lines by throwing in alphanumeric characters in a few spots.
Various prescription meds adverts that slip past my filtering rules. Most of them seem to slip past, even though I've tried to be clever with my regular expressions. I write stuff like "^Subject.*[Vv].?[Ii1l|].?[Aa@].?[Gg].?[Rr].?[Aa@]" but they still find other ways to say it and slip past. I think they use lookalike Unicode characters. Did I mention that Unicode is a plague and a nuisance? Yeah.
Sundry other nonsense and junk.
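
To illustrate the lookalike problem mentioned in the list above, here is a small Python sketch that runs the quoted regular expression against a few subject lines. The pattern is the one from the post, unchanged; the sample subjects and the helper name `caught` are made up for the demonstration. The last sample replaces both Latin "a" characters with the Cyrillic letter U+0430, which looks the same on screen but is not in the `[Aa@]` class.

```python
import re

# The pattern quoted in the post, copied as-is.
PATTERN = re.compile(r"^Subject.*[Vv].?[Ii1l|].?[Aa@].?[Gg].?[Rr].?[Aa@]")


def caught(header_line):
    """True if the header line matches the filter pattern."""
    return PATTERN.search(header_line) is not None


# Hypothetical subject lines, not taken from real mail.
SAMPLES = [
    "Subject: Viagra",
    "Subject: V1@gra",
    "Subject: V.i.a.g.r.a",
    "Subject: Vi\u0430gr\u0430",  # Cyrillic U+0430 in place of Latin 'a'
]

if __name__ == "__main__":
    for line in SAMPLES:
        # ascii() keeps the output readable on terminals without Unicode.
        print(caught(line), ascii(line))
```

The optional `.?` between letters absorbs one inserted character, so dotted and digit-substituted spellings are caught, but a substituted letter from another script defeats the character class itself. Extending each class with known lookalikes is possible, though the set of candidates is large.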

However, even the stuff that gets filtered is a significant annoyance, because of the bandwidth it uses. I'm on 33.6 dialup here, so retrieving my mail takes a few minutes; when most of what I'm retrieving is unsolicited bulkmail, it's annoying to have to wait for that.
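
As a rough worked example of the bandwidth cost, the sketch below takes the figure quoted earlier in this post for the unfiltered bin (166MB between 2004 April 23 and December 6) and estimates the daily volume and the daily download time on a 33.6 kbit/s line. Several things here are assumptions: that 1MB means 1,000,000 bytes, that the line runs at its full nominal rate, and that protocol overhead and modem compression can be ignored. The function names are invented for the example.

```python
from datetime import date

# Figures quoted in the post.
BIN_START = date(2004, 4, 23)
BIN_END = date(2004, 12, 6)
BIN_SIZE_MB = 166
LINE_BITS_PER_SECOND = 33_600

# Assumption: decimal megabytes.
BYTES_PER_MB = 1_000_000


def mb_per_day(size_mb, start, end):
    """Average daily volume of a bin collected between two dates."""
    days = (end - start).days
    return size_mb / days


def seconds_to_download(size_mb, bits_per_second):
    """Idealised transfer time, ignoring overhead and compression."""
    return size_mb * BYTES_PER_MB * 8 / bits_per_second


if __name__ == "__main__":
    daily = mb_per_day(BIN_SIZE_MB, BIN_START, BIN_END)
    secs = seconds_to_download(daily, LINE_BITS_PER_SECOND)
    print(f"{daily:.2f} MB/day, about {secs / 60:.1f} minutes/day on dialup")
```

Under those assumptions this one bin works out to roughly 0.73 MB and close to three minutes of download time per day, before counting any of the other bins.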

Re^2: Most of the email spam I get is: by hostyle (Scribe) on Jan 03, 2005 at 12:54 UTC
> Messages with an enigmatic or vague subject line (that looks like a Markov chain or random dictionary words) and no content -- absolutely nothing in the body at all, no HTML part, no attachment, no nothing. I seem to get a fair amount of this, and I'm confused as to what possible reason the spammers could have for sending it.

Testing whether it's a valid email address? If it doesn't bounce, your email address gets added to the "alive" list.
Re^3: Most of the email spam I get is: by meredith (Friar) on Jan 04, 2005 at 22:20 UTC
I disagree. They can't reliably get information on what addresses work from the transport mechanisms. The Mail Exchanger (MX) for any given domain may simply be a relay, and unable to tell the remote host if the/a recipient is invalid. If your MX is able to give that information, or is a relay that can do so by using LDAP lookups, I'd be surprised if the spambot actually cared about recording the status of that particular e-mail address (a lead, if you want to make it sound nice).

Now, in the case that you have a relay, every message will get an OK status when the spambot delivers the message. When the message gets to a host that can say if the recipient is invalid, the relay that was connected to that host will make the "bounce" message -- I'll say "DSN" here. DSNs are sent to the envelope sender of the message. There's a very slim chance that the envelope sender of a spam message goes to some mailbox that tracks the status of leads. That would make blocking spam messages much easier for us Good Guys. Most of the time, they will use an invalid user at a valid domain. Sometimes, the user is valid. That's called a Joe Job, and the user or domain will start receiving thousands of DSNs for messages that they never sent. Not fun at all.

I think that in this case, it's simply a mistake on the spammer's part. That sort of thing is rather common -- most often, I see messages that have a bunch of tokens that are meant to be substituted before the message goes out, but aren't. I've seen some other stupid ones before, too.

`mhoward - at - hattmoward.org`
Re^2: Most of the email spam I get is: by MarkusLaker (Beadle) on Jan 05, 2005 at 00:47 UTC
> My experiments with Bayesian filtering were a wash; after training ifile on my entire very large corpus of mail, I found that I had to continually go through the whole spam bin for false positives.

I did the same thing when I first came to Bayesian filtering, but that's not the way to get the best results out of it. Filtering is more accurate if you simply correct its mistakes as they occur than if you preload it with an existing corpus. There's much more information about Bayesian filtering at Paul Graham's site.

Markus
Re: Most of the email spam I get is: by jonadab (Parson) on Jan 05, 2005 at 22:14 UTC
> Filtering is more accurate if you simply correct its mistakes as they occur

If I have to correct false positives as they occur, this so-called "filtering" is no good to me at all, because it means I have to go through all the spam. Worse than useless. My existing filtering system is significantly better, because I am confident that 100.000% of everything filtered into the spam folders is, in fact, worthless junk. Additionally, most of my legitimate mail is filtered into various spam-free folders based on topic, list, sender or whatever. The only mail I have to sort by hand is the stuff that lands in my inbox (because none of my filters pick it up). I don't want to correct my filter's errors continually. If I have to do that, it's not doing its job at ALL; I would be doing 100% of the filter's job, then.

