DejaSpam - SpamAssassin accelerator

Introduction

SpamAssassin is a very nice collection of Perl modules that analyze mail headers and bodies for spam.

The only caveat: its rather complex collection of tests can be costly to run.

It probably won't hurt you if you get around 1000 emails per day, but it definitely will at volumes on the order of 50 k to 100 k mails per day. (If you break the 100 k mail limit, you can ask for rsync access to prominent blocking lists such as Spamhaus or URIBL.)

The idea

Even with spam personalisation taken into account, the same spam will most probably be sent to you several times, to different accounts. (Most spammers are too dumb to generate sensible personalisations.)

A test on a one-day spam corpus (60 k spams) revealed that 50% of the spams are duplicates if one considers only the Subject, X- and MIME headers and the mail body.
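
As a rough illustration of such a duplicate check, here is a minimal Python sketch (not the C code from dejaspam.c). It assumes that "MIME headers" means MIME-Version and Content-*, and the names keep_header and spam_digest are made up for this example. Recipient-specific headers such as To, Received or Message-ID are simply left out of the digest, so two copies of one spam sent to different accounts end up with the same value.

```python
import hashlib
from email import message_from_bytes


def keep_header(name: str) -> bool:
    """Assumed selection: Subject, X-* and MIME (MIME-Version, Content-*) headers."""
    n = name.lower()
    return n == "subject" or n == "mime-version" or n.startswith(("x-", "content-"))


def spam_digest(raw: bytes) -> str:
    """Hex MD5 over the selected headers plus the raw body."""
    raw = raw.replace(b"\r\n", b"\n")      # normalise line endings
    _, _, body = raw.partition(b"\n\n")    # body starts after the first blank line
    md5 = hashlib.md5()
    for name, value in message_from_bytes(raw).items():
        if keep_header(name):
            md5.update(f"{name.lower()}:{value}\n".encode("utf-8", "replace"))
    md5.update(body)
    return md5.hexdigest()
```

Counting how often a digest repeats across a corpus gives a duplicate ratio of the kind quoted above.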

What we want to do: never spend the effort on the same spam twice. Ever.

That means: once SpamAssassin decides that a particular email is spam, it will be considered spam forever.

With a few heuristics, as used within the DCC network, one can easily raise that ratio by several percent, but that also raises the risk of banning legitimate email forever. The DCC heuristics, for instance, consider only 4 KB of each email. Some day eBay emails will most probably share the same first 4 KB before the real information starts, I'm sure. Since phishing mails often pretend to come from eBay, you will get into trouble someday if you try to be too clever about generating fuzzy checksums. :)
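
To make that worry concrete, here is a toy Python sketch. It is not DCC's real checksum algorithm, only a stand-in that hashes the first 4 KB of a message, and the message contents are invented. Two mails that share a long identical template collide under the prefix digest, while a digest over the whole message keeps them apart.

```python
import hashlib

PREFIX_BYTES = 4096  # the "only 4 KB" window described above


def prefix_digest(raw: bytes) -> str:
    """Toy fuzzy checksum: MD5 over the first 4 KB only (hypothetical stand-in)."""
    return hashlib.md5(raw[:PREFIX_BYTES]).hexdigest()


def full_digest(raw: bytes) -> str:
    """Exact checksum: MD5 over the whole message."""
    return hashlib.md5(raw).hexdigest()


# Invented example: a shared 4 KB template followed by different content.
template = b"<!-- shared header and styling -->\n".ljust(PREFIX_BYTES, b" ")
legit = template + b"Your auction has ended.\n"
phish = template + b"Please confirm your password.\n"

print(prefix_digest(legit) == prefix_digest(phish))  # True: the prefix hides the difference
print(full_digest(legit) == full_digest(phish))      # False: the full digest tells them apart
```

If the phishing mail is remembered as spam under the prefix digest, the legitimate mail is banned along with it.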

The point here is to accelerate SpamAssassin, not to replace it!

So we stick to an MD5 sum calculation and remember the digests of spam in a Berkeley DB file.
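
The resulting flow might look roughly like the following Python sketch. It is not the code in dejaspamd.c: Python's standard dbm module stands in for Berkeley DB, the digest covers the whole message for brevity, and is_spam, db_path and classify are names invented for this example. The classify callback represents the expensive SpamAssassin run.

```python
import dbm
import hashlib
from typing import Callable


def is_spam(raw: bytes, db_path: str, classify: Callable[[bytes], bool]) -> bool:
    """Return True for spam, asking the expensive classifier only for unseen digests."""
    key = hashlib.md5(raw).hexdigest().encode("ascii")
    with dbm.open(db_path, "c") as db:   # "c": create the file if it is missing
        if key in db:
            return True                  # seen before: spam forever, no rescan
        if classify(raw):
            db[key] = b"1"               # remember the digest of this spam
            return True
    return False                         # ham is not cached and is scanned every time
```

Only spam digests are stored, so a legitimate mail is never short-circuited by the cache; it always goes through the full scan.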

Download

You can find the source files here:

dejaspam.c : dejaspam client
dejaspamd.c : dejaspam server
Makefile : Makefile

Author

Peter Schlaile - March 2008