Only caveat: it's rather complex collection of tests can
A test on one day's spam corpus (60,000 spams) revealed that 50% of the spams are duplicates if one considers only the Subject, X- and MIME headers and the mail body.
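To make the duplicate test concrete, here is a minimal sketch in Python (not the C code of dejaspam itself). It hashes only the Subject, X- and MIME headers plus the body, so two copies of a spam that differ in Received or Message-ID lines get the same digest. The names `is_relevant` and `spam_digest` are made up for illustration, and treating `MIME-Version` and `Content-*` as "the MIME headers" is an assumption.

```python
import hashlib
import re
from email import message_from_bytes


def is_relevant(name: str) -> bool:
    """Assumed header selection: Subject, X-*, MIME-Version and Content-*."""
    lname = name.lower()
    return (
        lname in ("subject", "mime-version")
        or lname.startswith(("x-", "content-"))
    )


def spam_digest(raw: bytes) -> str:
    """Return an MD5 hex digest over the selected headers and the raw body."""
    msg = message_from_bytes(raw)
    md5 = hashlib.md5()
    for name, value in msg.items():
        if is_relevant(name):
            line = f"{name.lower()}:{value}\n"
            md5.update(line.encode("utf-8", "replace"))
    # The body is everything after the first blank line of the raw message.
    parts = re.split(rb"\r?\n\r?\n", raw, maxsplit=1)
    body = parts[1] if len(parts) == 2 else b""
    md5.update(body)
    return md5.hexdigest()
```

Headers such as Received, Message-ID or To never enter the digest, which is what lets otherwise identical spams collapse into one entry.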
What we want to do: never spend the effort on the same spam twice. Ever.
That means: once SpamAssassin decides that a particular email is spam, that email will be considered spam forever.
If one applies some heuristics, as the DCC network does, one can easily raise that ratio by several percentage points, but that also raises the risk of banning legitimate email forever. The DCC heuristics, for instance, consider only 4 KB of an email. Some day eBay emails will most probably share the same header for the first 4 KB before the real information starts, I'm sure. Since phishing mails often pretend to come from eBay, you will get into trouble someday if you try to be too clever about generating fuzzy checksums. :)
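The 4 KB problem can be shown with a toy example in Python. This is not the DCC algorithm, only a stand-in that hashes the first 4 KB of a message; `prefix_checksum`, `full_checksum` and the sample messages are invented for illustration.

```python
import hashlib

PREFIX_LEN = 4 * 1024  # the 4 KB window mentioned above


def prefix_checksum(mail: bytes) -> str:
    """Toy fuzzy checksum: MD5 over the first 4 KB only."""
    return hashlib.md5(mail[:PREFIX_LEN]).hexdigest()


def full_checksum(mail: bytes) -> str:
    """Exact checksum: MD5 over the whole message."""
    return hashlib.md5(mail).hexdigest()


# Two mails that share more than 4 KB of identical leading boilerplate.
boilerplate = b"<html><!-- shared template -->" + b" " * PREFIX_LEN
legit = boilerplate + b"Your auction has ended."
phish = boilerplate + b"Please confirm your password."

same_prefix = prefix_checksum(legit) == prefix_checksum(phish)  # collide
same_full = full_checksum(legit) == full_checksum(phish)        # differ
```

With the prefix checksum, banning the phishing mail would also ban the legitimate one; the full checksum keeps them apart.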
The point here is to accelerate SpamAssassin, not to replace it!
So we stick to an MD5 sum calculation and remember the digests of spam in a Berkeley DB file.
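The idea can be sketched in a few lines of Python. This is only an illustration, not the dejaspam C code: the stdlib `dbm` module stands in for Berkeley DB, the hypothetical `slow_classifier` stands in for SpamAssassin, and `is_spam` is a made-up name. For brevity the digest here covers the whole message rather than selected headers.

```python
import dbm
import hashlib
import os
import tempfile


def is_spam(raw: bytes, db, classify) -> bool:
    """Return True for spam, asking the classifier only for unknown mails."""
    key = hashlib.md5(raw).digest()
    if key in db:
        return True            # seen before and judged spam: skip the work
    verdict = classify(raw)
    if verdict:
        db[key] = b"1"         # remember spam digests only, forever
    return verdict


calls = []  # records every mail the expensive classifier had to look at


def slow_classifier(raw: bytes) -> bool:
    """Hypothetical stand-in for the expensive SpamAssassin run."""
    calls.append(raw)
    return b"viagra" in raw.lower()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dejaspam-demo")
    with dbm.open(path, "c") as db:
        first = is_spam(b"Buy VIAGRA now", db, slow_classifier)
        second = is_spam(b"Buy VIAGRA now", db, slow_classifier)
        ham = is_spam(b"Lunch at noon?", db, slow_classifier)
```

Only spam digests are stored, so a legitimate mail always goes through the full check again; the second copy of the spam never reaches the classifier.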
dejaspam.c : dejaspam client
dejaspamd.c : dejaspam server
Makefile : Makefile