Configuring Bayesian Mail Filter

This document is obsolete -- see http://doc.nethence.com/ instead

http://pbraun.nethence.com/unix/mail/procmail.html

http://pbraun.nethence.com/unix/mail/procmail-bmf.html

http://pbraun.nethence.com/unix/mail/procmail-qsf.html

Installation

Make sure Berkeley DB and its development headers are available. On Red Hat systems,

rpm -q db4 db4-devel

Fetch, compile and install,

wget -O bmf-0.9.4.tar.gz http://sourceforge.net/projects/bmf/files/latest/download

tar xzf bmf-0.9.4.tar.gz

cd bmf-0.9.4/

./configure --help

./configure --without-mysql

make all

make install

Procmail configuration

Assuming you are using procmail (I am; adapt this to your own setup otherwise),

cd ~/

vi .procmailrc

At the top, before all other filter rules, add,

# Bayesian Mail Filter

:0 fw

| bmf -p

:0:

* ^X-Spam-Status: Yes

spam

Note. bmf removes any existing spam status headers and adds its own.
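To check that the filtering recipe really tags messages, it can help to look for the header in a delivered message. The following is a minimal sketch, not part of bmf or procmail: is_tagged_spam is a hypothetical helper name, and it assumes the header looks the way the recipe above expects (a line starting with X-Spam-Status: Yes). It is written in portable sh, so it also runs under ksh.

```shell
#!/bin/sh
# is_tagged_spam FILE (hypothetical helper)
# Succeeds if the header section of the single message in FILE
# carries a line starting with "X-Spam-Status: Yes".
is_tagged_spam() {
    # sed stops printing at the first empty line, i.e. at the end of the
    # header section, so a matching line in the body is never seen.
    sed '/^$/q' "$1" | grep -q '^X-Spam-Status: Yes'
}
```

Run it on a message that has already gone through procmail, for instance one saved from the spam folder into a file of its own; the exit status tells whether the header was found.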

Crontab for learning

Use your IMAP client to sort messages into two mboxes, e.g. _bmf.learn and _bmf.unlearn, to teach BMF what is spam and what is not. Move spam that slipped through to the former. When a few false positives show up in the spam folder, as they will at the beginning, move them to the latter.

You can then configure this script,

cd ~/

mkdir -p bin/

cd bin/

vi cron.bmf

like,

#!/bin/ksh

# proceeding as much as possible, no set -e

MAILDIR=/var/spool/virtual/example.net/user.imap

learn() {

        print learning what is spam...\\c

        bmf -s < _bmf.learn && print \ done

        print reprocessing the _bmf.learn mbox...\\c

        reprocess-mbox-via-procmail _bmf.learn && print \ done

}

unlearn() {

        print unlearning false positives...\\c

        bmf -n < _bmf.unlearn && print \ done

        #bmf -N < _bmf.unlearn

        print reprocessing the _bmf.unlearn mbox...\\c

        reprocess-mbox-via-procmail _bmf.unlearn && print \ done

}

cd $MAILDIR/

test -s _bmf.learn && learn || print ok _bmf.learn is empty

test -s _bmf.unlearn && unlearn || print ok _bmf.unlearn is empty

Note. Change the MAILDIR variable accordingly.
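The script above calls reprocess-mbox-via-procmail, which this document does not define. One possible stand-in is sketched below, assuming formail (shipped with procmail) is installed and reports failure through its exit status. The function name reprocess_mbox is hypothetical (underscores are used because hyphens are not portable in shell function names). It takes no lock on the mbox, so it is only safe while nothing else is writing to that folder.

```shell
#!/bin/sh
# reprocess_mbox MBOX (hypothetical stand-in for reprocess-mbox-via-procmail)
# Feeds every message of MBOX back through procmail, then empties MBOX.
reprocess_mbox() {
    mbox=$1
    # A missing or empty mbox needs no work.
    [ -s "$mbox" ] || return 0
    # formail -s splits the mbox and starts procmail once per message.
    formail -s procmail < "$mbox" || return 1
    # Truncate only if formail exited successfully, so that no message
    # is dropped when redelivery fails.
    : > "$mbox"
}
```

Saved as a script named reprocess-mbox-via-procmail with a final line calling reprocess_mbox "$1", and placed in ~/bin, it could be used by the cron script as is. Where the IMAP server may write to the folder at the same time, the lockfile command from the procmail package is one way to add locking.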

Make it executable,

chmod +x cron.bmf

and run it every night with e.g. this crontab,

SHELL=/bin/ksh

PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/home/USERNAME/bin

HOME=/home/USERNAME

MAILTO=root

LANG=en_US.UTF-8

59 2 * * * cron.bmf

Note. Change USERNAME accordingly.

Note. The folders have to be stored in the traditional Unix mbox format, otherwise you won't be able to train bmf with several emails at once.
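To see which storage format a folder uses and how many messages are waiting in it, something like the following sketch can help. count_queued is a hypothetical helper, not part of bmf. It assumes that a Maildir is a directory with cur and new subdirectories, and that an mbox is a plain file in which every message starts with a "From " separator line (with body lines escaped as ">From ", as mbox writers normally do).

```shell
#!/bin/sh
# count_queued PATH (hypothetical helper)
# Prints the number of messages stored in PATH, which may be either
# a Maildir folder or an mbox file.
count_queued() {
    if [ -d "$1/cur" ] && [ -d "$1/new" ]; then
        # Maildir: one file per message.
        find "$1/cur" "$1/new" -type f | wc -l | tr -d ' '
    elif [ -f "$1" ]; then
        # mbox: count the "From " separator lines.
        # grep -c exits non-zero on a count of 0, hence the "|| :".
        grep -c '^From ' "$1" || :
    else
        echo "count_queued: $1 is neither an mbox nor a Maildir" >&2
        return 1
    fi
}
```

For example, count_queued _bmf.learn run from the mail directory prints how many messages the next nightly run would learn from.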

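For Maildir folders, each message is a separate file and is fed to procmail one at a time. The example below was written for QSF (hence the qsf calls in the two recipe files); adapt those calls for bmf,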

vi reprocess_maildir.learn

like,

#!/bin/ksh

# Daily QSF learn & unlearn

set -e

# $1: Maildir folder to drain, $2: procmailrc used for learning or unlearning

fbayes() {

        [[ -z $1 ]] && print \$1 missing && exit 1

        [[ -z $2 ]] && print \$2 missing && exit 1

        find "$1/cur" "$1/new" -type f | while read -r msg; do

                print Processing "$msg"...\\c

                /usr/local/bin/procmail -m "$2" < "$msg"

                /usr/local/bin/procmail < "$msg"

                rm -f "$msg"

                print \ Done

        done

}

cd $HOME/Maildir/

print Learning

fbayes .spam_learn $HOME/.procmailrc.learn

print ''

print Unlearning

fbayes .spam_unlearn $HOME/.procmailrc.unlearn

print ''

.procmailrc.learn being,

SHELL=/bin/ksh

DROPPRIVS=yes

VERBOSE=no

ORGMAIL=$HOME/Maildir/

MAILDIR=$HOME/Maildir

DEFAULT=$ORGMAIL

SYSYEAR=`date +%Y`

LOGFILE=$HOME/.procmailrc.log.$SYSYEAR

# QSF learn

:0

| qsf -m

.procmailrc.unlearn being,

SHELL=/bin/ksh

DROPPRIVS=yes

VERBOSE=no

ORGMAIL=$HOME/Maildir/

MAILDIR=$HOME/Maildir

DEFAULT=$ORGMAIL

SYSYEAR=`date +%Y`

LOGFILE=$HOME/.procmailrc.log.$SYSYEAR

# QSF unlearn

:0

| qsf -M

References

Bayesian (http://acme.com/mail_filtering/bayesian_frameset.html)

bmf: Bayesian Mail Filter (http://jblevins.org/log/bmf)

bmf training from cron (http://comments.gmane.org/gmane.mail.bmf.user/38)

Filtering spam with bmf, procmail and mutt (http://e.molioner.dk/guides/bmfprocmailmutt)

Flail Spam Mitigation Setup (http://flail.org/spam.html)

Note. This benchmark left out BMF!

The Grumpy Editor's guide to bayesian spam filters (http://lwn.net/Articles/172491/)

A grumpy editor's bayesian followup (http://lwn.net/Articles/173910/)

Original papers

A Plan for Spam (http://paulgraham.com/spam.html) 08.2002

Better Bayesian Filtering (http://paulgraham.com/better.html) 01.2003