This document is obsolete; see http://doc.nethence.com/ instead.

Configuring Bayesian Mail Filter 

 

http://pbraun.nethence.com/unix/mail/procmail.html 

http://pbraun.nethence.com/unix/mail/procmail-bmf.html 

http://pbraun.nethence.com/unix/mail/procmail-qsf.html 

 

Installation 

Make sure Berkeley DB and its development headers are available. On Red Hat systems, 

rpm -q db4 db4-devel

 

Fetch, compile and install, 

wget -O bmf-0.9.4.tar.gz http://sourceforge.net/projects/bmf/files/latest/download
tar xzf bmf-0.9.4.tar.gz
cd bmf-0.9.4/
./configure --help
./configure --without-mysql
make all
make install
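
A quick way to confirm that the install step worked is to check that the binary can be found on the PATH. This is only a sketch in plain POSIX sh; the helper names have_cmd and report_bmf are made up for illustration and are not part of bmf.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper name, not something bmf provides.
# It succeeds when the shell can find the named command.
have_cmd() {
        command -v "$1" >/dev/null 2>&1
}

# report on bmf without aborting when it is missing
report_bmf() {
        if have_cmd bmf; then
                printf 'bmf found at %s\n' "$(command -v bmf)"
        else
                printf 'bmf not found, check your PATH\n'
        fi
}

report_bmf
```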

 

Procmail configuration 

Assuming you are using procmail (that is what I use; adapt to your own setup if needed), 

cd ~/
vi .procmailrc

At the top, before all other filter rules, add, 

#
# Bayesian Mail Filter
#
:0 fw
| bmf -p

 

:0:
* ^X-Spam-Status: Yes
spam

Note. bmf removes any existing spam status headers and adds its own. 
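
To check by hand what the second recipe would match, the same test can be run against a saved message. A minimal sketch in plain POSIX sh; is_flagged_spam is a hypothetical helper name. It only looks at the header section and ignores case, since procmail conditions are case-insensitive by default.

```shell
#!/bin/sh
# is_flagged_spam: hypothetical helper for manual checks.
# sed prints the message up to the first blank line (end of headers)
# and quits, so a matching line in the body is not taken into account.
is_flagged_spam() {
        sed '/^$/q' "$1" | grep -qi '^X-Spam-Status: Yes'
}
```

For instance, assuming a single message saved as message.eml, something like bmf -p < message.eml > tagged.eml followed by is_flagged_spam tagged.eml should tell whether procmail would file it into the spam folder.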

 

Crontab for learning 

Use your IMAP client to sort messages into two mboxes, e.g. _bmf.learn and _bmf.unlearn, which tell BMF what is spam and what is not. Move missed spam to the former. While the filter is still being trained, a few false positives will show up in the spam folder; move those to the latter. 

 

You can then configure this script, 

cd ~/
mkdir -p bin/
cd bin/
vi cron.bmf

like, 

#!/bin/ksh
# proceeding as much as possible, no set -e

MAILDIR=/var/spool/virtual/example.net/user.imap 

 

learn() {
        print learning what is spam...\\c
        bmf -s < _bmf.learn && print \ done

 

        print reprocessing the _bmf.learn mbox...\\c
        reprocess-mbox-via-procmail _bmf.learn && print \ done
}

 

unlearn() {
        print unlearning false positives...\\c
        bmf -n < _bmf.unlearn && print \ done
        # -N also undoes a prior registration as spam
        #bmf -N < _bmf.unlearn

 

        print reprocessing the _bmf.unlearn mbox...\\c
        reprocess-mbox-via-procmail _bmf.unlearn && print \ done
}

 

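cd "$MAILDIR"/ || exit 1
# if/else so that a failing learn or unlearn is not reported as an empty mbox
if test -s _bmf.learn; then learn; else print ok _bmf.learn is empty; fi
if test -s _bmf.unlearn; then unlearn; else print ok _bmf.unlearn is empty; fi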

Note. Adjust the MAILDIR variable to match your setup. 
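
The reprocess-mbox-via-procmail command called in the script above is not a standard tool, and this document does not show it. One possible stand-in, sketched here under that assumption, splits the mbox with formail -s (formail ships with procmail), pipes each message to procmail, and empties the mbox once formail exits successfully. The function name reprocess_mbox is hypothetical, and the sketch is plain POSIX sh rather than ksh.

```shell
#!/bin/sh
# reprocess_mbox: hypothetical stand-in for the undocumented
# reprocess-mbox-via-procmail command used in cron.bmf.
# usage: reprocess_mbox MBOX [DELIVERY-COMMAND...]
reprocess_mbox() {
        mbox=$1
        shift
        # default to plain procmail, which reads ~/.procmailrc
        [ $# -gt 0 ] || set -- procmail

        # nothing to do for a missing or empty mbox
        [ -s "$mbox" ] || return 0

        # formail -s runs the given command once per message;
        # truncate the mbox only if formail exits successfully
        if formail -s "$@" < "$mbox"; then
                : > "$mbox"
        else
                printf 'reprocessing %s failed, mbox left untouched\n' "$mbox" >&2
                return 1
        fi
}
```

With this in place, reprocess_mbox _bmf.learn would take the place of the placeholder call inside learn() and unlearn().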

Make the cron.bmf script executable, 

chmod +x cron.bmf

and run it every night with, for example, this crontab, 

SHELL=/bin/ksh
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/home/USERNAME/bin
HOME=/home/USERNAME
MAILTO=root
LANG=en_US.UTF-8
59 2 * * * cron.bmf

Note. Replace USERNAME with your login name. 

Note. The learn and unlearn folders must use the traditional Unix mbox format. With any other kind of mail storage you cannot train bmf on several messages at once. 
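
In an mbox file every message starts with a separator line beginning with "From " (with a trailing space), which is how a single file can carry several messages. The sketch below counts those lines to see how many messages are waiting before the cron job runs. The function name count_mbox is made up, and the count is an approximation that assumes body lines starting with "From " were escaped when the messages were written, as mbox writers normally do.

```shell
#!/bin/sh
# count_mbox: hypothetical helper, prints the number of "From " separator
# lines in an mbox file (0 for an empty or missing file).
count_mbox() {
        if [ -s "$1" ]; then
                # grep -c exits non-zero when the count is 0, hence the || :
                grep -c '^From ' "$1" || :
        else
                echo 0
        fi
}
```

For example, cd "$MAILDIR" && count_mbox _bmf.learn prints the number of messages queued for learning.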

 

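For Maildir folders, each message is a separate file, so the messages are fed one at a time. The example below was written for QSF; for BMF, replace qsf -m and qsf -M in the rc files with bmf -s and bmf -n. Create the script, 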

vi reprocess_maildir.learn

like, 

#!/bin/ksh
#
# Daily QSF learn & unlearn
#
set -e

 

fbayes() {
        [[ -z $1 ]] && print \$1 missing && exit 1
        [[ -z $2 ]] && print \$2 missing && exit 1

 

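        # quoting and read -r keep unusual file names intact
        find "$1/cur" "$1/new" -type f | while read -r msg; do
                print Processing $msg...\\c
                # train with the given rc file, then redeliver through ~/.procmailrc
                /usr/local/bin/procmail -m "$2" < "$msg"
                /usr/local/bin/procmail < "$msg"
                rm -f "$msg"
                print \ Done
        done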
}

 

cd $HOME/Maildir/

 

print Learning
fbayes .spam_learn $HOME/.procmailrc.learn
print ''

 

print Unlearning
fbayes .spam_unlearn $HOME/.procmailrc.unlearn
print ''

 

where .procmailrc.learn contains, 

SHELL=/bin/ksh
DROPPRIVS=yes
VERBOSE=no
ORGMAIL=$HOME/Maildir/
MAILDIR=$HOME/Maildir
DEFAULT=$ORGMAIL
SYSYEAR=`date +%Y`
LOGFILE=$HOME/.procmailrc.log.$SYSYEAR

 

#
# QSF learn
#
:0
| qsf -m

 

and .procmailrc.unlearn contains, 

SHELL=/bin/ksh
DROPPRIVS=yes
VERBOSE=no
ORGMAIL=$HOME/Maildir/
MAILDIR=$HOME/Maildir
DEFAULT=$ORGMAIL
SYSYEAR=`date +%Y`
LOGFILE=$HOME/.procmailrc.log.$SYSYEAR

 

#
# QSF unlearn
#
:0
| qsf -M

 

References 

Bayesian (http://acme.com/mail_filtering/bayesian_frameset.html) 

bmf: Bayesian Mail Filter (http://jblevins.org/log/bmf) 

bmf training from cron (http://comments.gmane.org/gmane.mail.bmf.user/38) 

Filtering spam with bmf, procmail and mutt (http://e.molioner.dk/guides/bmfprocmailmutt) 

Flail Spam Mitigation Setup (http://flail.org/spam.html) 

 

Note. The benchmark below does not cover BMF. 

The Grumpy Editor's guide to bayesian spam filters (http://lwn.net/Articles/172491/) 

A grumpy editor's bayesian followup (http://lwn.net/Articles/173910/) 

 

Original papers 

A Plan for Spam (http://paulgraham.com/spam.html) 08.2002 

Better Bayesian Filtering (http://paulgraham.com/better.html) 01.2003