This document is obsolete -- see http://doc.nethence.com/ instead
Configuring Bayesian Mail Filter
http://pbraun.nethence.com/unix/mail/procmail.html
http://pbraun.nethence.com/unix/mail/procmail-bmf.html
http://pbraun.nethence.com/unix/mail/procmail-qsf.html
Installation
Make sure Berkeley DB is available. On Redhat systems,
rpm -q db4 db4-devel
Fetch, compile and install,
wget -O bmf-0.9.4.tar.gz http://sourceforge.net/projects/bmf/files/latest/download
tar xzf bmf-0.9.4.tar.gz
cd bmf-0.9.4/
./configure --help
./configure --without-mysql
make all
make install
Procmail configuration
Assuming you are using procmail (I am; adapt to your own setup if needed),
cd ~/
vi .procmailrc
Above all other filter rules, add,
#
# Bayesian Mail Filter
#
:0 fw
| bmf -p
:0:
* ^X-Spam-Status: Yes
spam
Note. bmf removes any existing spam status headers and adds its own.
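To check by hand whether a saved message would be caught by the second recipe, the header test can be approximated in the shell. This is only a sketch: is_flagged_spam is a hypothetical helper name, and it assumes the message is a single file whose headers end at the first empty line. The match is case-insensitive, as procmail conditions are by default.

```shell
# is_flagged_spam FILE
# Succeeds if the header block of FILE (everything before the first empty
# line) contains a line starting with "X-Spam-Status: Yes".
is_flagged_spam() {
    # sed prints the header lines and quits at the first empty line,
    # so a matching string in the body is not taken into account
    sed -n '/^$/q;p' "$1" | grep -qi '^X-Spam-Status: Yes'
}
```

For example, is_flagged_spam message.txt && echo flagged, where message.txt is a placeholder file name.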
Crontab for learning
Use your IMAP client to file messages into two mboxes, e.g. _bmf.learn and _bmf.unlearn, so that BMF learns what is spam and what is not. Move spam that slipped through into the former; when false positives show up in the spam folder, as a few will at the beginning, move them into the latter.
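To see how many messages are waiting in one of these mboxes, counting the "From " separator lines is usually enough. A minimal sketch, assuming the usual mbox convention that "From " at the start of a body line is escaped as ">From "; count_mbox is a hypothetical helper name.

```shell
# count_mbox FILE
# Prints the number of messages in an mbox file (0 if it is missing or empty).
count_mbox() {
    [ -f "$1" ] || { echo 0; return 0; }
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean
    grep -c '^From ' "$1" || true
}
```

For example, count_mbox _bmf.learn, run from the directory that holds the mboxes.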
You can then configure this script,
cd ~/
mkdir -p bin/
cd bin/
vi cron.bmf
like,
#!/bin/ksh
# proceeding as much as possible, no set -e
MAILDIR=/var/spool/virtual/example.net/user.imap
learn() {
print learning what is spam...\\c
bmf -s < _bmf.learn && print \ done
print reprocessing the _bmf.learn mbox...\\c
reprocess-mbox-via-procmail _bmf.learn && print \ done
}
unlearn() {
print unlearning false positives
bmf -n < _bmf.unlearn && print WORKS
#bmf -N < _bmf.unlearn
print reprocessing the _bmf.unlearn mbox...\\c
reprocess-mbox-via-procmail _bmf.unlearn && print \ done
}
cd "$MAILDIR"/ || exit 1
if test -s _bmf.learn; then learn; else print ok _bmf.learn is empty; fi
if test -s _bmf.unlearn; then unlearn; else print ok _bmf.unlearn is empty; fi
Note. Change the MAILDIR variable accordingly.
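The script calls reprocess-mbox-via-procmail, which is not defined on this page. What follows is only a guess at what such a helper might do, not the original: it assumes formail (shipped with procmail) is installed, feeds every message of the mbox back through procmail, and empties the mbox only if that succeeded. The name reprocess_mbox and the optional command argument are hypothetical.

```shell
# reprocess_mbox MBOX [COMMAND...]
# Feeds MBOX on stdin to COMMAND, then truncates MBOX if COMMAND succeeded.
reprocess_mbox() {
    mbox=$1
    shift
    # Default (an assumption): split the mbox with formail and hand each
    # message to procmail
    [ $# -gt 0 ] || set -- formail -s procmail
    # Nothing to do for a missing or empty mbox
    [ -s "$mbox" ] || return 0
    # Truncate only when every message was handed over successfully,
    # so that nothing is lost if delivery fails
    "$@" < "$mbox" && : > "$mbox"
}
```

For example, reprocess_mbox _bmf.learn. Note that this sketch does no locking, so it should not run while the IMAP server is writing to the same mbox.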
Make it executable,
chmod +x cron.bmf
and run it every night with e.g. this crontab,
SHELL=/bin/ksh
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/home/USERNAME/bin
HOME=/home/USERNAME
MAILTO=root
LANG=en_US.UTF-8
59 2 * * * cron.bmf
Note. Change USERNAME accordingly.
Note. The mail storage has to be in traditional Unix mbox format; otherwise you cannot train bmf on several messages at once.
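If you are not sure which storage format a given folder uses, a quick check is possible: a Maildir is a directory containing cur, new and tmp subdirectories, whereas an mbox is a single file. A small sketch; mail_store_kind is a hypothetical helper name.

```shell
# mail_store_kind PATH
# Prints "maildir", "mbox" or "unknown" depending on what PATH looks like.
mail_store_kind() {
    if [ -d "$1/cur" ] && [ -d "$1/new" ] && [ -d "$1/tmp" ]; then
        echo maildir
    elif [ -f "$1" ]; then
        # A plain file is assumed to be an mbox; its content is not inspected
        echo mbox
    else
        echo unknown
    fi
}
```

For example, mail_store_kind "$HOME/Maildir".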
For Maildir folders, process each message file individually (this example uses QSF rather than bmf),
vi reprocess_maildir.learn
like,
#!/bin/ksh
#
# Daily QSF learn & unlearn
#
set -e
fbayes() {
[[ -z $1 ]] && print \$1 missing && exit 1
[[ -z $2 ]] && print \$2 missing && exit 1
find "$1/cur" "$1/new" -type f | while read -r msg; do
print Processing $msg...\\c
/usr/local/bin/procmail -m "$2" < "$msg"
/usr/local/bin/procmail < "$msg"
rm -f "$msg"
print \ Done
done
}
cd $HOME/Maildir/
print Learning
fbayes .spam_learn $HOME/.procmailrc.learn
print ''
print Unlearning
fbayes .spam_unlearn $HOME/.procmailrc.unlearn
print ''
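Since the script above deletes every message once it has been processed, it can be worth checking how many messages are queued before running it. A minimal sketch; count_maildir is a hypothetical helper name, and the folder names are the ones used in the script.

```shell
# count_maildir DIR
# Prints the number of message files in DIR/cur and DIR/new.
count_maildir() {
    # Errors for missing directories are discarded, so the count is then 0;
    # tr strips the padding that some wc implementations add
    find "$1/cur" "$1/new" -type f 2>/dev/null | wc -l | tr -d ' '
}
```

For example, count_maildir "$HOME/Maildir/.spam_learn".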
with .procmailrc.learn being,
SHELL=/bin/ksh
DROPPRIVS=yes
VERBOSE=no
ORGMAIL=$HOME/Maildir/
MAILDIR=$HOME/Maildir
DEFAULT=$ORGMAIL
SYSYEAR=`date +%Y`
LOGFILE=$HOME/.procmailrc.log.$SYSYEAR
#
# QSF learn
#
:0
| qsf -m
and .procmailrc.unlearn being,
SHELL=/bin/ksh
DROPPRIVS=yes
VERBOSE=no
ORGMAIL=$HOME/Maildir/
MAILDIR=$HOME/Maildir
DEFAULT=$ORGMAIL
SYSYEAR=`date +%Y`
LOGFILE=$HOME/.procmailrc.log.$SYSYEAR
#
# QSF unlearn
#
:0
| qsf -M
References
Bayesian (http://acme.com/mail_filtering/bayesian_frameset.html)
bmf: Bayesian Mail Filter (http://jblevins.org/log/bmf)
bmf training from cron (http://comments.gmane.org/gmane.mail.bmf.user/38)
Filtering spam with bmf, procmail and mutt (http://e.molioner.dk/guides/bmfprocmailmutt)
Flail Spam Mitigation Setup (http://flail.org/spam.html)
Note. This benchmark leaves out BMF.
The Grumpy Editor's guide to bayesian spam filters (http://lwn.net/Articles/172491/)
A grumpy editor's bayesian followup (http://lwn.net/Articles/173910/)
Original papers
A Plan for Spam (http://paulgraham.com/spam.html) 08.2002
Better Bayesian Filtering (http://paulgraham.com/better.html) 01.2003