Thanks to contributions from a couple of people the other day, I came up with
this little script to produce a small report on the Bayes DB:
echo Spam Assassin Bayes Statistics
echo ""
echo Bayes Token Count
echo "Total Ham Spam"
sa-learn --dump | awk '{count += 1; if ($1 > 0.5) spam+=1; \
if ($1 < 0.5) ham+=1} END {print count "\t" ham "\t" spam}'
echo ""
echo -n "Number of ham messages learnt from: "
sa-learn --dump magic |awk '/nham/ {print $3}'
echo -n "Number of spam messages learnt from: "
sa-learn --dump magic |awk '/nspam/ {print $3}'
which runs at the end of a script that sa-learns spam placed in folders by
humans during the day. After its nightly run, it reported as follows:
Spam Assassin Bayes Statistics
Bayes Token Count
Total Ham Spam
140114 78443 61671
Number of ham messages learnt from: 2109
Number of spam messages learnt from: 1387
I then fed sa-learn something over 1000 pieces of ham, and now the same script
gives me:
Spam Assassin Bayes Statistics
Bayes Token Count
Total Ham Spam
153518 10 153508
Number of ham messages learnt from: 2850
Number of spam messages learnt from: 0
AARGH! What the hell has happened there? It has apparently forgotten about ALL
the spam messages it ever learnt from and, conversely, some 78000 ham tokens
have become spam tokens.
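One thing the second report does show: its total (153518) exceeds the ntokens figure in the magic dump further down (153508) by exactly 10, which is also the "ham" count, and that dump has 10 "non-token data" lines, each with 0.000 in the first column. So the 10 "ham tokens" may simply be the header lines. Below is a minimal sketch of a counter that skips those lines and compares on the first field only. It assumes the dump layout shown in this post (probability in column one, header lines marked "non-token data:"); the function name count_tokens is made up for illustration.

```shell
# Hypothetical helper: reads `sa-learn --dump` output on stdin and prints
# total, ham and spam token counts, ignoring the "non-token data" header lines.
count_tokens() {
    awk '
        /non-token data:/ { next }      # skip the magic/header lines
        { total++ }                     # every remaining line is one token
        $1 > 0.5 { spam++ }             # first field is the token probability
        $1 < 0.5 { ham++ }              # tokens at exactly 0.5 count as neither
        END { printf "%d\t%d\t%d\n", total, ham, spam }
    '
}

# Usage (against the live database): sa-learn --dump | count_tokens
```

Note that with this version, tokens sitting at exactly 0.5 appear in the total but in neither the ham nor the spam column, so the two need not add up to the total.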
Straight sa-learn --dump magic now gives
0.000 0 3 0 non-token data: bayes db version
0.000 0 0 0 non-token data: nspam
0.000 0 2850 0 non-token data: nham
0.000 0 153508 0 non-token data: ntokens
0.000 0 1091609393 0 non-token data: oldest atime
0.000 0 1107564300 0 non-token data: newest atime
0.000 0 1107564852 0 non-token data: last journal sync atime
0.000 0 1107564590 0 non-token data: last expiry atime
0.000 0 1382400 0 non-token data: last expire atime delta
0.000 0 17827 0 non-token data: last expire reduction count
whereas sa-learn --dump magic from the databases as of 19:00 last night
(retrieved from the warm standby box) gives
0.000 0 3 0 non-token data: bayes db version
0.000 0 1342 0 non-token data: nspam
0.000 0 2096 0 non-token data: nham
0.000 0 138010 0 non-token data: ntokens
0.000 0 1106096390 0 non-token data: oldest atime
0.000 0 1107544172 0 non-token data: newest atime
0.000 0 1107538029 0 non-token data: last journal sync atime
0.000 0 1107478750 0 non-token data: last expiry atime
0.000 0 1382400 0 non-token data: last expire atime delta
0.000 0 5589 0 non-token data: last expire reduction count
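To line the two listings up field by field rather than by eye, something like the following could be used. It is only a sketch: it assumes each `sa-learn --dump magic` listing has been saved to a plain file, and the helper names magic_value and compare_magic, as well as the file names in the usage line, are placeholders.

```shell
# Hypothetical helper: print the value of one named field from a saved
# `sa-learn --dump magic` listing, e.g. magic_value nspam live.txt
magic_value() {
    awk -v key="$1" '
        /non-token data:/ {
            name = $0
            sub(/^.*non-token data: /, "", name)   # keep only the field name
            if (name == key) print $3              # the value is the third column
        }
    ' "$2"
}

# Hypothetical helper: print field, old value, new value and the difference.
compare_magic() {
    for key in nspam nham ntokens; do
        old=$(magic_value "$key" "$1")
        new=$(magic_value "$key" "$2")
        printf '%s\t%s\t%s\t%d\n' "$key" "$old" "$new" $((new - old))
    done
}

# Usage: compare_magic standby.txt live.txt
```

Run on the two listings above, this would show nspam going from 1342 to 0, nham from 2096 to 2850, and ntokens from 138010 to 153508.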
Can anyone shed any light on this?
--
Niall