On Sat, Jan 03, 2004 at 11:25:20PM +0000, Niall O Broin wrote:
> The thought occurs then that perhaps some kind of addition to a Bayesian
> filter is necessary, so that if a message has above a certain threshold of
> unknown words, it scores highly. Even better might be to compare every word
> against a dictionary, and again give high score to a mail which contained
> lots of non-dictionary words. Lots of problems with this, of course, not
> least being the CPU cost of such an approach.
Bayesian filtering is not for the CPU-shy in any event :) Much bigger problems
would be posed by those among us who write and receive e-mail in
a variety of languages, sometimes even interspersed :/ That and the
modern madness that is txt sp3k, now to be found in many people's
daily e-mails :(
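
To make the objection concrete, here is a rough Python sketch of the
non-dictionary-word check suggested above. Everything in it is an assumption
made up for illustration: the tiny ENGLISH_WORDS list, the function names and
the 0.5 threshold do not come from any real filter.

```python
import re

# Hypothetical, tiny word list; a real check would load a full dictionary.
ENGLISH_WORDS = {"the", "meeting", "is", "at", "noon", "see", "you", "there"}


def unknown_word_ratio(text, dictionary):
    """Return the fraction of words in text that are not in dictionary."""
    # \w+ keeps accented letters and digits together, so "sp3k" is one word.
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for word in words if word not in dictionary)
    return unknown / len(words)


def looks_suspicious(text, dictionary, threshold=0.5):
    """Flag a message when more than threshold of its words are unknown."""
    return unknown_word_ratio(text, dictionary) > threshold


# An ordinary English message passes, but a legitimate Irish one is flagged
# because none of its words appear in the English-only list.
english = "The meeting is at noon"
irish = "Beidh an cruinniú ar siúl ag meán lae"
english_flagged = looks_suspicious(english, ENGLISH_WORDS)
irish_flagged = looks_suspicious(irish, ENGLISH_WORDS)
```

With a single-language dictionary, every legitimate message in another
language looks like junk, which is the false-positive problem in a nutshell.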
--
Colm MacCárthaigh Public Key: colm+pgp at stdlib.net