Developer info

So you want to get involved?

Running the code

This project works with Python 2.2 or later, available from Subversion on SourceForge. It will not work on Python 2.1.x or earlier, nor is it ever likely to do so.

If you're running Python 2.2 or 2.2.1, you'll need to separately fetch the latest email package. You can get this from SourceForge (you'll need version 2.4.3 or later; version 3.0 or later is recommended).

The SpamBayes code itself is also available via Subversion, or from the download page.

I just want to make suggestions

Excellent! Note, though, that this project takes a very results-oriented approach to code changes: if a change doesn't produce an improvement in results from various test corpora, it's not going to get very far.

Note that a lot of "intuitive" approaches and ideas end up making things worse, not better - it seems that stupid beats smart in many or even most cases.
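As a rough illustration of what "an improvement in results" can mean, here is a minimal sketch that compares false positive and false negative rates for a baseline and a candidate change. It is not the project's own test tooling: the function names and the simple two-way ham/spam outcome are assumptions made for the example, and it is written for a current Python 3 rather than the Python 2.2 mentioned above.

```python
def error_rates(results):
    """Compute error rates from (is_spam, flagged_as_spam) pairs.

    Hypothetical helper for illustration; not part of SpamBayes.
    """
    ham = spam = false_pos = false_neg = 0
    for is_spam, flagged in results:
        if is_spam:
            spam += 1
            if not flagged:
                false_neg += 1  # spam that slipped through
        else:
            ham += 1
            if flagged:
                false_pos += 1  # ham wrongly flagged as spam
    return {
        "fp_rate": false_pos / ham if ham else 0.0,
        "fn_rate": false_neg / spam if spam else 0.0,
    }


def is_improvement(baseline, candidate):
    """True if candidate is no worse on either rate and better on one."""
    no_worse = (candidate["fp_rate"] <= baseline["fp_rate"]
                and candidate["fn_rate"] <= baseline["fn_rate"])
    better = (candidate["fp_rate"] < baseline["fp_rate"]
              or candidate["fn_rate"] < baseline["fn_rate"])
    return no_worse and better
```

Both runs should be scored on the same test messages, otherwise the comparison says more about the data than about the change.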

Documentation on things that have already been tried is available as links from the documentation page.

So what needs to be done?

1.0 was released in July 2004, and was followed up by three bugfix releases starting in November 2004. The current stable release is 1.0.4. This is likely to be the final release in the 1.0.x line.

Since May 2004, work has been carried out on a 1.1 release, which includes many improvements, as well as bug fixes, compared to the 1.0.x branch. The latest alpha release is 1.1a6 (April 2010). If we could find more time or more help we could get to beta, release candidate and final releases of 1.1. We hope that a stable 1.1 release will be made during 2007, although this date is certainly not fixed.

The 1.1 line will be frozen for non-bugfix changes from the first beta release. Many of the changes desired by the developers have been implemented, or partly so, but there is still time for further improvement. There is no time limit on implementing bug fixes.

Some key work that is in progress for 1.1, which you could assist with (particularly in testing) includes:

  • Internationalisation. A small number of people have expressed interest in translating the SpamBayes documentation and user interfaces into other languages.
  • Database backend. The majority of SpamBayes users use the bsddb support included with Python (a few use a pickled Python dict, and even fewer use the experimental MySQL/PostgreSQL support). The bsddb solution is problematic, in that users' databases sometimes get corrupted, but we don't know what causes that. New backends, particularly ZODB/ZEO, have been added, and the SQL backends improved.
  • Improvement in the unit testing suite.
  • Testing and/or improving the image handling capabilities. 1.1a3 introduced OCR capability using the open source gocr program and PIL.
  • Testing the new core_server.py application which implements a plugin architecture for external protocol adapters. The first adapter provides an XML-RPC interface, making it possible to extend SpamBayes to websites and other non-mail applications. You could interface this server to web applications such as Trac, MoinMoin or your favorite blog software. You could also implement a POP3 protocol adapter so we can merge core_server.py and sb-server.py.

The other big body of work is monitoring the bug reports and feature requests that come in and trying to resolve those.

Collecting training data

One of the tricky problems is collecting a set of data that's "good enough". There are a few collections of spam out on the net. Note, though, that using spam and ham from different sources often leads to the classifier picking up on clues that identify the source rather than the kind of message: for instance, a different set of hostnames in the Received: headers. This isn't a killer problem; it just means that you need to think harder about what to feed into the classifier.
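One way to spot this kind of source clue before training is to list the hostnames that appear in the Received: headers of only one corpus. The sketch below is an illustration, not SpamBayes code: the function names are hypothetical, the regular expression is a deliberately loose guess at the "from host" and "by host" parts of a Received: header, and it assumes a current Python 3 with the standard email package.

```python
import email
import re
from collections import Counter

# Loose match for the host name following "from" or "by" in a Received: header.
_HOST_RE = re.compile(r"\b(?:from|by)\s+([A-Za-z0-9.-]+)", re.IGNORECASE)


def received_hosts(raw_message):
    """Return the set of lower-cased host names named in Received: headers."""
    msg = email.message_from_string(raw_message)
    hosts = set()
    for header in msg.get_all("Received", []):
        for host in _HOST_RE.findall(header):
            hosts.add(host.lower().rstrip("."))
    return hosts


def host_counts(raw_messages):
    """Count, per host, how many messages in a corpus mention it."""
    counts = Counter()
    for raw in raw_messages:
        counts.update(received_hosts(raw))
    return counts


def one_sided_hosts(spam_messages, ham_messages):
    """List hosts seen in only one corpus; these are likely source clues."""
    spam = host_counts(spam_messages)
    ham = host_counts(ham_messages)
    return {
        "spam_only": sorted(set(spam) - set(ham)),
        "ham_only": sorted(set(ham) - set(spam)),
    }
```

If nearly every host ends up in one list or the other, the two corpora probably came from different places, and the classifier may learn the source rather than the spam.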