Pre-SourceForge ChangeLog
This changelog lists the commits to the spambayes project before the
separate SourceForge project was set up. See also the
old CVS repository, but bear in mind that it's now out of date; you probably want to be looking at the current CVS.
2002-09-06 02:27 tim_one
* GBayes.py (1.16), Tester.py (1.4), classifier.py (1.12),
cleanarch (1.3), mboxcount.py (1.6), rebal.py (1.4), setup.py
(1.2), split.py (1.6), splitn.py (1.3), timtest.py (1.18):
This code has been moved to a new SourceForge project (spambayes).
2002-09-05 15:37 tim_one
* classifier.py (1.11):
Added note about MINCOUNT oddities.
2002-09-05 14:32 tim_one
* timtest.py (1.17):
Added note about word length.
2002-09-05 13:48 tim_one
* timtest.py (1.16):
tokenize_word(): Oops! This was awfully permissive in what it
took as being "an email address". Tightened that, and also
avoided 5-gram'ing of email addresses w/ high-bit characters.
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.025 0.025 tied
0.025 0.025 tied
0.050 0.050 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.050 lost
0.075 0.075 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.050 tied
won 0 times
tied 19 times
lost 1 times
total unique fp went from 7 to 8
false negative percentages
0.764 0.691 won
0.691 0.655 won
0.981 0.945 won
1.309 1.309 tied
1.418 1.164 won
0.873 0.800 won
0.800 0.763 won
1.163 1.163 tied
1.491 1.345 won
1.200 1.127 won
1.381 1.345 won
1.454 1.490 lost
1.164 0.909 won
0.655 0.582 won
0.655 0.691 lost
1.163 1.163 tied
1.200 1.018 won
0.982 0.873 won
0.982 0.909 won
1.236 1.127 won
won 15 times
tied 3 times
lost 2 times
total unique fn went from 260 to 249
Note: Each of the two losses there consists of just 1 msg difference.
The wins are bigger as well as being more common, and 260-249 = 11
spams no longer sneak by any run (which is more than 4% of the 260
spams that used to sneak thru!).
2002-09-05 11:51 tim_one
* classifier.py (1.10):
Comment about test results moving MAX_DISCRIMINATORS back to 15; doesn't
really matter; leaving it alone.
2002-09-05 10:02 tim_one
* classifier.py (1.9):
A now-rare pure win, changing spamprob() to work harder to find more
evidence when competing 0.01 and 0.99 clues appear. Before in the left
column, after in the right:
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.025 0.025 tied
0.025 0.025 tied
0.050 0.050 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.075 0.075 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.075 0.025 won
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.050 tied
won 1 times
tied 19 times
lost 0 times
total unique fp went from 9 to 7
false negative percentages
0.909 0.764 won
0.800 0.691 won
1.091 0.981 won
1.381 1.309 won
1.491 1.418 won
1.055 0.873 won
0.945 0.800 won
1.236 1.163 won
1.564 1.491 won
1.200 1.200 tied
1.454 1.381 won
1.599 1.454 won
1.236 1.164 won
0.800 0.655 won
0.836 0.655 won
1.236 1.163 won
1.236 1.200 won
1.055 0.982 won
1.127 0.982 won
1.381 1.236 won
won 19 times
tied 1 times
lost 0 times
total unique fn went from 284 to 260
2002-09-04 11:21 tim_one
* timtest.py (1.15):
Augmented the spam callback to display spams with low probability.
2002-09-04 09:53 tim_one
* Tester.py (1.3), timtest.py (1.14):
Added support for simple histograms of the probability distributions for
ham and spam.
2002-09-03 12:13 tim_one
* timtest.py (1.13):
A reluctant "on principle" change no matter what it does to the stats:
take a stab at removing HTML decorations from plain text msgs. See
comments for why it's *only* in plain text msgs. This puts an end to
false positives due to text msgs talking *about* HTML. Surprisingly, it
also gets rid of some false negatives. Not surprisingly, it introduced
another small class of false positives due to the dumbass regexp trick
used to approximate HTML tag removal removing pieces of text that had
nothing to do with HTML tags (e.g., this happened in the middle of a
uuencoded .py file in such a way that it just happened to leave behind
a string that "looked like" a spam phrase; but before this it looked
like a pile of "too long" lines that didn't generate any tokens --
it's a nonsense outcome either way).
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.025 0.025 tied
0.025 0.025 tied
0.050 0.050 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.025 lost
0.075 0.075 tied
0.050 0.025 won
0.025 0.025 tied
0.000 0.025 lost
0.050 0.075 lost
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.050 tied
won 1 times
tied 16 times
lost 3 times
total unique fp went from 8 to 9
false negative percentages
0.945 0.909 won
0.836 0.800 won
1.200 1.091 won
1.418 1.381 won
1.455 1.491 lost
1.091 1.055 won
1.091 0.945 won
1.236 1.236 tied
1.564 1.564 tied
1.236 1.200 won
1.563 1.454 won
1.563 1.599 lost
1.236 1.236 tied
0.836 0.800 won
0.873 0.836 won
1.236 1.236 tied
1.273 1.236 won
1.018 1.055 lost
1.091 1.127 lost
1.490 1.381 won
won 12 times
tied 4 times
lost 4 times
total unique fn went from 292 to 284
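The "dumbass regexp trick" mentioned in the entry above is, in effect, a one-line substitution. A minimal sketch of the idea (the exact pattern used in timtest.py may differ), which also shows why it can eat text that has nothing to do with HTML:

```python
import re

# Naive approximation of HTML tag removal: anything between < and >
# is treated as a tag and replaced with a space.  This is exactly the
# kind of trick that can swallow non-HTML text, e.g. "x < y and y > z"
# loses everything between the comparison operators.
html_re = re.compile(r"<[^>]*>")

def strip_html(text):
    return html_re.sub(" ", text)

print(strip_html("Click <a href='x'>here</a> now"))
```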
2002-09-03 06:57 tim_one
* classifier.py (1.8):
Added a new xspamprob() method, which computes the combined probability
"correctly", and a long comment block explaining what happened when I
tried it. There's something worth pursuing here (it greatly improves
the false negative rate), but this change alone pushes too many marginal
hams into the spam camp.
2002-09-03 05:23 tim_one
* timtest.py (1.12):
Made "skip:" tokens shorter.
Added a surprising treatment of Organization headers, with a tiny f-n
benefit for a tiny cost. No change in f-p stats.
false negative percentages
1.091 0.945 won
0.945 0.836 won
1.236 1.200 won
1.454 1.418 won
1.491 1.455 won
1.091 1.091 tied
1.127 1.091 won
1.236 1.236 tied
1.636 1.564 won
1.345 1.236 won
1.672 1.563 won
1.599 1.563 won
1.236 1.236 tied
0.836 0.836 tied
1.018 0.873 won
1.236 1.236 tied
1.273 1.273 tied
1.055 1.018 won
1.091 1.091 tied
1.527 1.490 won
won 13 times
tied 7 times
lost 0 times
total unique fn went from 302 to 292
2002-09-03 02:18 tim_one
* timtest.py (1.11):
tokenize_word(): dropped the prefix from the signature; it's faster
to let the caller do it, and this also repaired a bug in one place it
was being used (well, a *conceptual* bug anyway, in that the code didn't
do what I intended there). This changes the stats in an insignificant
way. The f-p stats didn't change. The f-n stats shifted by one message
in a few cases:
false negative percentages
1.091 1.091 tied
0.945 0.945 tied
1.200 1.236 lost
1.454 1.454 tied
1.491 1.491 tied
1.091 1.091 tied
1.091 1.127 lost
1.236 1.236 tied
1.636 1.636 tied
1.382 1.345 won
1.636 1.672 lost
1.599 1.599 tied
1.236 1.236 tied
0.836 0.836 tied
1.018 1.018 tied
1.236 1.236 tied
1.273 1.273 tied
1.055 1.055 tied
1.091 1.091 tied
1.527 1.527 tied
won 1 times
tied 16 times
lost 3 times
total unique unchanged
2002-09-02 19:30 tim_one
* timtest.py (1.10):
Don't ask me why this helps -- I don't really know! When skipping "long
words", generating a token with a brief hint about what and how much got
skipped makes a definite improvement in the f-n rate, and doesn't affect
the f-p rate at all. Since experiment said it's a winner, I'm checking
it in. Before (left column) and after (right column):
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.025 0.025 tied
0.025 0.025 tied
0.050 0.050 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.075 0.075 tied
0.050 0.050 tied
0.025 0.025 tied
0.000 0.000 tied
0.050 0.050 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.050 tied
won 0 times
tied 20 times
lost 0 times
total unique fp went from 8 to 8
false negative percentages
1.236 1.091 won
1.164 0.945 won
1.454 1.200 won
1.599 1.454 won
1.527 1.491 won
1.236 1.091 won
1.163 1.091 won
1.309 1.236 won
1.891 1.636 won
1.418 1.382 won
1.745 1.636 won
1.708 1.599 won
1.491 1.236 won
0.836 0.836 tied
1.091 1.018 won
1.309 1.236 won
1.491 1.273 won
1.127 1.055 won
1.309 1.091 won
1.636 1.527 won
won 19 times
tied 1 times
lost 0 times
total unique fn went from 336 to 302
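The "skip" token described in the entry above might look something like the following sketch; the token format here (first character plus length rounded down to a multiple of 10) is a hypothetical reconstruction, and the real format in timtest.py may differ:

```python
def tokenize_word_sketch(word, maxlen=12):
    # For "too long" words, emit a brief hint token instead of nothing:
    # the word's first character plus its length rounded down to a
    # multiple of 10.  (Hypothetical reconstruction of the scheme.)
    if len(word) <= maxlen:
        yield word
    else:
        yield "skip:%c %d" % (word[0], len(word) // 10 * 10)

print(list(tokenize_word_sketch("supercalifragilistic")))
```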
2002-09-02 17:55 tim_one
* timtest.py (1.9):
Some comment changes and nesting reduction.
2002-09-02 11:18 tim_one
* timtest.py (1.8):
Fixed some out-of-date comments.
Made URL clumping lumpier: now distinguishes among just "first field",
"second field", and "everything else".
Changed tag names for email address fields (semantically neutral).
Added "From:" line tagging.
These add up to an almost pure win. Before-and-after f-n rates across 20
runs:
1.418 1.236
1.309 1.164
1.636 1.454
1.854 1.599
1.745 1.527
1.418 1.236
1.381 1.163
1.418 1.309
2.109 1.891
1.491 1.418
1.854 1.745
1.890 1.708
1.818 1.491
1.055 0.836
1.164 1.091
1.599 1.309
1.600 1.491
1.127 1.127
1.164 1.309
1.781 1.636
It only increased in one run. The variance appears to have been reduced
too (I didn't bother to compute that, though).
Before-and-after f-p rates across 20 runs:
0.000 0.000
0.000 0.000
0.075 0.050
0.000 0.000
0.025 0.025
0.050 0.025
0.075 0.050
0.025 0.025
0.025 0.025
0.025 0.000
0.100 0.075
0.050 0.050
0.025 0.025
0.000 0.000
0.075 0.050
0.025 0.025
0.025 0.025
0.000 0.000
0.075 0.025
0.100 0.050
Note that 0.025% is a single message; it's really impossible to *measure*
an improvement in the f-p rate anymore with 4000-msg ham sets.
Across all 20 runs,
the total # of unique f-n fell from 353 to 336
the total # of unique f-p fell from 13 to 8
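The "lumpier" URL clumping described above (distinguishing the first field, the second field, and everything else) could be sketched roughly as follows; the token names and the exact regexp are assumptions, not the project's actual code:

```python
import re

url_re = re.compile(r"(https?)://([^\s/]+)", re.IGNORECASE)

def tokenize_url_sketch(text):
    # Hypothetical sketch: tag the protocol, then the first and second
    # host fields individually, lumping all remaining fields together.
    for proto, host in url_re.findall(text):
        yield "proto:" + proto.lower()
        for i, field in enumerate(host.lower().split(".")):
            tag = "url%d" % i if i < 2 else "url>1"
            yield tag + ":" + field

print(list(tokenize_url_sketch("see https://mail.python.org/x")))
```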
2002-09-02 10:06 tim_one
* timtest.py (1.7):
A number of changes. The most significant is paying attention to the
Subject line (I was wrong before when I said my c.l.py ham corpus was
unusable for this due to Mailman-injected decorations). In all, across
my 20 test runs,
the total # of unique false positives fell from 23 to 13
the total # of unique false negatives rose from 337 to 353
Neither result is statistically significant, although I bet the first
one would be if I pissed away a few days trying to come up with a more
realistic model for what "stat. sig." means here.
2002-09-01 17:22 tim_one
* classifier.py (1.7):
Added a comment block about HAMBIAS experiments. There's no clearer
example of trading off precision against recall, and you can favor either
at the expense of the other to any degree you like by fiddling this knob.
2002-09-01 14:42 tim_one
* timtest.py (1.6):
Long new comment block summarizing all my experiments with character
n-grams. Bottom line is that they have nothing going for them, and a
lot going against them, under Graham's scheme. I believe there may
still be a place for them in *part* of a word-based tokenizer, though.
2002-09-01 10:05 tim_one
* classifier.py (1.6):
spamprob(): Never count unique words more than once anymore. Counting
up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
that's now a small drag instead.
2002-09-01 07:33 tim_one
* rebal.py (1.3), timtest.py (1.5):
Folding case is here to stay. Read the new comments for why. This may
be a bad idea for other languages, though.
Refined the embedded-URL tagging scheme. Curious: as a protocol,
http is spam-neutral, but https is a strong spam indicator. That
surprised me.
2002-09-01 06:47 tim_one
* classifier.py (1.5):
spamprob(): Removed useless check that wordstream isn't empty. For one
thing, it didn't work, since wordstream is often an iterator. Even if
it did work, it isn't needed -- the probability of an empty wordstream
gets computed as 0.5 based on the total absence of evidence.
2002-09-01 05:37 tim_one
* timtest.py (1.4):
textparts(): Worm around what feels like a bug in msg.walk() (Barry has
details).
2002-09-01 05:09 tim_one
* rebal.py (1.2):
Aha! Staring at the checkin msg revealed a logic bug that explains why
my ham directories sometimes remained unbalanced after running this --
if the randomly selected reservoir msg turned out to be spam, it wasn't
pushing the too-small directory on the stack again.
2002-09-01 04:56 tim_one
* timtest.py (1.3):
textparts(): This was failing to weed out redundant HTML in cases like
this:
multipart/alternative
text/plain
multipart/related
text/html
The tokenizer here also transforms everything to lowercase, but that's
an accident due simply to the fact that I'm testing that now. Can't say for
sure until the test runs end, but so far it looks like a bad idea for
the false positive rate.
2002-09-01 04:52 tim_one
* rebal.py (1.1):
A little script I use to rebalance the ham corpora after deleting what
turns out to be spam. I have another Ham/reservoir directory with a
few thousand randomly selected msgs from the presumably-good archive.
These aren't used in scoring or training. This script marches over all
the ham corpora directories that are used, and if any have gotten too
big (this never happens anymore) deletes msgs at random from them, and
if any have gotten too small plugs the holes by moving in random
msgs from the reservoir.
2002-09-01 03:25 tim_one
* classifier.py (1.4), timtest.py (1.2):
Boost UNKNOWN_SPAMPROB.
# The spam probability assigned to words never seen before. Graham used
# 0.2 here. Neil Schemenauer reported that 0.5 seemed to work better. In
# Tim's content-only tests (no headers), boosting to 0.5 cut the false
# negative rate by over 1/3. The f-p rate increased, but there were so few
# f-ps that the increase wasn't statistically significant. It also caught
# 13 more spams erroneously classified as ham. By eyeball (and common
# sense), this has most effect on very short messages, where there
# simply aren't many high-value words. A word with prob 0.5 is (in effect)
# completely ignored by spamprob(), in favor of *any* word with *any* prob
# differing from 0.5. At 0.2, an unknown word favors ham at the expense
# of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
# on the face of it.
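The claim in that comment block (a word with prob 0.5 is completely ignored in favor of any word with any other prob) follows from how the scheme picks its strongest clues and combines them. A minimal sketch of the idea, not the project's exact implementation:

```python
def spamprob_sketch(probs, max_discriminators=15):
    # Keep the clues whose probabilities lie farthest from the neutral
    # 0.5, then combine them with Graham's formula prod/(prod + inv).
    # A 0.5 clue sorts last, so any other clue crowds it out; even when
    # kept, it multiplies both products equally and changes nothing.
    clues = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)
    prod = inv = 1.0
    for p in clues[:max_discriminators]:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)

# Competing 0.01 and 0.99 clues cancel exactly; the unknown-word 0.5
# contributes nothing either way.
print(spamprob_sketch([0.99, 0.5, 0.01]))
```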
2002-08-31 16:50 tim_one
* timtest.py (1.1):
This is a driver I've been using for test runs. It's specific to my
corpus directories, but has useful stuff in it all the same.
2002-08-31 16:49 tim_one
* classifier.py (1.3):
The explanation for these changes was on Python-Dev. You'll find out
why if the moderator approves the msg.
2002-08-29 07:04 tim_one
* Tester.py (1.2), classifier.py (1.2):
Tester.py: Repaired a comment. The false_{positive,negative}_rate()
functions return a percentage now (e.g., 1.0 instead of 0.01 -- it's
too hard to get motivated to reduce 0.01 <0.1 wink>).
GrahamBayes.spamprob: New optional bool argument; when true, a list of
the 15 strongest (word, probability) pairs is returned as well as the
overall probability (this is how to find out why a message scored as it
did).
2002-08-28 13:45 montanaro
* GBayes.py (1.15):
ehh - it actually didn't work all that well. the spurious report that it
did well was pilot error. besides, tim's report suggests that a simple
str.split() may be the best tokenizer anyway.
2002-08-28 10:45 montanaro
* setup.py (1.1):
trivial little setup.py file - i don't expect most people will be interested
in this, but it makes it a tad simpler to work with now that there are two
files
2002-08-28 10:43 montanaro
* GBayes.py (1.14):
add simple trigram tokenizer - this seems to yield the best results I've
seen so far (but has not been extensively tested)
2002-08-28 08:10 tim_one
* Tester.py (1.1):
A start at a testing class. There isn't a lot here, but it automates
much of the tedium, and as the doctest shows it can already do
useful things, like remembering which inputs were misclassified.
2002-08-27 06:45 tim_one
* mboxcount.py (1.5):
Updated stats to what Barry and I both get now. Fiddled output.
2002-08-27 05:09 bwarsaw
* split.py (1.5), splitn.py (1.2):
_factory(): Return the empty string instead of None in the except
clauses, so that for-loops won't break prematurely. mailbox.py's base
class defines an __iter__() that raises a StopIteration on None
return.
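The failure mode behind that fix can be mimicked with a plain generator; this is an illustrative stand-in for mailbox.py's old __iter__(), not its actual code:

```python
def messages(factory_results):
    # Mimics the old mailbox.py __iter__: a None from the factory ends
    # iteration, so one unparseable message in the middle silently
    # truncates the loop.  Returning "" instead keeps the loop alive.
    for item in factory_results:
        if item is None:
            return          # old behavior: StopIteration on None
        yield item

bad = list(messages(["msg1", None, "msg2"]))   # loop breaks early
good = list(messages(["msg1", "", "msg2"]))    # empty string sails through
print(bad, good)
```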
2002-08-27 04:55 tim_one
* GBayes.py (1.13), mboxcount.py (1.4):
Whitespace normalization (and some ambiguous tabs snuck into mboxcount).
2002-08-27 04:40 bwarsaw
* mboxcount.py (1.3):
Some stats after splitting b/w good messages and unparseable messages
2002-08-27 04:23 bwarsaw
* mboxcount.py (1.2):
_factory(): Use a marker object to distinguish between good messages and
unparseable messages. For some reason, returning None from the except
clause in _factory() caused Python 2.2.1 to exit early out of the for
loop.
main(): Print statistics about both the number of good messages and
the number of unparseable messages.
2002-08-27 03:06 tim_one
* cleanarch (1.2):
"From " is a header more than a separator, so don't bump the msg count
at the end.
2002-08-24 01:42 tim_one
* GBayes.py (1.12), classifier.py (1.1):
Moved all the interesting code that was in the *original* GBayes.py into
a new classifier.py. It was designed to have a very clean interface,
and there's no reason to keep slamming everything into one file. The
ever-growing tokenizer stuff should probably also be split out, leaving
GBayes.py a pure driver.
Also repaired _test() (Skip's checkin left it without a binding for
the tokenize function).
2002-08-24 01:17 tim_one
* splitn.py (1.1):
Utility to split an mbox into N random pieces in one gulp. This gives
a convenient way to break a giant corpus into multiple files that can
then be used independently across multiple training and testing runs.
It's important to do multiple runs on different random samples to avoid
drawing conclusions based on accidents in a single random training corpus;
if the algorithm is robust, it should have similar performance across
all runs.
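The one-gulp split described above amounts to assigning each message to a random bucket in a single pass. A self-contained sketch of the idea (splitn.py itself works on mbox files, and its details may differ):

```python
import random

def splitn_sketch(msgs, n, seed=0):
    # Assign each message to one of n buckets uniformly at random, in a
    # single pass.  A fixed seed makes the sketch reproducible; the real
    # script presumably just uses the default random state.
    rng = random.Random(seed)
    buckets = [[] for _ in range(n)]
    for msg in msgs:
        buckets[rng.randrange(n)].append(msg)
    return buckets

buckets = splitn_sketch(range(100), 4)
print([len(b) for b in buckets])
```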
2002-08-24 00:25 montanaro
* GBayes.py (1.11):
Allow command line specification of tokenize functions
run w/ -t flag to override default tokenize function
run w/ -H flag to see list of tokenize functions
When adding a new tokenizer, make docstring a short description and add a
key/value pair to the tokenizers dict. The key is what the user specifies.
The value is a tokenize function.
Added two new tokenizers - tokenize_wordpairs_foldcase and
tokenize_words_and_pairs. It's not obvious that either is better than any
of the preexisting functions.
Should probably add info to the pickle which indicates the tokenizing
function used to build it. This could then be the default for spam
detection runs.
Next step is to drive this with spam/non-spam corpora, selecting each of the
various tokenizer functions, and presenting the results in tabular form.
2002-08-23 13:10 tim_one
* GBayes.py (1.10):
spamprob(): Commented some subtleties.
clearjunk(): Undid Guido's attempt to space-optimize this. The problem
is that you can't delete entries from a dict that's being crawled over
by .iteritems(), which is why I (I suddenly recall) materialized a
list of words to be deleted the first time I wrote this. It's a lot
better to materialize a list of to-be-deleted words than to materialize
the entire database in a dict.items() list.
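That subtlety still bites today: mutating a dict while iterating over it raises a RuntimeError, so the to-be-deleted keys must be materialized first. A minimal sketch of the pattern (the names here are illustrative, not GBayes.py's actual signatures):

```python
def clearjunk_sketch(wordinfo, is_junk):
    # Deleting entries from a dict that's being iterated over raises
    # RuntimeError, so materialize just the to-be-deleted keys first.
    # That's far cheaper than materializing the whole .items() list.
    tonuke = [w for w, info in wordinfo.items() if is_junk(info)]
    for w in tonuke:
        del wordinfo[w]

d = {"spam": 1, "ham": 30, "eggs": 2}
clearjunk_sketch(d, lambda count: count < 10)
print(d)
```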
2002-08-23 12:36 tim_one
* mboxcount.py (1.1):
Utility to count and display the # of msgs in (one or more) Unix mboxes.
2002-08-23 12:11 tim_one
* split.py (1.4):
Open files in binary mode. Else, e.g., about 400MB of Barry's python-list
corpus vanishes on Windows. Also use file.write() instead of print>>, as
the latter invents an extra newline.
2002-08-22 07:01 tim_one
* GBayes.py (1.9):
Renamed "modtime" to "atime", to better reflect its meaning, and added a
comment block to explain that better.
2002-08-21 08:07 bwarsaw
* split.py (1.3):
Guido suggests a different order for the positional args.
2002-08-21 07:37 bwarsaw
* split.py (1.2):
Get rid of the -1 and -2 arguments and make them positional.
2002-08-21 07:18 bwarsaw
* split.py (1.1):
A simple mailbox splitter
2002-08-21 06:42 tim_one
* GBayes.py (1.8):
Added a bunch of simple tokenizers. The originals are renamed to
tokenize_words_foldcase and tokenize_5gram_foldcase_wscollapse.
New ones are tokenize_words, tokenize_split_foldcase, tokenize_split,
tokenize_5gram, tokenize_10gram, and tokenize_15gram. I don't expect
any of these to be the last word. When Barry has the test corpus
set up it should be easy to let the data tell us which "pure" strategy
works best. Straight character n-grams are very appealing because
they're the simplest and most language-neutral; I didn't have any luck
with them over the weekend, but the size of my training data was
trivial.
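A straight character 5-gram tokenizer of the kind named above is only a few lines; this sketch shows the idea (the renamed originals also fold case and collapse whitespace, which is omitted here):

```python
def tokenize_5gram(text):
    # Straight character 5-grams: every run of 5 adjacent characters
    # becomes a token.  Simple and language-neutral, but it generates
    # many tokens per word and nothing for text shorter than 5 chars.
    for i in range(len(text) - 4):
        yield text[i:i + 5]

print(list(tokenize_5gram("viagra")))
```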
2002-08-21 05:08 bwarsaw
* cleanarch (1.1):
An archive cleaner, adapted from the Mailman 2.1b3 version, but
de-Mailman-ified.
2002-08-21 04:44 gvanrossum
* GBayes.py (1.7):
Indent repair in clearjunk().
2002-08-21 04:22 gvanrossum
* GBayes.py (1.6):
Some minor cleanup:
- Move the identifying comment to the top, clarify it a bit, and add
author info.
- There's no reason for _time and _heapreplace to be hidden names;
change these back to time and heapreplace.
- Rename main1() to _test() and main2() to main(); when main() sees
there are no options or arguments, it runs _test().
- Get rid of a list comprehension from clearjunk().
- Put wordinfo.get as a local variable in _add_msg().
2002-08-20 15:16 tim_one
* GBayes.py (1.5):
Neutral typo repairs, except that clearjunk() has a better chance of
not blowing up immediately now.
2002-08-20 13:49 montanaro
* GBayes.py (1.4):
help make it more easily executable... ;-)
2002-08-20 09:32 bwarsaw
* GBayes.py (1.3):
Lots of hacks great and small to the main() program, but I didn't
touch the guts of the algorithm.
Added a module docstring/usage message.
Added a bunch of switches to train the system on an mbox of known good
and known spam messages (using PortableUnixMailbox only for now).
Uses the email package but does no decoding of message bodies. Also,
allows you to specify a file for pickling the training data, and for
setting a threshold, above which messages get an X-Bayes-Score
header. Also output messages (marked and unmarked) to an output file
for retraining.
Print some statistics at the end.
2002-08-20 05:43 tim_one
* GBayes.py (1.2):
Turned off debugging vrbl mistakenly checked in at True.
unlearn(): Gave this an update_probabilities=True default arg, for
symmetry with learn().
2002-08-20 03:33 tim_one
* GBayes.py (1.1):
An implementation of Paul Graham's Bayes-like spam classifier.