@@ -303,18 +303,6 @@ RT articles are dumped into text files and then pushed to the [rt-articles Git r
The machinery is [spread through several scripts](https://gitweb.torproject.org/support-tools.git/tree/HEAD:/rt-articles). The one that runs on `rude` is [dump_rt_articles](https://gitweb.torproject.org/support-tools.git/blob/HEAD:/rt-articles/dump_rt_articles); it runs every day from a _cronjob_ as user `colin`.
## Spam training
Every mail sent to RT is also sent to the `rtmailarchive` account. This is required to train SpamAssassin, which can only learn from unaltered email messages.
A three-step _cronjob_ runs daily.
[Step 1](https://gitweb.torproject.org/support-tools.git/blob/HEAD:/train-spam-filters/train_spam_filters): Every mail in `Maildir/.help*` is checked against RT. For each message, we look up a matching ticket using the Message-Id header. If the ticket is in a `help*` queue and has status `resolved`, we move the message to the ham training folder. If the ticket is in the `spam` queue and has status `resolved`, we move the message to the spam training folder. If the file is more than 100 days old, we delete it.
Step 2: SpamAssassin is fed the content of the ham and spam training folders. Once a message has been learned, it is moved to the corresponding `learned` folder.
Step 3: Messages in the `learned` folders are deleted.
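The decision made for each message in step 1 can be sketched as below. This is a minimal illustration, not the actual `train_spam_filters` code: the `Ticket` class, the `classify` function and the returned labels are hypothetical names. It also assumes that the age check only applies to messages that were not sorted into a training folder, which the description above does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional

# From the description above: files older than 100 days are deleted.
MAX_AGE_DAYS = 100


@dataclass
class Ticket:
    """Hypothetical stand-in for the RT ticket matched via Message-Id."""
    queue: str
    status: str


def classify(ticket: Optional[Ticket], age_days: float) -> str:
    """Return 'ham', 'spam', 'delete' or 'keep' for one archived message."""
    if ticket is not None and ticket.status == "resolved":
        if ticket.queue.startswith("help"):
            return "ham"   # resolved ticket in a help* queue
        if ticket.queue == "spam":
            return "spam"  # resolved ticket in the spam queue
    # Assumption: expiry is checked only when the message was not sorted.
    if age_days > MAX_AGE_DAYS:
        return "delete"
    return "keep"          # leave the message for a later run
```

A message whose ticket is still open, or that has no matching ticket yet, stays in place until the ticket is resolved or the message expires.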
# Discussion
## Spam filter training design
...
...
@@ -329,140 +317,16 @@ this:
/srv/rtstuff/support-tools/train-spam-filters/train_spam_filters && bin/spam-learn && find Maildir/.spam.learned Maildir/.xham.learned -type f -delete
The first part is the following Python script (from rude):
The [train_spam_filters](https://gitweb.torproject.org/support-tools.git/tree/train-spam-filters/train_spam_filters) script basically does this:
#!/usr/bin/python
#
# This program is free software. It comes without any warranty, to
# the extent permitted by applicable law. You can redistribute it
# and/or modify it under the terms of the Do What The Fuck You Want
# To Public License, Version 2, as published by Sam Hocevar. See
# http://sam.zoy.org/wtfpl/COPYING for more details.