Verified commit 6d9329b7, authored by anarcat

merge the spam filter docs

parent d52f2f3c
......@@ -303,18 +303,6 @@ RT articles are dumped into text files and then pushed to the [rt-articles Git r
The machinery is [spread through several scripts](https://gitweb.torproject.org/support-tools.git/tree/HEAD:/rt-articles). The one run on `rude` is [dump_rt_articles](https://gitweb.torproject.org/support-tools.git/blob/HEAD:/rt-articles/dump_rt_articles), and it runs every day through a _cronjob_ as user `colin`.
## Spam training
Every mail sent to RT is also sent to the `rtmailarchive` account. This is required to be able to train SpamAssassin as it can only learn from unaltered email messages.
A three-step _cronjob_ runs daily.
[Step 1](https://gitweb.torproject.org/support-tools.git/blob/HEAD:/train-spam-filters/train_spam_filters): Every mail in `Maildir/.help*` is checked against RT. For each message, we look up a matching ticket using the Message-Id header. If the ticket is in a `help*` queue and has status `resolved`, we move the message to the ham training folder. If the ticket is in the `spam` queue and has status `rejected`, we move the message to the spam training folder. If no ticket matches and the file is more than 100 days old, we delete it.
Step 2: SpamAssassin is fed the content of the ham and spam training folders. Once a message has been learned, it is moved to the corresponding `learned` folder.
Step 3: Messages in the `learned` folders are deleted.
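As a minimal sketch of how the Message-Id lookup key from step 1 could be extracted, assuming Python 3 and only the standard `email` package: the helper name `extract_message_id` is hypothetical and is not part of the deployed script. It returns `None` for a missing or malformed header so the caller can decide what to do with such a message.

```python
from email.parser import HeaderParser


def extract_message_id(raw_message):
    """Return the Message-Id without angle brackets, or None if unusable.

    Hypothetical helper for illustration; the real script may differ.
    """
    # Parse the headers only; the body is not needed for the lookup.
    headers = HeaderParser().parsestr(raw_message)
    value = headers.get("Message-Id")  # header lookup is case-insensitive
    if value is None:
        return None
    value = str(value).strip()
    # A usable Message-Id is wrapped in angle brackets: <id@host>
    if len(value) < 3 or not (value.startswith("<") and value.endswith(">")):
        return None
    return value[1:-1]
```

The returned string is the form that would be compared against the `MessageId` column in RT's `Attachments` table.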
# Discussion
## Spam filter training design
......@@ -329,140 +317,16 @@ this:
/srv/rtstuff/support-tools/train-spam-filters/train_spam_filters && bin/spam-learn && find Maildir/.spam.learned Maildir/.xham.learned -type f -delete
The first part is the following Python script (from rude):
The [train_spam_filters](https://gitweb.torproject.org/support-tools.git/tree/train-spam-filters/train_spam_filters) script basically does this:
#!/usr/bin/python
#
# This program is free software. It comes without any warranty, to
# the extent permitted by applicable law. You can redistribute it
# and/or modify it under the terms of the Do What The Fuck You Want
# To Public License, Version 2, as published by Sam Hocevar. See
# http://sam.zoy.org/wtfpl/COPYING for more details.

from __future__ import print_function

import email.parser
import psycopg2
import os
import os.path
from datetime import datetime, timedelta

DEBUG = False

MAILDIR_ROOT = os.path.join(os.environ['HOME'], 'Maildir')
SPAM_MAILDIR = '.spam.learn'
HAM_MAILDIR = '.xham.learn'

KEEP_FOR_MAX_DAYS = 100

RT_CONNINFO = "host=localhost sslmode=require user=rtreader dbname=rt"

SELECT_HAM_TICKET_QUERY = """
    SELECT DISTINCT Tickets.Id
    FROM Queues, Tickets, Transactions
    LEFT OUTER JOIN Attachments ON Attachments.TransactionId = Transactions.Id
    WHERE Queues.Name LIKE 'help%%'
    AND Tickets.Queue = Queues.Id
    AND Tickets.Status = 'resolved'
    AND Transactions.ObjectId = Tickets.Id
    AND Transactions.ObjectType = 'RT::Ticket'
    AND Attachments.MessageId = %s;
"""

SELECT_SPAM_TICKET_QUERY = """
    SELECT DISTINCT Tickets.Id
    FROM Queues, Tickets, Transactions
    LEFT OUTER JOIN Attachments ON Attachments.TransactionId = Transactions.Id
    WHERE Queues.Name = 'spam'
    AND Tickets.Queue = Queues.Id
    AND Tickets.Status = 'rejected'
    AND Transactions.ObjectId = Tickets.Id
    AND Transactions.ObjectType = 'RT::Ticket'
    AND Attachments.MessageId = %s;
"""

EMAIL_PARSER = email.parser.Parser()

if DEBUG:
    def log(msg):
        print(msg)
else:
    def log(msg):
        pass

def is_ham(msg_id):
    global con
    cur = con.cursor()
    try:
        cur.execute(SELECT_HAM_TICKET_QUERY, (msg_id,))
        return cur.fetchone() is not None
    finally:
        cur.close()

def is_spam(msg_id):
    global con
    cur = con.cursor()
    try:
        cur.execute(SELECT_SPAM_TICKET_QUERY, (msg_id,))
        return cur.fetchone() is not None
    finally:
        cur.close()

def handle_message(path):
    msg = EMAIL_PARSER.parse(open(path), headersonly=True)
    msg_id = msg['Message-Id']
    if not msg_id.startswith('<') or not msg_id.endswith('>'):
        log("%s: bad Message-Id, removing." % path)
        os.unlink(path)
        return
    msg_id = msg_id[1:-1]
    if is_ham(msg_id):
        os.rename(path, os.path.join(MAILDIR_ROOT, HAM_MAILDIR, 'cur', os.path.basename(path)))
        log("%s: ham, moving." % path)
        return
    if is_spam(msg_id):
        os.rename(path, os.path.join(MAILDIR_ROOT, SPAM_MAILDIR, 'cur', os.path.basename(path)))
        log("%s: spam, moving." % path)
        return
    mtime = datetime.fromtimestamp(os.stat(path).st_mtime)
    limit = datetime.now() - timedelta(days=KEEP_FOR_MAX_DAYS)
    if mtime <= limit:
        log("%s: too old, removing." % path)
        os.unlink(path)
        return
    # well, it's not identified ham, not identified spam, and not too old
    # let's keep the message for now
    log("%s: unknown, keeping." % path)

def scan_directory(dir_path):
    for filename in os.listdir(dir_path):
        path = os.path.join(dir_path, filename)
        handle_message(path)

con = None

if __name__ == '__main__':
    con = psycopg2.connect(RT_CONNINFO)
    for filename in os.listdir(MAILDIR_ROOT):
        if filename.startswith('.help'):
            for subdir in ['new', 'cur', 'tmp']:
                scan_directory(os.path.join(MAILDIR_ROOT, filename, subdir))
    con.close()
It is unclear if this program was written for TPO or if it comes from
elsewhere. It is included here for external reference but might have
changed since this documentation was written. What it does is,
basically:
1. for each mail in the archive
1. for each mail in the `Maildir/.help*` archive
2. find its Message-Id header
3. load the equivalent message from RT:
* if it is in the Spam queue, marked as "Rejected", it is spam.
* if it is in a help-* queue, marked as "Resolved", it is ham.
4. move the email in the right directory mail folder (.spam.learn,
.xham.learn) depending on status
4. move the email in the right directory mail folder (`.spam.learn`,
`.xham.learn`) depending on status
5. if the file is more than 100 days old, delete it.
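
The steps above can be sketched as a pure decision function, which makes the order of the checks explicit without touching RT or the filesystem. This is an illustration only, assuming Python 3: the name `classify`, its parameters, and the returned labels are hypothetical and do not appear in the original script. The queue and status values mirror the SQL queries in the script (`help%` with `resolved`, `spam` with `rejected`).

```python
from datetime import datetime, timedelta

KEEP_FOR_MAX_DAYS = 100  # same retention period as described above


def classify(queue, status, mtime, now):
    """Decide what to do with one archived message.

    queue and status describe the matching RT ticket (None if no ticket
    was found); mtime is the file's modification time. Returns one of
    "ham", "spam", "delete" or "keep". Hypothetical helper.
    """
    if queue is not None and queue.startswith("help") and status == "resolved":
        return "ham"      # would be moved to .xham.learn
    if queue == "spam" and status == "rejected":
        return "spam"     # would be moved to .spam.learn
    if mtime <= now - timedelta(days=KEEP_FOR_MAX_DAYS):
        return "delete"   # unclassified and past the retention period
    return "keep"         # unclassified but still recent


now = datetime(2020, 1, 1)
print(classify("help-en", "resolved", now, now))
print(classify(None, None, now - timedelta(days=101), now))
```

Note that the age check only applies to messages that were not classified as ham or spam, which matches the order of the tests in the script.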
Then the rest of the cron job continues. `spam-learn` is this shell
script:
......