merge the spam filter docs

6d9329b7 · anarcat · d52f2f3c · 6d9329b7
Verified Commit 6d9329b7 authored 4 years ago by anarcat
--- a/howto/rt.md
+++ b/howto/rt.md
@@ -303,18 +303,6 @@ RT articles are dumped into text files and then pushed to the [rt-articles Git r

 The machinery is [spread through several scripts](https://gitweb.torproject.org/support-tools.git/tree/HEAD:/rt-articles). The one run on `rude` is [dump_rt_articles](https://gitweb.torproject.org/support-tools.git/blob/HEAD:/rt-articles/dump_rt_articles), and it runs every day through a _cronjob_ as user `colin`.

-## Spam training
-
-Every mail sent to RT is also sent to the `rtmailarchive` account. This is required to be able to train SpamAssassin as it can only learn from unaltered email messages.
-
-A three steps _cronjob_ is run daily.
-
-[Step 1](https://gitweb.torproject.org/support-tools.git/blob/HEAD:/train-spam-filters/train_spam_filters): Every mail in `Maildir/.help*` is checked against the RT. For each message, we look up a matching ticket using the Message-Id header. If the ticket is in a `help*` queue and has status `resolved`, we move it to the ham training folder. If the ticket in in the `spam` queue and has status `resolved`, we move it to the spam training folder. If the file is more than 100 days old, we delete it.
-
-Step 2: SpamAssassin is fed with the content of the ham and spam training folder. After the process, the message is moved to the corresponding `learned` folder.
-
-Step 3: Message in the `learned` folders are deleteed.
-
 # Discussion

 ## Spam filter training design
@@ -329,140 +317,16 @@ this:

    /srv/rtstuff/support-tools/train-spam-filters/train_spam_filters && bin/spam-learn && find Maildir/.spam.learned Maildir/.xham.learned -type f -delete

-The first part is the following Python script (from rude):
+The [train_spam_filters](https://gitweb.torproject.org/support-tools.git/tree/train-spam-filters/train_spam_filters) script basically does this:

-    #!/usr/bin/python
-    #
-    # This program is free software. It comes without any warranty, to
-    # the extent permitted by applicable law. You can redistribute it
-    # and/or modify it under the terms of the Do What The Fuck You Want
-    # To Public License, Version 2, as published by Sam Hocevar. See
-    # http://sam.zoy.org/wtfpl/COPYING for more details.
-    
-    from __future__ import print_function
-    
-    import email.parser
-    import psycopg2
-    import os
-    import os.path
-    from datetime import datetime, timedelta
-    
-    DEBUG = False
-    
-    MAILDIR_ROOT = os.path.join(os.environ['HOME'], 'Maildir')
-    SPAM_MAILDIR = '.spam.learn'
-    HAM_MAILDIR = '.xham.learn'
-    
-    KEEP_FOR_MAX_DAYS = 100
-    
-    RT_CONNINFO = "host=localhost sslmode=require user=rtreader dbname=rt"
-    
-    SELECT_HAM_TICKET_QUERY = """
-        SELECT DISTINCT Tickets.Id
-          FROM Queues, Tickets, Transactions
-               LEFT OUTER JOIN Attachments ON Attachments.TransactionId = Transactions.Id
-         WHERE Queues.Name LIKE 'help%%'
-           AND Tickets.Queue = Queues.Id
-           AND Tickets.Status = 'resolved'
-           AND Transactions.ObjectId = Tickets.Id
-           AND Transactions.ObjectType = 'RT::Ticket'
-           AND Attachments.MessageId = %s;
-    """
-    
-    SELECT_SPAM_TICKET_QUERY = """
-        SELECT DISTINCT Tickets.Id
-          FROM Queues, Tickets, Transactions
-               LEFT OUTER JOIN Attachments ON Attachments.TransactionId = Transactions.Id
-         WHERE Queues.Name = 'spam'
-           AND Tickets.Queue = Queues.Id
-           AND Tickets.Status = 'rejected'
-           AND Transactions.ObjectId = Tickets.Id
-           AND Transactions.ObjectType = 'RT::Ticket'
-           AND Attachments.MessageId = %s;
-    """
-    
-    EMAIL_PARSER = email.parser.Parser()
-    
-    if DEBUG:
-        def log(msg):
-            print(msg)
-    else:
-        def log(msg):
-            pass
-    
-    def is_ham(msg_id):
-        global con
-    
-        cur = con.cursor()
-        try:
-            cur.execute(SELECT_HAM_TICKET_QUERY, (msg_id,))
-            return cur.fetchone() is not None
-        finally:
-            cur.close()
-    
-    def is_spam(msg_id):
-        global con
-    
-        cur = con.cursor()
-        try:
-            cur.execute(SELECT_SPAM_TICKET_QUERY, (msg_id,))
-            return cur.fetchone() is not None
-        finally:
-            cur.close()
-    
-    def handle_message(path):
-        msg = EMAIL_PARSER.parse(open(path), headersonly=True)
-        msg_id = msg['Message-Id']
-        if not msg_id.startswith('<') or not msg_id.endswith('>'):
-            log("%s: bad Message-Id, removing." % path)
-            os.unlink(path)
-            return
-        msg_id = msg_id[1:-1]
-        if is_ham(msg_id):
-            os.rename(path, os.path.join(MAILDIR_ROOT, HAM_MAILDIR, 'cur', os.path.basename(path)))
-            log("%s: ham, moving." % path)
-            return
-        if is_spam(msg_id):
-            os.rename(path, os.path.join(MAILDIR_ROOT, SPAM_MAILDIR, 'cur', os.path.basename(path)))
-            log("%s: spam, moving." % path)
-            return
-        mtime = datetime.fromtimestamp(os.stat(path).st_mtime)
-        limit = datetime.now() - timedelta(days=KEEP_FOR_MAX_DAYS)
-        if mtime <= limit:
-            log("%s: too old, removing." % path)
-            os.unlink(path)
-            return
-        # well, it's not identified ham, not identified spam, and not too old
-        # let's keep the message for now
-        log("%s: unknown, keeping." % path)
-    
-    def scan_directory(dir_path):
-        for filename in os.listdir(dir_path):
-            path = os.path.join(dir_path, filename)
-            handle_message(path)
-            
-    con = None
-    
-    if __name__ == '__main__':
-        con = psycopg2.connect(RT_CONNINFO)
-        for filename in os.listdir(MAILDIR_ROOT):
-            if filename.startswith('.help'):
-                for subdir in ['new', 'cur', 'tmp']:
-                    scan_directory(os.path.join(MAILDIR_ROOT, filename, subdir))
-        con.close()
-
-It is unclear if this program was written for TPO or if it comes from
-elsewhere. It is included here for external reference but might have
-changed since this documentation was written. What it does is,
-basically:
-
- 1. for each mail in the archive
+ 1. for each mail in the `Maildir/.help*` archive
 2. find its Message-Id header
 3. load the equivalent message from RT:
    * if it is in the Spam queue, marked as "Rejected", it is spam.
    * if it is in a help-* queue, marked as "Resolved", it is ham.
- 4. move the email in the right directory mail folder (.spam.learn,
-    .xham.learn) depending on status
+ 4. move the email to the matching mail folder (`.spam.learn` or
+    `.xham.learn`) depending on status
+ 5. otherwise, if the file is more than 100 days old, delete it.
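
The steps above can be sketched as a single decision function. This is a minimal illustration, not the actual script: the `classify` name and its parameters are hypothetical, and the RT database lookup by Message-Id is assumed to have already produced a queue name and ticket status (or `None` when no ticket matches).

```python
from datetime import datetime, timedelta

# Retention period, matching the 100 days mentioned in the steps above.
KEEP_FOR_MAX_DAYS = 100


def classify(queue, status, mtime, now):
    """Decide what to do with one archived mail.

    queue, status: hypothetical result of the RT lookup by Message-Id
                   (None when no ticket matches).
    mtime, now:    datetime objects (file modification time, current time).
    Returns 'ham', 'spam', 'delete' or 'keep'.
    """
    # Resolved tickets in a help* queue are ham.
    if queue is not None and queue.startswith("help") and status == "resolved":
        return "ham"
    # Rejected tickets in the spam queue are spam.
    if queue == "spam" and status == "rejected":
        return "spam"
    # Unclassified mail is dropped once it is older than the retention period.
    if mtime <= now - timedelta(days=KEEP_FOR_MAX_DAYS):
        return "delete"
    # Neither ham, spam, nor too old: leave it for a later run.
    return "keep"
```

A caller would move the file to `.xham.learn` on `'ham'`, to `.spam.learn` on `'spam'`, unlink it on `'delete'`, and leave it in place on `'keep'`.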

 Then the rest of the cron job continues. `spam-learn` is this shell
 script: