From 61e50a6dccf3d2e32ebfa2ace3909122e1785249 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Wed, 20 May 2020 16:41:57 -0400
Subject: [PATCH] note issues with the script

---
 tsa/howto/rt.mdwn | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/tsa/howto/rt.mdwn b/tsa/howto/rt.mdwn
index 129918d5..9bddc45a 100644
--- a/tsa/howto/rt.mdwn
+++ b/tsa/howto/rt.mdwn
@@ -255,3 +255,28 @@ folder, moving it to `.spam.learned` or `.xham.learned` when done.
 
 Then, interestingly, those emails are destroyed. It's unclear why that
 is not done in the `spam-learn` step directly.
+
+### Possible improvements
+
+The above design has a few problems:
+
+ 1. it assumes "ham" queues are named "help-*" - but there are other
+    queues in the system
+ 2. it might be slow: if there are lots of emails to process, it will
+    do an SQL query for each and a move, and not all at once
+ 3. it is split over multiple shell scripts, not versioned
+
+I would recommend the following:
+
+ 1. reverse the logic of the queue checks: instead of checking for
+    folders and queues named `help-*`, check if the folders or queues
+    are *not* named `spam*` or `xham*`
+ 2. batch jobs: use a generator to yield Message-Id, then pick a
+    certain number of emails and batch-send them to psql and the
+    rename
+ 3. do all operations at once: look in psql, move the files in the
+    learning folder, and train, possibly in parallel, but at least all
+    in the same script
+ 4. sa-learn can read from a folder now, so there's no need for that
+    wrapper shell script in any case
+ 5. commit the script to version control and, even better, puppet
-- 
GitLab
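
Recommendations 1 and 2 in the patch (reversed name check, generator plus batching) could be sketched roughly as follows. This is a minimal illustration in Python, not the existing script: the helper names `is_candidate`, `message_ids` and `batched` are hypothetical, the folder is assumed to hold one message per file, and the psql query and the rename are deliberately left out.

```python
import itertools
from email.parser import BytesParser
from fnmatch import fnmatch
from pathlib import Path


def is_candidate(name):
    """Reversed check: accept any name that is NOT spam* or xham*."""
    return not (fnmatch(name, "spam*") or fnmatch(name, "xham*"))


def message_ids(folder):
    """Yield (path, Message-Id) for each message file found in folder.

    Assumes one message per file; files without a Message-Id are skipped.
    """
    parser = BytesParser()
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as fp:
            # Only the headers are needed to read the Message-Id.
            msg = parser.parse(fp, headersonly=True)
        msgid = msg["Message-Id"]
        if msgid:
            yield path, str(msgid).strip()


def batched(iterable, size):
    """Group items from an iterable into lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch
```

A caller would then loop over `batched(message_ids(folder), 100)` and issue one SQL query and one round of renames per batch instead of one per email; the batch size of 100 is an arbitrary placeholder.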