Verified Commit d961a32a authored by anarcat

move the history & design sections to the TPA wiki

We have service docs that are way more extensive, meaningful and
relevant than keeping those historical details in the project's README
file.

Plus it would be silly to keep those things duplicated between the two
places...
parent cfb4ad75
......@@ -173,93 +173,6 @@ idea](https://github.com/firstlookmedia/dangerzone/issues/110) which is still un
[dangerzone]: https://dangerzone.rocks/
# History
# History & Design
I was involved in the hiring of two new sysadmins at the Tor Project
in spring 2021. To avoid untrusted inputs (i.e. random PDF files from
the internet) being opened by the hiring committee, we had a tradition
of having someone sanitize those in a somewhat secure environment,
which was typically some Qubes user doing ... whatever it is Qubes
users do.
Then when a new hiring process started, people asked me to do it
again. At that stage, I had expected this to happen, so I partially
automated this as a [pull request against the dangerzone project](https://github.com/firstlookmedia/dangerzone-converter/pull/7),
which grew totally out of hand. The automation wasn't quite complete
though: I still had to upload the files to the sanitizing server, run
the script, copy the files back, and upload them into Nextcloud.
But by then people started to think I had magically and fully
automated the document sanitization routine (hint: not quite!), so I
figured it was important to realize that dream and complete the work
so that I didn't have to sit there manually copying files around.
# Design
This section discusses how this tool was built, more or less. It's
mostly relevant for historical reasons.
## Manual process
The partial automation process mentioned above was:
1. get emails in my regular tor inbox with attachments
2. wait a bit to have some accumulate
3. save them to my local hard drive, in a `dangerzone` folder
4. rsync that to a remote virtual machine
5. run a modified version of the [`dangerzone-converter`][] to save
files in a "`safe`" folder (see [batch-convert](https://github.com/anarcat/dangerzone-converter/blob/6a37f48dec67412c44f8814f122ea7977a658334/batch-convert.py) in [PR 7](https://github.com/firstlookmedia/dangerzone-converter/pull/7))
6. rsync the files back to my local computer
7. upload the files into some Nextcloud folder
[`dangerzone-converter`]: https://github.com/firstlookmedia/dangerzone-converter/
## New mechanism
This is more or less how the script works:
1. periodically check a Nextcloud (WebDAV) folder (called
`dangerzone`) for new files
2. when a file is found, move it to a `dangerzone/processing` folder
as an ad-hoc locking mechanism
3. download the file locally
4. process the file with the [`dangerzone-converter`][] container
5. on failure, delete the failed file locally, and move it to a
`dangerzone/rejected` folder remotely
6. on success, upload the sanitized file to a `safe/` folder, move
the original to `dangerzone/processed`
The only bits missing from the previous prototype were the WebDAV parts.
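The steps above can be sketched roughly as follows. This is only an illustration, not the actual script: `InMemoryFolder`, `process_once` and the `sanitize` callback are hypothetical names, the in-memory class stands in for a real WebDAV client, and `sanitize` stands in for running the converter container. The sketch also assumes `safe/` is a top-level folder, which the list above does not spell out.

```python
from typing import Callable, Dict, List


class InMemoryFolder:
    """Hypothetical stand-in for a WebDAV client: maps remote paths to bytes."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self.files = dict(files)

    def list_new(self, folder: str) -> List[str]:
        # Only direct children of the folder, not files in its subfolders.
        prefix = folder + "/"
        return sorted(
            p for p in self.files
            if p.startswith(prefix) and "/" not in p[len(prefix):]
        )

    def move(self, src: str, dst: str) -> None:
        self.files[dst] = self.files.pop(src)

    def download(self, path: str) -> bytes:
        return self.files[path]

    def upload(self, path: str, data: bytes) -> None:
        self.files[path] = data


def process_once(remote, sanitize: Callable[[bytes], bytes]) -> None:
    """Run one polling pass over the `dangerzone` folder (step 1)."""
    for path in remote.list_new("dangerzone"):
        name = path.rsplit("/", 1)[1]
        processing = f"dangerzone/processing/{name}"
        # Step 2: moving the file acts as an ad-hoc lock.
        remote.move(path, processing)
        # Step 3: fetch a local copy.
        data = remote.download(processing)
        try:
            # Step 4: stand-in for the converter container.
            clean = sanitize(data)
        except Exception:
            # Step 5: drop the local copy, keep the original under rejected.
            remote.move(processing, f"dangerzone/rejected/{name}")
            continue
        # Step 6: publish the sanitized copy and archive the original.
        remote.upload(f"safe/{name}", clean)
        remote.move(processing, f"dangerzone/processed/{name}")
```

Calling `process_once` from a timer or a loop with a sleep would give the periodic check of step 1. Because files are moved out of `dangerzone` before anything else happens, a second pass does not pick up a file that is already being processed.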
## Rejected process
This alternative process was also suggested:
1. candidates submit their resumes by email
2. the program gets a copy by email
3. the program sanitizes the attachment
4. the program assigns a unique ID and name for that user
(e.g. Candidate 10 Alice Doe)
5. the program uploads the sanitized attachment in a Nextcloud folder
named after the unique ID
My concern with that approach was that it exposes the sanitization
routines to the world, which opens the door to denial-of-service
attacks, at the very least. Someone could flood the disk by sending a
massive number of resumes, for example. I could also think of ZIP
bombs that could have "fun" consequences.
By putting a user between the world and the script, we have some
ad-hoc moderation that alleviates those issues, and also ensures a
human-readable, meaningful identity can be attached to each
submission (say: "this is Candidate 7 for job posting foo").
The above would also not work with resumes submitted through other
platforms (e.g. Indeed.com), unless an operator re-injects the resume,
which might make the unique ID creation harder (because the From will
be the operator, not the candidate).
This documentation has been moved to the [TPA service page](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/dangerzone).