The Dangerzone WebDAV processor is a program that processes files
on a WebDAV server with [dangerzone](https://dangerzone.rocks/) (only the "server-side" of it)
and renders sanitized versions of untrusted files.

It is used by the Tor Project to sanitize files sent during hiring
processes.

# Usage
*This section assumes some administrator already has the processor
set up under the `dangerzone-bot` user; see the Installation section
below for information on how to set that up.*

Say you are someone receiving resumes or whatever untrusted content
and you actually *need* to open those files because that's part of
your job. What do you do?
1. you make a folder in Nextcloud
2. you upload the untrusted file(s) into the folder
3. you share the folder with the `dangerzone-bot` user
4. after a short delay, the files *disappear* (gasp! do not worry:
   they are actually moved to the `dangerzone-processing/` folder)
5. then, after another delay, if "dangerzone" succeeded, sanitized
   files appear in a `safe/` folder and the original files are moved
   into a `dangerzone-processed/` folder, telling you they have been
   correctly processed (see the example layout below)
6. if that didn't work, they end up in `dangerzone-rejected/` and no
   new files appear in the `safe/` folder
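
For illustration, here is roughly what a share might look like after
one successful run. This is only an example: the folder names come
from the steps above, the file names are made up, and whether these
folders appear inside the shared folder or at the root of the bot's
account depends on how the processor was set up.

    shared-folder/
      dangerzone-processing/        (files currently being processed, normally empty)
      dangerzone-processed/
        resume-alice.pdf            (the original, untrusted file)
      dangerzone-rejected/          (files dangerzone could not convert)
      safe/
        resume-alice-safe.pdf       (the sanitized version)
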
# Installation
## Requirements
To deploy this on your infrastructure (or set up a development
environment), you will need the following:
* an account on a Nextcloud server (your own)
* (optionally) a second account on the Nextcloud server (for the bot)
* a running Docker server that you have access to (on Debian, that
means being part of the `docker` group), which can fetch (or have a
local copy of) the [dangerzone-converter](https://github.com/firstlookmedia/dangerzone-converter) image
* a `python3` command
* the [webdavclient][] Python library (`python3-webdavclient` in
Debian)
[webdavclient]: https://pypi.org/project/webdavclient/
## Nextcloud app password setup
Once you have all the above requirements, you need to set up an
account and an app password in Nextcloud, as follows.
This section is actually optional: you can use your own account with
its normal password, but that's not recommended; you can set up
application-specific passwords, and you should.
First, if you have admin access to Nextcloud, create a
`dangerzone-bot` account to process the content shared by other
users. It can be named something else, and you can use your own
account as well, but it's preferable to use a role account in case you
go away. (The point of this, after all, is that the process stops
depending on you.)
Then log in to Nextcloud using the role account (if created above) or
your own account and set up an "app password" (optional, but strongly
recommended):
1. click on your avatar on the top-right
2. pick the `Security` tab
3. scroll down to `Devices & sessions`
4. fill in `dangerzone-webdav-processor` as an App name
5. hit `Create new app password`
6. copy the password (it will disappear in the next step)
7. hit `Done`
You now have a role account (or your own account) set up with a
password specific to your bot.
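
As a quick sanity check, a short Python snippet along the following
lines can confirm that the [webdavclient][] library and the new app
password work together. This is only an illustrative sketch: it
assumes the `Client` class and option names documented for that
library (depending on the fork installed, the module may be named
`webdav.client` rather than `webdav3.client`), that the app password
is exported in the `WEBDAV_PASSWORD` environment variable (as in the
next section), and that the hostname and username are placeholders to
adapt to your own setup.

```python
# Sketch: verify that the WebDAV library and credentials work.
# Depending on the installed fork of webdavclient, the module may be
# named "webdav.client" instead of "webdav3.client".
import os

from webdav3.client import Client

options = {
    # placeholder values, adapt to your own server and account
    "webdav_hostname": "https://example.com/remote.php/dav/files/dangerzone-bot/",
    "webdav_login": "dangerzone-bot",
    "webdav_password": os.environ["WEBDAV_PASSWORD"],
}
client = Client(options)
# list() returns the entries at the root of the account; if this
# works, the URL, username and app password are all correct
print(client.list())
```
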
## Running the script by hand
Once you have a role account and password (or if you like to live
dangerously and just want to use your own account and password), you
can start processing *all* the files at the root of that account
using:

    export WEBDAV_PASSWORD=[REDACTED]
    ./processor.py --username dangerzone-bot --location https://example.com/remote.php/dav/files/dangerzone-bot/ -v

Obviously, the `--username` and `--location` parameters must be
adapted to your configuration. The latter, in particular, can be found
in the `Files` dialog, under `Settings` (at the bottom left), in the
`WebDAV` section.
The above will process *all* folders (except the special ones) and, on
success, dump the sanitized files in the `safe/` special folder.
# History
I was involved in the hiring of two new sysadmins at the Tor Project
in spring 2021. To avoid untrusted inputs (i.e. random PDF files from
the internet) being opened by the hiring committee, we had a tradition
of having someone sanitize those in a somewhat secure environment,
which was typically some Qubes user doing ... whatever it is Qubes
users do.
Then, when a new hiring process started, people asked me to do it
again. At that stage, I had expected this to happen, so I had
partially automated this as a [pull request against the dangerzone project](https://github.com/firstlookmedia/dangerzone-converter/pull/7),
which grew totally out of hand. The automation wasn't quite complete
though: I still had to upload the files to the sanitizing server, run
the script, copy the files back, and upload them into Nextcloud.
But by then people started to think I had magically and fully
automated the document sanitization routine (hint: not quite!), so I
figured it was important to realize that dream and complete the work
so that I didn't have to sit there manually copying files around.
# Design
## Manual process
The partial automation process mentioned above was:
1. get emails in my regular tor inbox with attachments
2. wait a bit to have some accumulate
3. save them to my local hard drive, in a `dangerzone` folder
4. rsync that to a remote virtual machine
5. run a modified version of the [`dangerzone-converter`][] to save
files in a "`safe`" folder (see [batch-convert](https://github.com/anarcat/dangerzone-converter/blob/6a37f48dec67412c44f8814f122ea7977a658334/batch-convert.py) in [PR 7](https://github.com/firstlookmedia/dangerzone-converter/pull/7))
6. rsync the files back to my local computer
7. upload the files into some Nextcloud folder
[`dangerzone-converter`]: https://github.com/firstlookmedia/dangerzone-converter/
## New mechanism
This is more or less how the script works:
1. periodically check a Nextcloud (WebDAV) folder (called
`dangerzone`) for new files
2. when a file is found, move it to a `dangerzone/processing` folder
as an ad-hoc locking mechanism
3. download the file locally
4. process the file with the [`dangerzone-converter`][] container
5. on failure, delete the failed file locally, and move it to a
`dangerzone/rejected` folder remotely
6. on success, upload the sanitized file to a `safe/` folder and move
the original to `dangerzone/processed`
The only bit missing from the previous prototype was the WebDAV part.
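
To make the above more concrete, here is a minimal sketch of that
loop. This is *not* the actual `processor.py`: the folder layout
follows the list above, the WebDAV calls assume the `Client` API of
the [webdavclient][] library (`list`, `move`, `download_sync`,
`upload_sync`), and the conversion step is a hypothetical `sanitize()`
placeholder, since the exact [`dangerzone-converter`][] container
invocation is not reproduced here.

```python
# Illustrative sketch of the processing loop; not the real processor.py.
# Folder names, option values and the sanitize() helper are assumptions.
import os
import tempfile

from webdav3.client import Client  # may be webdav.client in older forks


def sanitize(path):
    """Hypothetical placeholder: run the dangerzone-converter container
    on `path` and return the path of the sanitized PDF, or None on
    failure. The real container invocation is deliberately left out."""
    raise NotImplementedError


def process_once(client):
    # 1. check the dangerzone folder for new files
    for name in client.list("dangerzone/"):
        if name.endswith("/"):  # skip sub-folders
            continue
        # 2. move the file to a "processing" folder as an ad-hoc lock
        client.move("dangerzone/" + name, "dangerzone/processing/" + name)
        # 3. download the file locally
        local = os.path.join(tempfile.mkdtemp(), name)
        client.download_sync(remote_path="dangerzone/processing/" + name,
                             local_path=local)
        # 4. process the file with the dangerzone-converter container
        safe = sanitize(local)
        if safe is None:
            # 5. on failure, delete the local copy and park the original
            #    in the rejected folder
            os.unlink(local)
            client.move("dangerzone/processing/" + name,
                        "dangerzone/rejected/" + name)
        else:
            # 6. on success, upload the sanitized file and move the
            #    original to the processed folder
            client.upload_sync(remote_path="safe/" + os.path.basename(safe),
                               local_path=safe)
            client.move("dangerzone/processing/" + name,
                        "dangerzone/processed/" + name)


if __name__ == "__main__":
    client = Client({
        "webdav_hostname": "https://example.com/remote.php/dav/files/dangerzone-bot/",
        "webdav_login": "dangerzone-bot",
        "webdav_password": os.environ["WEBDAV_PASSWORD"],
    })
    process_once(client)  # run this periodically (e.g. from a scheduler)
```
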
## Rejected process
This alternative process was also suggested:
1. candidates submit their resumes by email
2. the program gets a copy by email
3. the program sanitizes the attachment
4. the program assigns a unique ID and name for that user
(e.g. Candidate 10 Alice Doe)
5. the program uploads the sanitized attachment in a Nextcloud folder
named after the unique ID
My concern with that approach was that it exposes the sanitization
routines to the world, which opens the door to denial-of-service
attacks, at the very least. Someone could flood the disk by sending a
massive number of resumes, for example. I could also think of ZIP
bombs that could have "fun" consequences.
By putting a user between the world and the script, we have some
ad-hoc moderation that alleviates those issues, and we also ensure
that a human-readable, meaningful identity can be attached to each
submission (say: "this is Candidate 7 for job posting foo").
The above would also not work with resumes submitted through other
platforms (e.g. Indeed.com), unless an operator re-injects the resume,
which might make the unique ID creation harder (because the `From`
address will be the operator's, not the candidate's).