From ade260c750e221509c69e2b59222bd93f1e09d7f Mon Sep 17 00:00:00 2001
From: "Iain R. Learmonth" <irl@torproject.org>
Date: Thu, 2 Apr 2020 18:54:58 +0100
Subject: [PATCH] metrics: add org sources to go with html

---
 metrics/ops/exit-ops.org      | 182 +++++++++++++++++++++++
 metrics/ops/metrics-cloud.org | 269 ++++++++++++++++++++++++++++++++++
 2 files changed, 451 insertions(+)
 create mode 100644 metrics/ops/exit-ops.org
 create mode 100644 metrics/ops/metrics-cloud.org

diff --git a/metrics/ops/exit-ops.org b/metrics/ops/exit-ops.org
new file mode 100644
index 00000000..1e2f8e13
--- /dev/null
+++ b/metrics/ops/exit-ops.org
@@ -0,0 +1,182 @@
+#+OPTIONS: broken-links:t
+
+* TODO Name
+
+*exit-ops* - Exit Scanner, TorDNSEL and Tor Check Operations
+
+* TODO Synopsis
+
+While the three services described in this document could be implemented as discrete components,
+they currently have tight coupling which means they must all be deployed on the same host.
+
+** TODO Exit Scanner [0/3]
+
+The exit scanner performs active measurement of Tor exit relays in order to determine the IP addresses that are used for exit connections.
+The active measurement uses an exitmap module, which is wrapped in a script to produce output formatted as an [[https://metrics.torproject.org/collector.html#type-tordnsel][Exit List]].
+
+The exit list results are consumed by CollecTor, [[tordnsel][TorDNSEL]] and [[../check-ops/][Tor Check]].
+Exit lists and bulk exit lists are also consumed by third-party external applications at the following URLs:
+
+- https://check.torproject.org/exit-addresses - Latest exit list
+- https://check.torproject.org/torbulkexitlist - Latest bulk exit list
+
+Documentation questions:
+
+- [ ] How long do we keep old measurements in the exit list?
+- [ ] What are the timings for measurement runs?
+- [ ] How many old exit lists do we keep around?
+
+** TODO TorDNSEL [0/2]
+
+TorDNSEL is a DNS list service that behaves in a similar way to [[https://en.wikipedia.org/wiki/Domain_Name_System-based_Blackhole_List][Domain Name System-based Blackhole Lists]].
+A query for an IP address gives a positive result if that address was found to be used by an exit relay in a recent scan.
+
+Documentation questions:
+
+- [ ] For how long does an address give a positive result?
+- [ ] Do we also include all IP addresses of exit flagged relays in the consensus?
+
+** TODO Tor Check [0/1]
+
+Tor Check is a website that can be used to determine if a browser is using the Tor network for queries.
+It will also check the User-Agent to determine if a user is using Tor Browser.
+It is accessed via HTTPS at https://check.torproject.org/.
+
+Documentation questions:
+
+- [ ] Where is the JSON API?
+
+* DONE Contacts
+
+The primary contact for this service is the Metrics Team <[[mailto:metrics-team@lists.torproject.org][metrics-team@lists.torproject.org]]>.
+For urgent queries, contact *karsten*, *irl*, or *gaba* in [[ircs://irc.oftc.net:6697/tor-project][#tor-project]].
+
+* TODO Overview
+
+The underlying infrastructure for the exit scanner, TorDNSEL and Tor Check services is provided by the
+Tor Sysadmin Team (TSA). All services run on one virtual machine with the hostname ~check-01.torproject.org~.
+
+** TODO Exit Scanner
+
+Documentation questions:
+
+- [ ] Where is the exitmap module?
+- [ ] What are the services called?
+- [ ] What user is used?
+
+** TODO TorDNSEL
+
+Documentation questions:
+
+- [ ] Where does the zone file live?
+- [ ] Ticket about doing DNSSEC signing
+- [ ] Where is DNS served
+- [ ] What name is delegated
+- [ ] Can delegation work in testing environment?
+
+* DONE Sources
+
+The sources for exitmap are available on GitHub: https://github.com/NullHypothesis/exitmap.
+The [[https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/roles/exit-scanner/files/exitscan.py][exitmap wrapper]] and [[https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/roles/exit-scanner/files/ipscan.py][module]] used by the exit scanner can be found in the metrics-cloud repository.
+
+The wrapper script is also responsible for writing out the zone file to be used by the TorDNSEL service
+and triggering a reload of the zone.
+
+The sources for Tor Check are available in our git: https://gitweb.torproject.org/check.git.
+
+* TODO Deployment
+
+** DONE Initial deployment
+
+The initial deployment procedure is split into 2 parts:
+
+- System setup
+- Installing and starting the services
+
+There are no manual steps required to load state, and backups do not need to be performed for the host running this service.
+Everything can be configured from scratch with only the Ansible playbook.
+
+*** DONE Development/testing in AWS
+
+For development or testing in AWS, a CloudFormation template is available named [[https://gitweb.torproject.org/metrics-cloud.git/plain/cloudformation/exit-scanner-dev.yml][~exit-scanner-dev.yml~]].
+
+From the [[https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks][CloudFormation portal]], select your stack and view the outputs.
+You will find here the public IP address for the EC2 instance that has been created.
+Add this instance to *ansible/dev* in your local copy of metrics-cloud.git under "[exit-scanners]".
+
+You can now set up the machine with Ansible by running:
+
+#+BEGIN_SRC shell
+ansible-playbook -i dev exit-scanners-aws.yml
+#+END_SRC
+
+Note that the AWS AMI used has passwordless sudo, so Ansible does not need to prompt for a sudo password.
+
+*** DONE Fresh machine from TSA
+
+Add the host name of the new instance to *ansible/production* in your local
+copy of metrics-cloud.git under "[exit-scanners]" and commit the change.
+
+You can now set up the machine with Ansible by running:
+
+#+BEGIN_SRC shell
+ansible-playbook -i production -K exit-scanners.yml
+#+END_SRC
+
+** TODO Upgrade [0/2]
+
+The upstream sources for the applications that make up this service do not have managed releases,
+which makes upgrades difficult.
+
+To fix a bug in the exit scanner wrapper script, make the change in the metrics-cloud repository and re-run
+the deployment playbook.
+
+- [ ] Can we upgrade exitmap sensibly?
+- [ ] Can we upgrade Tor Check sensibly?
+
+* TODO Diagnostics
+
+** TODO Logs [0/2]
+
+- [ ] What things log?
+- [ ] Where do the logs go?
+
+* TODO Monitoring [0/2]
+
+- [ ] CollecTor log messages
+- [ ] Nagios
+
+* DONE Disaster Recovery
+
+The exit scanner service does not need to maintain any state between runs.
+Keeping state helps it cope with a relay that happened to be down at the time we tried to measure it,
+but in the event of a failure it is perfectly acceptable to throw away the old box and provision a new one.
+To do so, follow the initial deployment instructions above.
+
+* TODO Service Level Agreement
+
+* TODO See Also
+
+* TODO Standards
+
+The exit scanner service produces exit lists according to the [[https://2019.www.torproject.org/tordnsel/exitlist-spec.txt][TorDNSEL exit list format]].
+
+* TODO History
+
+* TODO Authors
+
+* DONE Major Caveats
+
+The exit scanner service does not support IPv6.
+
+* DONE Bugs
+
+Known bugs can be found in the Tor Project Trac for:
+
+- [[https://trac.torproject.org/projects/tor/query?status=!closed&component=Metrics%2FExit%20Scanner][Exit Scanner]]
+- [[https://trac.torproject.org/projects/tor/query?status=!closed&component=Applications%2FTor%20Check][Tor Check]]
+
+Bugs relating to exitmap are tracked in the GitHub project: https://github.com/NullHypothesis/exitmap/issues
+
+New bug reports should be filed in the appropriate tracker and component.
+
diff --git a/metrics/ops/metrics-cloud.org b/metrics/ops/metrics-cloud.org
new file mode 100644
index 00000000..17eb3752
--- /dev/null
+++ b/metrics/ops/metrics-cloud.org
@@ -0,0 +1,269 @@
+#+TITLE: metrics-cloud: Scripts for orchestrating Tor Metrics services
+#+OPTIONS: ^:nil
+
+* DONE Synopsis
+
+The metrics-cloud framework aims to enable:
+
+- reproducible deployments of software
+- consistency between those software deployments
+
+Side-effects of these goals are:
+
+- reproducible experiments (good science)
+- reduced maintenance costs
+- reduced human error
+
+There are currently two components to the metrics-cloud framework: CloudFormation templates and Ansible playbooks.
+The CloudFormation templates are relevant only to testing and development, while the Ansible playbooks are applicable
+to both environments.
+
+* DONE Usage of AWS for Tor Metrics Development
+
+Each member of the Tor Metrics team has a standing allowance of 100USD/month for development using AWS. In practice,
+we have not used more than 50USD/month for the team in any one month and generally sit around 25USD/month. It is
+still important to minimize costs when using AWS and the use of CloudFormation templates and Ansible playbooks for
+rapid creation, provisioning and destruction should help with this.
+
+** DONE CloudFormation Templates
+
+CloudFormation is an AWS service allowing the definition of /stacks/. These stacks describe a series of AWS services
+using a domain-specific language and allow for the easy creation of a number of interconnected resources. All resources
+in a stack are tagged with the stack name which allows for tracking of costs per project. Each stack can also have all
+resources terminated together easily, allowing stacks to exist for only as long as they are needed.
+
+The CloudFormation templates used in the framework can be found in the [[https://gitweb.torproject.org/metrics-cloud.git/tree/cloudformation][cloudformation]] folder of the repository.
+
+The templates for some services are very simple, while others are more complex. Whatever the level of
+complexity, we still want to use the templates to ensure we are meeting the key goals of the framework and to simplify
+tracking of spending in the billing portal through the tags.
+
+Documentation for CloudFormation, including an API reference, can be found at: https://docs.aws.amazon.com/cloudformation/.
+
+*** DONE Quickstart: Deploying a template
+
+Each template begins with comments with any relevant notes about the template, and a deployment command that will upload
+and deploy the template on AWS. The commands will look something like:
+
+#+BEGIN_SRC shell
+aws cloudformation deploy --region us-east-1 --stack-name `whoami`-exit-scanner-dev --template-file exit-scanner-dev.yml --parameter-overrides myKeyPair="$(./identify_user.sh)"
+#+END_SRC
+
+You'll notice that the command includes a call to ~whoami~ to prefix the stack name with your current username, and also
+that the ~identify_user.sh~ script is used to determine which SSH key to use for new instances.
+You do not need to modify this command line before running it.
+
+Once the stack has been deployed from the template, you can view its resources and delete it through
+the [[https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks?filteringText=&filteringStatus=active&viewNested=true&hideStacks=false][CloudFormation management console]].
+
+*** DONE SSH Key Selection
+
+The [[https://gitweb.torproject.org/metrics-cloud.git/tree/cloudformation/identify_user.sh][identify_user.sh]] script prints out the name of the SSH public key to be used based on either:
+
+- the ~TOR_METRICS_SSH_KEY~ environment variable, or
+- the current user name.
+
+The environment variable takes precedence over the username-to-key mapping.
+
+If you change the default key you would like to use, update the mapping in this shell script.
+
+SSH keys are managed through the [[https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#KeyPairs:][EC2 management console]] and are not (currently) managed by a CloudFormation template.
+
+** DONE Templates and Stacks
+
+There is no directory hierarchy for the templates in the ~cloudformation~ folder of the repository. There are a couple of naming
+conventions used though:
+
+- Development/testing templates/stacks use a ~-dev~ suffix after the service name
+- Long-term and shared templates/stacks start with ~metrics-~
+
+*** DONE ~billing-alerts~
+
+The [[https://gitweb.torproject.org/metrics-cloud.git/tree/cloudformation/billing-alerts.yml][~billing-alerts~ template]] sends notifications to the subscribed individuals whenever the predicted spend for the month
+exceeds 50USD. Email addresses can be added here if other people should be notified too.
+
+*** DONE ~metrics-vpc~
+
+The [[https://gitweb.torproject.org/metrics-cloud.git/tree/cloudformation/metrics-vpc.yml][~metrics-vpc~ template]] contains shared resources for Tor Metrics development templates. This includes:
+
+**** MetricsVPC and MetricsSubnet
+
+The subnet should be referenced by any resource that requires it. Use of the default VPC should be avoided as we
+share the AWS account with other Tor teams.
+
+For example, to create an EC2 instance:
+
+#+BEGIN_SRC yaml
+  Instance:
+    Type: AWS::EC2::Instance
+    Properties:
+      AvailabilityZone: !Select [ 0, !GetAZs ]
+      ImageId: ami-01db78123b2b99496
+      InstanceType: t2.large
+      SubnetId:
+        Fn::ImportValue: 'MetricsSubnet'
+      KeyName: !Ref myKeyPair
+      SecurityGroupIds:
+        - Fn::ImportValue: 'MetricsInternetSecurityGroup'
+        - Fn::ImportValue: 'MetricsPingableSecurityGroup'
+        - Fn::ImportValue: 'MetricsHTTPSSecurityGroup'
+#+END_SRC
+
+Note also that the availability zone is not hardcoded, which allows for portability between regions should we ever want that.
+
+**** Various security groups
+
+The EC2 example above uses some of the security groups from the ~metrics-vpc~ template. Refer to the template source
+for details on each group's rules.
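+
+As a rough sketch of how such a group might be defined and exported so that other stacks can reference it with
+~Fn::ImportValue~, consider the fragment below. The ingress rule (TCP port 443 from anywhere) and the descriptions are
+assumptions made for illustration and are not taken from the ~metrics-vpc~ template; the export name is one that
+appears in the example templates in this document. The fragment also assumes a ~MetricsVPC~ resource defined in the
+same template.
+
+#+BEGIN_SRC yaml
+Resources:
+  MetricsHTTPSSecurityGroup:
+    Type: AWS::EC2::SecurityGroup
+    Properties:
+      # Assumed rule for illustration: allow inbound HTTPS from anywhere
+      GroupDescription: Allow inbound HTTPS
+      VpcId: !Ref MetricsVPC
+      SecurityGroupIngress:
+        - IpProtocol: tcp
+          FromPort: 443
+          ToPort: 443
+          CidrIp: 0.0.0.0/0
+Outputs:
+  MetricsHTTPSSecurityGroup:
+    Description: Security group ID for import by other stacks
+    Value: !GetAtt MetricsHTTPSSecurityGroup.GroupId
+    Export:
+      # Other stacks reference this name with Fn::ImportValue
+      Name: MetricsHTTPSSecurityGroup
+#+END_SRC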
+
+**** The development DNS zone
+
+Often services require TLS certificates, or require DNS names for other reasons. To facilitate this, a zone is hosted
+using Route53 allowing for DNS records to be created in CloudFormation templates. This zone is:
+~tm-dev-aws.safemetrics.org~.
+
+As an example, creating an A record for an EC2 instance with the subdomain of the stack name:
+
+#+BEGIN_SRC yaml
+  DNSName:
+    Type: AWS::Route53::RecordSet
+    Properties:
+      HostedZoneName: tm-dev-aws.safemetrics.org.
+      Name: !Join ['', [!Ref 'AWS::StackName', .tm-dev-aws.safemetrics.org.]]
+      Type: A
+      TTL: '300'
+      ResourceRecords:
+      - !GetAtt Instance.PublicIp
+#+END_SRC
+
+:FUTUREQUESTION:
+Q: /Can we use the MetricsDevZone export from ~metrics-vpc~ instead of explicitly defining the zone name every time?/
+:END:
+
+These domain names should *never* appear on anything user facing and are for *development purposes only*.
+
+*** DONE Typical Dev/Testing Stacks
+
+A typical test/dev stack will consist of an EC2 instance and a DNS name. Some services store a lot of data and may have
+a second volume attached for the data storage.
+
+An example template with one t2.large EC2 instance, a 15GB additional disk, and a DNS name:
+
+#+BEGIN_SRC yaml
+---
+# CloudFormation Stack for example development instance
+# This stack will only deploy on us-east-1 and will deploy in the Metrics VPC
+# aws cloudformation deploy --region us-east-1 --stack-name `whoami`-example-dev --template-file example-dev.yml --parameter-overrides myKeyPair="$(./identify_user.sh)"
+AWSTemplateFormatVersion: 2010-09-09
+Parameters:
+  myKeyPair:
+    Description: Amazon EC2 Key Pair
+    Type: "AWS::EC2::KeyPair::KeyName"
+Resources:
+  Instance:
+    Type: AWS::EC2::Instance
+    Properties:
+      AvailabilityZone: !Select [ 0, !GetAZs ]
+      ImageId: ami-01db78123b2b99496
+      InstanceType: t2.large
+      SubnetId:
+        Fn::ImportValue: 'MetricsSubnet'
+      KeyName: !Ref myKeyPair
+      SecurityGroupIds:
+        - Fn::ImportValue: 'MetricsInternetSecurityGroup'
+        - Fn::ImportValue: 'MetricsPingableSecurityGroup'
+        - Fn::ImportValue: 'MetricsHTTPSecurityGroup'
+        - Fn::ImportValue: 'MetricsHTTPSSecurityGroup'
+  ServiceVolume:
+    Type: AWS::EC2::Volume
+    Properties: 
+      AvailabilityZone: !Select [ 0, !GetAZs ]
+      Size: 15
+      VolumeType: gp2
+  ServiceVolumeAttachment:
+    Type: AWS::EC2::VolumeAttachment
+    Properties:
+      Device: /dev/sdb
+      InstanceId: !Ref Instance
+      VolumeId: !Ref ServiceVolume
+  DNSName:
+    Type: AWS::Route53::RecordSet
+    Properties:
+      HostedZoneName: tm-dev-aws.safemetrics.org.
+      Name: !Join ['', [!Ref 'AWS::StackName', .tm-dev-aws.safemetrics.org.]]
+      Type: A
+      TTL: '300'
+      ResourceRecords:
+      - !GetAtt Instance.PublicIp
+Outputs:
+  PublicIp:
+    Description: "Instance public IP"
+    Value: !GetAtt Instance.PublicIp
+#+END_SRC
+
+It's not common to use other AWS services as part of these templates as the goal is usually to have these services
+deployed on TPA managed hosts.
+
+** DONE Linting
+
+[[https://github.com/aws-cloudformation/cfn-python-lint][~cfn-lint~]] is used to ensure we are complying with best practices. None of the team have formal training in the use of CloudFormation
+so we are really making it up as we go along. Other tools may be used in the future, as we learn about them, to make sure we are using
+things efficiently and correctly.
+
+This is also run as part of the [[https://travis-ci.org/github/torproject/metrics-cloud/][continuous integration checks]] on Travis CI.
+
+* TODO Ansible Playbooks
+
+Ansible is an open-source software provisioning, configuration management, and application-deployment tool. It's written in Python,
+is mature, and has an extensive selection of modules for almost everything we could need.
+
+** TODO Inventories and site.yml
+
+In general, there are two inventories: [[https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/production][production]] and dev. Only the production inventory is committed to git; the dev inventory will
+vary between members of the team, referencing their own dev instances as created by CloudFormation. We do not specify a default
+inventory in the ~ansible.cfg~ file, so you must specify an inventory for every invocation of ~ansible-playbook~ using the ~-i~ flag:
+
+#+BEGIN_SRC shell
+ansible-playbook -i dev ...
+#+END_SRC
+
+Inside the inventory, hosts are grouped by their purpose. For each group there is a corresponding YAML file in the root of the
+~ansible~ directory that specifies a playbook for the group. All of these files are included in the ~site.yml~ master playbook to
+allow multiple hosts to be provisioned together.
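+
+As a minimal sketch of this layout, a group playbook and the master playbook might look like the following. The
+~exit-scanners~ group and the ~metrics-common~ and ~exit-scanner~ roles are named elsewhere in this documentation, but
+the contents shown here are assumed rather than copied from the real files, and ~example-service.yml~ is a
+hypothetical placeholder.
+
+#+BEGIN_SRC yaml
+---
+# exit-scanners.yml: playbook for the exit-scanners inventory group (assumed contents)
+- hosts: exit-scanners
+  roles:
+    - metrics-common
+    - exit-scanner
+#+END_SRC
+
+#+BEGIN_SRC yaml
+---
+# site.yml: master playbook that pulls in one playbook per group
+- import_playbook: exit-scanners.yml
+- import_playbook: example-service.yml  # hypothetical placeholder
+#+END_SRC
+
+Running ~site.yml~ against an inventory would then apply each group playbook only to the hosts in its own group.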
+
+** TODO ~metrics-common~
+
+The [[https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/roles/metrics-common][~metrics-common~]] role allows us to have a consistent environment between services, and closely matches the environment that
+would be provided by a TSA managed machine. The role:
+
+- installs dependency packages from Debian (optionally from the backports repository)
+- formats additional volumes attached to the instance using the specified filesystem
+- sets the timezone to UTC (Q: /is this what TSA do?/)
+- creates user accounts for each member of the team
+  - all team members can perform unlimited passwordless sudo (TSA hosts require a password)
+  - SSH password authentication is disabled
+  - all user account passwords are removed/disabled
+- creates service user accounts as specified
+  - home directories are created as specified, and linked from ~/home/$user~
+  - lingering is enabled for service users
+
+This is all configured via group variables in the [[https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/group_vars][~ansible/group_vars/~]] folder. Examples there should help you to understand how
+these work. These override the [[https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/roles/metrics-common/defaults/main.yml][defaults]] set in the role.
+
+** TODO Service roles
+
+** DONE Linting
+
+[[https://docs.ansible.com/ansible-lint/][~ansible-lint~]] is used to ensure we are complying with best practices. None of the team have formal training in the use of Ansible
+so we are really making it up as we go along. Other tools may be used in the future, as we learn about them, to make sure we are using
+things efficiently and correctly.
+
+This is also run as part of the [[https://travis-ci.org/github/torproject/metrics-cloud/][continuous integration checks]] on Travis CI.
+
+* TODO Common Tasks
+
+** TODO Add a new member to the team
+
+** TODO Update an SSH key for a team member
+
+** TODO Deploy and provision a development environment for a service
-- 
GitLab