diff --git a/metrics/ops/exit-ops.org b/metrics/ops/exit-ops.org new file mode 100644 index 0000000000000000000000000000000000000000..1e2f8e13fe25a4853f2692ce46fdaa6c4d130032 --- /dev/null +++ b/metrics/ops/exit-ops.org @@ -0,0 +1,182 @@ +#+OPTIONS: broken-links:t + +* TODO Name + +**exit-ops** - Exit Scanner, TorDNSEL and Tor Check Operations + +* TODO Synopsis + +While the three services described in this document could be implemented as discrete components, +they currently have tight coupling which means they must all be deployed on the same host. + +** TODO Exit Scanner [0/3] + +The exit scanner performs active measurement of Tor exit relays in order to determine the IP addresses that are used for exit connections. +The active measurement uses an exitmap module, which is wrapped in a script to produce output formatted as an [Exit List](https://metrics.torproject.org/collector.html#type-tordnsel). + +The exit list results are consumed by CollecTor, [TorDNSEL](tordnsel) and [Tor Check](../check-ops/). +Exit lists and bulk exit lists are also consumed by third-party external applications at the following URLs: + +- https://check.torproject.org/exit-addresses - Latest exit list +- https://check.torproject.org/torbulkexitlist - Latest bulk exit list + +Documentation questions: + +- [ ] How long do we keep old measurements in the exit list? +- [ ] What are the timings for measurement runs? +- [ ] How many old exit lists do we keep around? + +** TODO TorDNSEL [0/2] + +TorDNSEL is a DNS list service that behaves in a similar way to [[https://en.wikipedia.org/wiki/Domain_Name_System-based_Blackhole_List][Domain Name System-based Blackhole Lists]]. +IP addresses will give positive results in the event that an address has been found to be used by an exit relay in a recent scan. + +Documentation questions: + +- [ ] For how long does an address give a positive result? +- [ ] Do we also include all IP addresses of exit flagged relays in the consensus? 
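Since TorDNSEL is described as behaving like a DNS-based blackhole list, a client lookup can be sketched as follows. This is a minimal illustration that assumes the usual DNSBL convention of reversing the IPv4 octets and appending the list's zone. The function name ~dnsbl_query_name~ and the zone ~dnsel.example.org~ are placeholders, as this document does not yet record which name is delegated.

```shell
#!/bin/sh
# Build a DNSBL-style query name: reversed IPv4 octets followed by the zone.
# The zone is a placeholder; the real delegated name is not stated here.
dnsbl_query_name() {
    ip="$1"
    zone="$2"
    old_ifs="$IFS"
    IFS=.
    # Split the address on dots so that $1..$4 hold the four octets.
    set -- $ip
    IFS="$old_ifs"
    if [ "$#" -ne 4 ]; then
        echo "expected a dotted-quad IPv4 address" >&2
        return 1
    fi
    printf '%s.%s.%s.%s.%s\n' "$4" "$3" "$2" "$1" "$zone"
}

dnsbl_query_name 192.0.2.1 dnsel.example.org
```

The resulting name could then be passed to a resolver tool such as ~dig~. Only IPv4 is handled, which matches the caveat that the exit scanner does not support IPv6.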
+ +** TODO Tor Check [0/1] + +Tor Check is a website that can be used to determine if a browser is using the Tor network for queries. +It will also check the User-Agent to determine if a user is using Tor Browser. +It is accessed via HTTPS at https://check.torproject.org/. + +Documentation questions: + +- [ ] Where is the JSON API? + +* DONE Contacts + +The primary contact for this service is the Metrics Team <[[mailto:metrics-team@lists.torproject.org][metrics-team@lists.torproject.org]]>. +For urgent queries, contact *karsten*, *irl*, or *gaba* in [[ircs://irc.oftc.net:6697/tor-project][#tor-project]]. + +* TODO Overview + +The underlying infrastructure for the exit scanner, TorDNSEL and Tor Check services is provided by the +Tor Sysadmin Team (TSA). All services run on one virtual machine with the hostname ~check-01.torproject.org~. + +** TODO Exit Scanner + +Documentation questions: + +- [ ] Where is the exitmap module? +- [ ] What are the services called? +- [ ] What user is used? + +** TODO TorDNSEL + +Documentation questions: + +- [ ] Where does the zone file live? +- [ ] Ticket about doing DNSSEC signing +- [ ] Where is DNS served +- [ ] What name is delegated +- [ ] Can delegation work in testing environment? + +* DONE Sources + +The sources for exitmap are available on GitHub: https://github.com/NullHypothesis/exitmap. +The [[https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/roles/exit-scanner/files/exitscan.py][exitmap wrapper]] and [[https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/roles/exit-scanner/files/ipscan.py][module]] used by the exit scanner can be found in the metrics-cloud repository. + +The wrapper script is also responsible for writing out the zone file to be used by the TorDNSEL service +and triggering a reload of the zone. + +The sources for Tor Check are available in our git: https://gitweb.torproject.org/check.git. 
+ +* TODO Deployment + +** DONE Initial deployment + +The initial deployment procedure is split into 2 parts: + +- System setup +- Installing and starting the services + +There are no manual steps required to load state, and backups do not need to be performed for the host running this service. +Everything can be configured from scratch with only the Ansible playbook. + +*** DONE Development/testing in AWS + +For development or testing in AWS, a CloudFormation template is available named [[https://gitweb.torproject.org/metrics-cloud.git/plain/cloudformation/exit-scanner-dev.yml][~exit-scanner-dev.yml~]]. + +From the [[https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks][CloudFormation portal]], select your stack and view the outputs. +There you will find the public IP address of the EC2 instance that has been created. +Add this instance to *ansible/dev* in your local copy of metrics-cloud.git under "[exit-scanners]". + +You can now set up the machine with Ansible by running: + +``` +ansible-playbook -i dev exit-scanners-aws.yml +``` + +Note that the AWS AMI used has passwordless sudo, so no password need be given. + +*** DONE Fresh machine from TSA + +Add the host name of the new instance to *ansible/production* in your local +copy of metrics-cloud.git under "[exit-scanners]" and commit the change. + +You can now set up the machine with Ansible by running: + +``` +ansible-playbook -i production -K exit-scanners.yml +``` + +** TODO Upgrade [0/2] + +The upstream sources for the applications that make up this service do not have managed releases, +which makes upgrades difficult. + +To fix a bug in the exit scanner wrapper script, make the change in the metrics-cloud repository and re-run +the deployment playbook. + +- [ ] Can we upgrade exitmap sensibly? +- [ ] Can we upgrade Tor Check sensibly? + +* TODO Diagnostics + +** TODO Logs [0/2] + +- [ ] What things log? +- [ ] Where do the logs go? 
+ +* TODO Monitoring [0/2] + +- [ ] CollecTor log messages +- [ ] Nagios + +* DONE Disaster Recovery + +The exit scanner service does not need to maintain any state between runs. +Retained state helps to cope with a relay that happened to be down at the time we tried to measure +it, but in the event of a failure it is perfectly acceptable to throw away the old box and provision a new one. +To do so, follow the initial deployment instructions above. + +* TODO Service Level Agreement + +* TODO See Also + +* TODO Standards + +The exit scanner service produces exit lists according to the [[https://2019.www.torproject.org/tordnsel/exitlist-spec.txt][TorDNSEL exit list format]]. + +* TODO History + +* TODO Authors + +* DONE Major Caveats + +The exit scanner service does not support IPv6. + +* DONE Bugs + +Known bugs can be found in the Tor Project Trac for: + +- [[https://trac.torproject.org/projects/tor/query?status=!closed&component=Metrics%2FExit%20Scanner][Exit Scanner]] +- [[https://trac.torproject.org/projects/tor/query?status=!closed&component=Applications%2FTor%20Check][Tor Check]] + +Bugs relating to exitmap are tracked on the GitHub project: https://github.com/NullHypothesis/exitmap/issues + +New bug reports should be filed in the appropriate tracker and component. 
+ diff --git a/metrics/ops/metrics-cloud.org b/metrics/ops/metrics-cloud.org new file mode 100644 index 0000000000000000000000000000000000000000..17eb3752eb2fb5ead19e5f295ab2fa661db1d07d --- /dev/null +++ b/metrics/ops/metrics-cloud.org @@ -0,0 +1,269 @@ +#+TITLE: metrics-cloud: Scripts for orchestrating Tor Metrics services +#+OPTIONS: ^:nil + +* DONE Synopsis + +The metrics-cloud framework aims to enable: + +- reproducible deployments of software +- consistency between those software deployments + +Side-effects of these goals are: + +- reproducible experiments (good science) +- reduced maintenance costs +- reduced human error + +There are currently two components to the metrics-cloud framework: CloudFormation templates and Ansible playbooks. +The CloudFormation templates are relevant only to testing and development, while the Ansible playbooks are applicable +to both environments. + +* DONE Usage of AWS for Tor Metrics Development + +Each member of the Tor Metrics team has a standing allowance of 100USD/month for development using AWS. In practice, +we have not used more than 50USD/month for the team in any one month and generally sit around 25USD/month. It is +still important to minimize costs when using AWS and the use of CloudFormation templates and Ansible playbooks for +rapid creation, provisioning and destruction should help with this. + +** DONE CloudFormation Templates + +CloudFormation is an AWS service allowing the definition of /stacks/. These stacks describe a series of AWS services +using a domain-specific language and allow for the easy creation of a number of interconnected resources. All resources +in a stack are tagged with the stack name, which allows for tracking of costs per project. Each stack can also have all +resources terminated together easily, allowing stacks to exist for only as long as they are needed. 
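Tearing a stack down from the command line can be sketched as below. This is an illustration rather than a script from the repository: the helper names ~dev_stack_name~ and ~delete_stack~ and the ~DRY_RUN~ switch are made up for this example, and the ~<user>-<service>-dev~ naming and the ~us-east-1~ region are assumed from the deployment commands used elsewhere in this document.

```shell
#!/bin/sh
# Compose the name of a personal development stack: <user>-<service>-dev.
# The user defaults to the current login name.
dev_stack_name() {
    printf '%s-%s-dev\n' "${2:-$(whoami)}" "$1"
}

# Delete a stack together with every resource it created.
# With DRY_RUN=1 the command is only printed, not executed.
delete_stack() {
    set -- aws cloudformation delete-stack --region us-east-1 --stack-name "$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Print the command that would remove the current user's exit scanner stack.
DRY_RUN=1 delete_stack "$(dev_stack_name exit-scanner)"
```

Running the last line without ~DRY_RUN=1~ would issue the deletion for real, so it is worth checking the printed stack name first.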
+ +The CloudFormation templates used in the framework can be found in the [[https://gitweb.torproject.org/metrics-cloud.git/tree/cloudformation][cloudformation]] folder of the repository. + +It may be that for some services the templates are very simple, and others may be more complex. No matter the level of +complexity we still want to use the templates to ensure we are meeting the key goals of the framework and also to simplify +tracking of spending in the billing portal through the tags. + +Documentation for CloudFormation, including an API reference, can be found at: https://docs.aws.amazon.com/cloudformation/. + +*** DONE Quickstart: Deploying a template + +Each template begins with comments with any relevant notes about the template, and a deployment command that will upload +and deploy the template on AWS. The commands will look something like: + +#+BEGIN_SRC shell +aws cloudformation deploy --region us-east-1 --stack-name `whoami`-exit-scanner-dev --template-file exit-scanner-dev.yml --parameter-overrides myKeyPair="$(./identify_user.sh)" +#+END_SRC + +You'll notice that the command includes a call to ~whoami~ to prefix the stack name with your current username, and also +that the ~identify_user.sh~ script is used to determine which SSH key to use for new instances. +You do not need to modify this command line before running it. + +Once the stack has been deployed from the template, you can view its resources and delete it through +the [[https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks?filteringText=&filteringStatus=active&viewNested=true&hideStacks=false][CloudFormation management console]]. + +*** DONE SSH Key Selection + +The [[https://gitweb.torproject.org/metrics-cloud.git/tree/cloudformation/identify_user.sh][identify_user.sh]] script prints out the name of the SSH public key to be used based on either: + +- the ~TOR_METRICS_SSH_KEY~ environment variable, or +- the current user name. 
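The lookup order can be sketched as follows. This is not the contents of ~identify_user.sh~, only a minimal illustration of the same idea. The function name ~select_ssh_key~, the user names and the key names are all hypothetical.

```shell
#!/bin/sh
# Print the name of the SSH key pair to use for new instances.
select_ssh_key() {
    # 1. An explicit override from the environment is used if present.
    if [ -n "${TOR_METRICS_SSH_KEY:-}" ]; then
        printf '%s\n' "$TOR_METRICS_SSH_KEY"
        return 0
    fi
    # 2. Otherwise fall back to a username-to-key mapping.
    #    The entries below are invented for illustration.
    case "${1:-$(whoami)}" in
        alice) printf '%s\n' "alice-laptop" ;;
        bob)   printf '%s\n' "bob-yubikey" ;;
        *)
            echo "no SSH key known for this user" >&2
            return 1
            ;;
    esac
}

# The override is printed even though a mapping exists for alice.
TOR_METRICS_SSH_KEY=override-key select_ssh_key alice
```

Failing with a non-zero status for an unknown user, rather than printing an empty key name, makes a missing mapping visible before a deployment is attempted.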
+ +The environment variable takes precedence over the username-to-key mapping. + +If you change the default key you would like to use, update the mapping in this shell script. + +SSH keys are managed through the [[https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#KeyPairs:][EC2 management console]] and are not (currently) managed by a CloudFormation template. + +** DONE Templates and Stacks + +There is no directory hierarchy for the templates in the ~cloudformation~ folder of the repository. There are a couple of naming +conventions used though: + +- Development/testing templates/stacks use a ~-dev~ suffix after the service name +- Long-term and shared templates/stacks start with ~metrics-~ + +*** DONE ~billing-alerts~ + +The [[https://gitweb.torproject.org/metrics-cloud.git/tree/cloudformation/billing-alerts.yml][~billing-alerts~ template]] sends notifications to the subscribed individuals whenever the predicted spend for the month will be +over 50USD. Email addresses can be added here if other people should be notified too. + +*** DONE ~metrics-vpc~ + +The [[https://gitweb.torproject.org/metrics-cloud.git/tree/cloudformation/metrics-vpc.yml][~metrics-vpc~ template]] contains shared resources for Tor Metrics development templates. This includes: + +**** MetricsVPC and MetricsSubnet + +The subnet should be referenced by any resource that requires it. Use of the default VPC should be avoided as we +share the AWS account with other Tor teams. 
+ +For example, to create an EC2 instance: + +#+BEGIN_SRC yaml + Instance: + Type: AWS::EC2::Instance + Properties: + AvailabilityZone: !Select [ 0, !GetAZs ] + ImageId: ami-01db78123b2b99496 + InstanceType: t2.large + SubnetId: + Fn::ImportValue: 'MetricsSubnet' + KeyName: !Ref myKeyPair + SecurityGroupIds: + - Fn::ImportValue: 'MetricsInternetSecurityGroup' + - Fn::ImportValue: 'MetricsPingableSecurityGroup' + - Fn::ImportValue: 'MetricsHTTPSSecurityGroup' +#+END_SRC + +Note also that the availability zone is not hardcoded, which allows for portability between regions should we ever want that. + +**** Various security groups + +The EC2 example above uses some of the security groups from the ~metrics-vpc~ template. Refer to the template source +for details on each group's rules. + +**** The development DNS zone + +Often services require TLS certificates, or require DNS names for other reasons. To facilitate this, a zone is hosted +using Route53 allowing for DNS records to be created in CloudFormation templates. This zone is: +~tm-dev-aws.safemetrics.org~. + +As an example, creating an A record for an EC2 instance with the subdomain of the stack name: + +#+BEGIN_SRC yaml + DNSName: + Type: AWS::Route53::RecordSet + Properties: + HostedZoneName: tm-dev-aws.safemetrics.org. + Name: !Join ['', [!Ref 'AWS::StackName', .tm-dev-aws.safemetrics.org.]] + Type: A + TTL: '300' + ResourceRecords: + - !GetAtt Instance.PublicIp +#+END_SRC + +:FUTUREQUESTION: +Q: /Can we use the MetricsDevZone export from ~metrics-vpc~ instead of explicitly defining the zone name every time?/ +:END: + +These domain names should *never* appear on anything user facing and are for *development purposes only*. + +*** DONE Typical Dev/Testing Stacks + +A typical test/dev stack will consist of an EC2 instance and a DNS name. Some services store a lot of data and may have +a second volume attached for the data storage. 
+ +An example template with one t2.large EC2 instance, a 15GB additional disk, and a DNS name: + +#+BEGIN_SRC yaml +--- +# CloudFormation Stack for example development instance +# This stack will only deploy on us-east-1 and will deploy in the Metrics VPC +# aws cloudformation deploy --region us-east-1 --stack-name `whoami`-example-dev --template-file example-dev.yml --parameter-overrides myKeyPair="$(./identify_user.sh)" +AWSTemplateFormatVersion: 2010-09-09 +Parameters: + myKeyPair: + Description: Amazon EC2 Key Pair + Type: "AWS::EC2::KeyPair::KeyName" +Resources: + Instance: + Type: AWS::EC2::Instance + Properties: + AvailabilityZone: !Select [ 0, !GetAZs ] + ImageId: ami-01db78123b2b99496 + InstanceType: t2.large + SubnetId: + Fn::ImportValue: 'MetricsSubnet' + KeyName: !Ref myKeyPair + SecurityGroupIds: + - Fn::ImportValue: 'MetricsInternetSecurityGroup' + - Fn::ImportValue: 'MetricsPingableSecurityGroup' + - Fn::ImportValue: 'MetricsHTTPSecurityGroup' + - Fn::ImportValue: 'MetricsHTTPSSecurityGroup' + ServiceVolume: + Type: AWS::EC2::Volume + Properties: + AvailabilityZone: !Select [ 0, !GetAZs ] + Size: 15 + VolumeType: gp2 + ServiceVolumeAttachment: + Type: AWS::EC2::VolumeAttachment + Properties: + Device: /dev/sdb + InstanceId: !Ref Instance + VolumeId: !Ref ServiceVolume + DNSName: + Type: AWS::Route53::RecordSet + Properties: + HostedZoneName: tm-dev-aws.safemetrics.org. + Name: !Join ['', [!Ref 'AWS::StackName', .tm-dev-aws.safemetrics.org.]] + Type: A + TTL: '300' + ResourceRecords: + - !GetAtt Instance.PublicIp +Outputs: + PublicIp: + Description: "Instance public IP" + Value: !GetAtt Instance.PublicIp +#+END_SRC + +It's not common to use other AWS services as part of these templates as the goal is usually to have these services +deployed on TPA managed hosts. + +** DONE Linting + +[[https://github.com/aws-cloudformation/cfn-python-lint][~cfn-lint~]] is used to ensure we are complying with best practices. 
None of the team have formal training in the use of CloudFormation +so we are really making it up as we go along. Other tools may be used in the future, as we learn about them, to make sure we are using +things efficiently and correctly. + +This is also run as part of the [[https://travis-ci.org/github/torproject/metrics-cloud/][continuous integration checks]] on Travis CI. + +* TODO Ansible Playbooks + +Ansible is an open-source software provisioning, configuration management, and application-deployment tool. It's written in Python, +is mature, and has an extensive selection of modules for almost everything we could need. + +** TODO Inventories and site.yml + +In general, there are two inventories: [[https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/production][production]] and dev. Only the production inventory is committed to git, the dev inventory will +vary between members of the team, referencing their own dev instances as created by CloudFormation. We do not specify a default +inventory in the ~ansible.cfg~ file, so you must specify an inventory for every invocation of ~ansible-playbook~ using the ~-i~ flag: + +#+BEGIN_SRC shell +ansible-playbook -i dev ... +#+END_SRC + +Inside the inventory, hosts are grouped by their purpose. For each group there is a corresponding YAML file in the root of the +~ansible~ directory that specifies a playbook for the group. All of these files are included in the ~site.yml~ master playbook to +allow multiple hosts to be provisioned together. + +** TODO ~metrics-common~ + +The [[https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/roles/metrics-common][~metrics-common~]] role allows us to have a consistent environment between services, and closely matches the environment that +would be provided by a TSA managed machine. 
The role handles: + +- installation of dependency packages from Debian (optionally from the backports repository) +- formats additional volumes attached to the instance using the specified filesystem +- sets the timezone to UTC (Q: /is this what TSA do?/) +- creates user accounts for each member of the team + - all team members can perform unlimited passwordless sudo (TSA hosts require a password) + - SSH password authentication is disabled + - all user account passwords are removed/disabled +- creates service user accounts as specified + - home directories are created as specified, and linked from ~/home/$user~ + - lingering is enabled for service users + +This is all configured via group variables in the [[https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/group_vars][~ansible/group_vars/~]] folder. Examples there should help you to understand how +these work. These override the [[https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/roles/metrics-common/defaults/main.yml][defaults]] set in the role. + +** TODO Service roles + +** DONE Linting + +[[https://docs.ansible.com/ansible-lint/][~ansible-lint~]] is used to ensure we are complying with best practices. None of the team have formal training in the use of Ansible +so we are really making it up as we go along. Other tools may be used in the future, as we learn about them, to make sure we are using +things efficiently and correctly. + +This is also run as part of the [[https://travis-ci.org/github/torproject/metrics-cloud/][continuous integration checks]] on Travis CI. + +* TODO Common Tasks + +** TODO Add a new member to the team + +** TODO Update an SSH key for a team member + +** TODO Deploy and provision a development environment for a service