diff --git a/metrics/ops/exit-ops.mdwn b/metrics/ops/exit-ops.mdwn index c6e96ea517549eb3241a5854c117c16dbec975e2..4d5c8e18176e9ae47b5fad1865fb5961853068cd 100644 --- a/metrics/ops/exit-ops.mdwn +++ b/metrics/ops/exit-ops.mdwn @@ -1,8 +1,9 @@ - # exit-ops Exit Scanner, TorDNSEL and Tor Check Operations +[[!toc levels=3]] + # Overview While the three services described in this document could be implemented as discrete components, diff --git a/metrics/ops/metrics-cloud.mdwn b/metrics/ops/metrics-cloud.mdwn index 44367f8e86040c6ab1ca144f614658dace45a114..602721a3432bf5f1a0355d71692c486add083682 100644 --- a/metrics/ops/metrics-cloud.mdwn +++ b/metrics/ops/metrics-cloud.mdwn @@ -1,31 +1,5 @@ -# Table of Contents - -1. [Synopsis](#orgb3a4817) -2. [Usage of AWS for Tor Metrics Development](#orgb76cd81) - 1. [CloudFormation Templates](#orgee150b1) - 1. [Quickstart: Deploying a template](#org7813a03) - 2. [SSH Key Selection](#orgdc7711c) - 2. [Templates and Stacks](#org19b1306) - 1. [`billing-alerts`](#org1b9ae57) - 2. [`metrics-vpc`](#org2c178f5) - 3. [Typical Dev/Testing Stacks](#org97f9e67) - 3. [Linting](#orga89e157) -3. [Ansible Playbooks](#org8371364) - 1. [Inventories and site.yml](#org81a0dc9) - 2. [`metrics-common`](#org55e2902) - 3. [Service roles](#org7050aae) - 4. [Linting](#org9684f51) -4. [Common Tasks](#org8267248) - 1. [Add a new member to the team](#org9040a14) - 2. [Update an SSH key for a team member](#org97696ab) - 3. [Deploy and provision a development environment for a service](#org400659a) - - - -<a id="orgb3a4817"></a> - -# DONE Synopsis +# Overview The metrics-cloud framework aims to enable: @@ -43,9 +17,7 @@ The CloudFormation templates are relevant only to testing and development, while to both environments. -<a id="orgb76cd81"></a> - -# DONE Usage of AWS for Tor Metrics Development +# Usage of AWS for Tor Metrics Development Each member of the Tor Metrics team has a standing allowance of 100USD/month for development using AWS. 
In practice, we have not used more than 50USD/month for the team in any one month and generally sit around 25USD/month. It is @@ -53,9 +25,7 @@ still important to minimize costs when using AWS and the use of CloudFormation t rapid creation, provisioning and destruction should help with this. -<a id="orgee150b1"></a> - -## DONE CloudFormation Templates +## CloudFormation Templates CloudFormation is an AWS service allowing the definition of *stacks*. These stacks describe a series of AWS services using a domain-specific language and allow for the easy creation of a number of interconnected resources. All resources @@ -71,9 +41,7 @@ tracking of spending in the billing portal through the tags. Documentation for CloudFormation, including an API reference, can be found at: <https://docs.aws.amazon.com/cloudformation/>. -<a id="org7813a03"></a> - -### DONE Quickstart: Deploying a template +### Quickstart: Deploying a template Each template begins with comments with any relevant notes about the template, and a deployment command that will upload and deploy the template on AWS. The commands will look something like: @@ -88,9 +56,7 @@ Once the stack has been deployed from the template, you can view its resources a the [CloudFormation management console](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks?filteringText=&filteringStatus=active&viewNested=true&hideStacks=false). -<a id="orgdc7711c"></a> - -### DONE SSH Key Selection +### SSH Key Selection The [identify\_user.sh](https://gitweb.torproject.org/metrics-cloud.git/tree/cloudformation/identify_user.sh) script prints out the name of the SSH public key to be used based on either: @@ -104,9 +70,7 @@ If you change the default key you would like to use, update the mapping in this SSH keys are managed through the [EC2 management console](https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#KeyPairs:) and are not (currently) managed by a CloudFormation template. 
-<a id="org19b1306"></a> - -## DONE Templates and Stacks +## Templates and Stacks There is no directory hierachy for the templates in the `cloudformation` folder of the repository. There are a couple of naming conventions used though: @@ -115,17 +79,13 @@ conventions used though: - Long-term and shared templates/stacks start with `metrics-` -<a id="org1b9ae57"></a> - -### DONE `billing-alerts` +### `billing-alerts` The [`billing-alerts` template](https://gitweb.torproject.org/metrics-cloud.git/tree/cloudformation/billing-alerts.yml) sends notifications to the subscribed individuals whenever the predicted spend for the month will be over 50USD. Email addresses can be added here if other people should be notified too. -<a id="org2c178f5"></a> - -### DONE `metrics-vpc` +### `metrics-vpc` The [`metrics-vpc` template](https://gitweb.torproject.org/metrics-cloud.git/tree/cloudformation/metrics-vpc.yml) contains shared resources for Tor Metrics development templates. This includes: @@ -180,9 +140,7 @@ The [`metrics-vpc` template](https://gitweb.torproject.org/metrics-cloud.git/tre These domain names should **never** appear on anything user facing and are for **development purposes only**. -<a id="org97f9e67"></a> - -### DONE Typical Dev/Testing Stacks +### Typical Dev/Testing Stacks A typical test/dev stack will consist of an EC2 instance and a DNS name. Some services store a lot of data and may have a second volume attached for the data storage. @@ -243,9 +201,7 @@ It's not common to use other AWS services as part of these templates as the goal deployed on TPA managed hosts. -<a id="orga89e157"></a> - -## DONE Linting +## Linting [`cfn-lint`](https://github.com/aws-cloudformation/cfn-python-lint) is used to ensure we are complying with best practices. None of the team have formal training in the use of CloudFormation so we are really making it up as we go along. 
Other tools may be used in the future, as we learn about them, to make sure we are using @@ -254,17 +210,13 @@ things efficiently and correctly. This is also run as part of the [continuous integration checks](https://travis-ci.org/github/torproject/metrics-cloud/) on Travis CI. -<a id="org8371364"></a> - -# TODO Ansible Playbooks +# Ansible Playbooks Ansible is an open-source software provisioning, configuration management, and application-deployment tool. It's written in Python, is mature, and has an extensive selection of modules for almost everything we could need. -<a id="org81a0dc9"></a> - -## TODO Inventories and site.yml +## Inventories and site.yml In general, there are two inventories: [production](https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/production) and dev. Only the production inventory is committed to git, the dev inventory will vary between members of the team, referencing their own dev instances as created by CloudFormation. We do not specify a default @@ -277,9 +229,7 @@ Inside the inventory, hosts are grouped by their purpose. For each group there i allow multiple hosts to be provisioned together. -<a id="org55e2902"></a> - -## TODO `metrics-common` +## `metrics-common` The [`metrics-common`](https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/roles/metrics-common) role allows us to have a consistent environment between services, and closely matches the environment that would be provided by a TSA managed machine. The role handles: @@ -299,14 +249,10 @@ This is all configured via group variables in the [`ansible/group_vars/`](https: these work. These override the [defaults](https://gitweb.torproject.org/metrics-cloud.git/tree/ansible/roles/metrics-common/defaults/main.yml) set in the role. -<a id="org7050aae"></a> - -## TODO Service roles - +## Service roles -<a id="org9684f51"></a> -## DONE Linting +## Linting [`ansible-lint`](https://docs.ansible.com/ansible-lint/) is used to ensure we are complying with best practices. 
None of the team have formal training in the use of Ansible so we are really making it up as we go along. Other tools may be used in the future, as we learn about them, to make sure we are using @@ -315,22 +261,14 @@ things efficiently and correctly. This is also run as part of the [continuous integration checks](https://travis-ci.org/github/torproject/metrics-cloud/) on Travis CI. -<a id="org8267248"></a> - -# TODO Common Tasks - - -<a id="org9040a14"></a> - -## TODO Add a new member to the team +# Common Tasks -<a id="org97696ab"></a> +## Add a new member to the team -## TODO Update an SSH key for a team member +## Update an SSH key for a team member -<a id="org400659a"></a> -## TODO Deploy and provision a development environment for a service +## Deploy and provision a development environment for a service diff --git a/metrics/ops/monitoring.mdwn b/metrics/ops/monitoring.mdwn new file mode 100644 index 0000000000000000000000000000000000000000..de4d6805e6602ffe6c42ee6983412d17d31c8625 --- /dev/null +++ b/metrics/ops/monitoring.mdwn @@ -0,0 +1,56 @@ +# monitoring + +[[!toc levels=3]] + +## CollecTor + +This is a TSA host so already has a bunch of ping and NRPE checks. Application +specific checks are mostly looking at the index file: + +* That there is an index file that parses and: + * it was recently updated + * it contains a recent run for: + * bridge descriptors + * relay descriptors + * exit lists + +The old check uses bushel's CollecTor index parser, but we could equally hack +up a single python script to do this with the JSON at a lower level. In the +end it looks a lot like the Onionoo plugin on the TSA Nagios. + +## Onionoo + +We have a Python script that runs on the TSA Nagios to check Onionoo. 
+
+https://gitweb.torproject.org/admin/tor-nagios.git/tree/tor-nagios-checks/checks/tor-check-onionoo
+
+### Bonus Points
+
+A quick win for someone with some time, I had started extending this to check
+a relay's status (with a relay ops hat on):
+
+* Onionoo is unhappy => UNKNOWN (because we're monitoring the relay not Onionoo)
+* Tor version number not recommended => WARN
+* Last changed address recently => WARN
+* BadExit flag is present => WARN
+* Not running => CRIT
+* Rate of change of consensus weight is large => WARN
+* Rate of change of bandwidth usage is large => WARN
+* Otherwise => OK
+
+If it's OK, output the current set of flags alphabetically sorted (or at least
+consistently sorted) and include the current consensus weight and bandwidth
+values in Nagios performance data format.
+
+## OnionPerf
+
+The primary issue with OnionPerfs is that they run out of disk space. A decent
+set of ping and NRPE checks should cover most of the common issues we've had.
+
+Application specific checks would include:
+
+* that a file is available in the webserver root for the last analysis run
+* that there is something listening on the tgen connect port
+  * also on the onion service
+* that the HTTPS certificate is valid and not about to expire (on port 8443)
+
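Two of the OnionPerf application checks listed in monitoring.mdwn (something listening on the tgen connect port, and the HTTPS certificate on port 8443 not being about to expire) could be sketched as a Nagios-style plugin roughly as follows. This is only an illustration under stated assumptions, not the deployed check: the function names, the 14-day and 3-day thresholds, and passing the tgen port as an argument are all hypothetical, since the document gives neither a port number nor thresholds.

```python
"""Hypothetical sketch of a Nagios-style OnionPerf check (not the deployed plugin)."""
import socket
import ssl
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_days_left(days_left, warn_days=14, crit_days=3):
    """Map remaining certificate lifetime to a Nagios status code.

    The thresholds are assumptions chosen for illustration.
    """
    if days_left < crit_days:
        return CRITICAL
    if days_left < warn_days:
        return WARNING
    return OK


def port_is_listening(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def cert_days_left(host, port=8443, timeout=10.0, now=None):
    """Return the number of days until the server certificate expires.

    The default SSL context verifies the certificate, so an invalid or
    already expired certificate raises an exception during the handshake.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400.0


def main(argv):
    """Usage (hypothetical): check.py HOST TGEN_PORT"""
    host, tgen_port = argv[1], int(argv[2])
    if not port_is_listening(host, tgen_port):
        print(f"CRITICAL: nothing listening on {host}:{tgen_port}")
        return CRITICAL
    try:
        days = cert_days_left(host)
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        print(f"CRITICAL: HTTPS check on {host}:8443 failed: {exc}")
        return CRITICAL
    status = classify_days_left(days)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    # Text after the "|" is Nagios performance data.
    print(f"{label}: certificate expires in {days:.1f} days | cert_days={days:.1f}")
    return status
```

To run it as a plugin, one would call `sys.exit(main(sys.argv))` from a small wrapper script. Keeping the threshold logic in `classify_days_left` separate from the network calls means it can be tested without a live OnionPerf host.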