Metrics OKRs 2021
This milestone tracks Metrics OKRs for the end of 2021.
OBJECTIVE 1. Consolidate metrics systems and data monitoring
Key results
1.1 Consolidate metrics monitoring and alerting by using Prometheus across services
1.2 Monitor issues with data. Receive a notification when we are missing data-points.
1.3 Build a common log policy across services.
Metrics systems are all monitored via Nagios and in some cases Prometheus [1].
Our systems monitoring strategy should be consolidated. Since the sysadmin team is moving to Prometheus, all metrics systems should be monitored via Prometheus.
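As a hedged illustration of what consolidated Prometheus monitoring could look like, the sketch below parses the JSON shape returned by Prometheus's `/api/v1/query` endpoint for the `up` metric and lists scrape targets that are down. The payload is a hand-written sample and the instance names are hypothetical, not taken from our actual deployment.

```python
import json

# Sample payload in the shape returned by Prometheus's /api/v1/query
# endpoint for the query `up`. Instance names here are hypothetical.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "collector:9100", "job": "node"},
             "value": [1609459200, "1"]},
            {"metric": {"instance": "onionoo:9100", "job": "node"},
             "value": [1609459200, "0"]},
        ],
    },
})

def down_targets(response_body: str) -> list:
    """Return instances whose `up` sample is 0 (i.e. the scrape failed)."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed")
    return [
        series["metric"].get("instance", "<unknown>")
        for series in payload["data"]["result"]
        if float(series["value"][1]) == 0.0
    ]

print(down_targets(SAMPLE_RESPONSE))  # only the onionoo instance is down
```

In a real consolidation, alerting on `up == 0` would live in Prometheus alerting rules rather than in a script; the sketch only shows the data shape services would expose.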
Data processes should be monitored, and the responsible point(s) of contact (POCs) should be alerted when something breaks in our data processes.
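One way to catch missing data-points is a freshness check: alert when a process's newest data-point is older than its expected cadence allows. The sketch below is a minimal, hypothetical version of that check; the process names, cadence, and alerting threshold are illustrative assumptions, not our actual configuration.

```python
from datetime import datetime, timedelta, timezone

def stale_processes(last_seen, expected_interval, now):
    """Return names of processes whose latest data-point is overdue.

    last_seen: mapping of process name -> timestamp of its newest data-point.
    We allow one missed run before alerting (hypothetical threshold).
    """
    deadline = expected_interval * 2
    return sorted(
        name for name, ts in last_seen.items()
        if now - ts > deadline
    )

now = datetime(2021, 1, 15, 12, 0, tzinfo=timezone.utc)
last = {
    "collector": now - timedelta(hours=1),   # fresh
    "onionoo": now - timedelta(hours=9),     # missed several runs
}
print(stale_processes(last, timedelta(hours=2), now))  # ['onionoo']
```

The output of such a check would feed whatever notification channel the POCs use (email, IRC, or a Prometheus alert).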
Our log policy should be reviewed. At the moment some processes send part of their logs via cron emails, while other logs are written to the system log. We should consolidate what is logged and what is sent via email alerts, so that it is easier to debug application issues.
OBJECTIVE 2. Improve stability and scalability
Key results
2.1 Reduce i/o dependency of metrics services by 30%.
2.2 Research and deploy a data store for metrics services.
2.3 Implement a data API for metrics services.
It should be possible to scale and plan new features for the metrics pipeline without having to worry about losing data because of unknown instabilities or legacy code.
The main metrics services (CollecTor, Onionoo, website) are I/O intensive (either network or disk). Memory errors or processes taking too much time have caused data loss. We should document how this I/O dependency can be optimized and/or reduced.
Implement an MVP of a metrics data store for medium-term data (CollecTor remaining our long-term historical archive). Evaluate different possibilities (Cassandra vs. PostgreSQL vs. Hadoop).
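Whichever store wins the evaluation, the MVP's contract can be sketched now: write data-points, query a time range, and prune rows older than the medium-term retention window. Below is a minimal Python sketch using `sqlite3` purely as a stand-in for the eventual store; the table layout, metric name, and 90-day retention window are all hypothetical.

```python
import sqlite3

RETENTION_SECONDS = 90 * 24 * 3600  # hypothetical medium-term window (90 days)

def open_store():
    # sqlite3 in memory stands in for Cassandra/PostgreSQL/Hadoop here.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE datapoints (metric TEXT, ts INTEGER, value REAL)")
    return conn

def write_point(conn, metric, ts, value):
    conn.execute("INSERT INTO datapoints VALUES (?, ?, ?)", (metric, ts, value))

def query_range(conn, metric, start, end):
    """Data-points for `metric` with start <= ts < end, ordered by time."""
    rows = conn.execute(
        "SELECT ts, value FROM datapoints"
        " WHERE metric = ? AND ts >= ? AND ts < ? ORDER BY ts",
        (metric, start, end),
    )
    return rows.fetchall()

def prune(conn, now):
    """Drop rows outside the retention window; CollecTor keeps the long-term
    archive, so pruning here loses nothing permanently."""
    conn.execute("DELETE FROM datapoints WHERE ts < ?",
                 (now - RETENTION_SECONDS,))

conn = open_store()
write_point(conn, "relays.running", 1_000_000, 7000.0)
write_point(conn, "relays.running", 1_003_600, 7100.0)
print(query_range(conn, "relays.running", 1_000_000, 1_010_000))
```

The point of the sketch is the interface, not the engine: any of the three candidates can sit behind `write_point`/`query_range`/`prune`, which keeps the evaluation from blocking the data API work in KR 2.3.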
OBJECTIVE 3. Restructure the code base to tackle technical debt and improve access to metrics data.
Key results
3.1 Document and evaluate current metrics data models from the R server.
3.2 Separate relay-search from the static website.
3.3 Use grafana and metabase to analyze and monitor metrics data.
Current metrics models regarding data and analysis live in the R server and in the PostgreSQL DB on meronense. It is time to re-evaluate the state of these models and come up with new reports on the present and future of Tor metrics [2][3].
Our current code base is running Java 8, which is slowly reaching the point where our dependencies will not support our code anymore. We should start planning how to restructure our current code base and identify where most of our technical debt lies.
A good starting point for this effort could be planning a redesign of the current stack powering the metrics website.
The current website is composed of:
- The R server
- The metrics DB on meronense
- Relay search
- Metrics website
Finally, offer analytical tools to reduce the "time to insight" into metrics data, such as Metabase for website analytics and Grafana to monitor the network.
Related Tickets: