draft a first 2021 roadmap authored by anarcat
@@ -8,63 +8,106 @@ work in the coming year.
 # Overall goals
-## Brainstorm
+Those goals are based on the user survey performed in December 2020
+and are going to be discussed in the TPA team in January 2021. As of
+2021-01-19, this is just a draft proposed by @anarcat and not formally
+adopted by the team.

-The following are conclusions drawn from the survey, below:
+As a reminder, the priority suggested by the survey is "service
+stabilisation" before "new services". Furthermore, some services are
+way more popular than others, so those services should get special
+attention. In general, the over-arching goals are therefore:

-* email delivery needs to be improved, multiple possible solutions
-  * split eugeni into lists and forwards
-  * setup submit-01 to deliver people's emails (#30608)
-  * stop treating eugeni as a smart host: have CiviCRM and RT and
-    other machines deliver their own email
-  * CiviCRM needs to handle its bounces
-  * followup on abuse complaints
-* continue the GitLab migration:
-  * setup GitLab CI for everyone, deprecate Jenkins
-  * migrate away from Gitolite and Gitweb
-* fix the blog, possible solutions:
-  * migrate to static website and Discourse
+* stabilisation (particularly email but also GitLab, schleuder, blog)
+* better communication (particularly with devs)
+
+## Need to have
+
+* email delivery improvements:
+  * handle bounces in CiviCRM ([issue 33037](https://gitlab.torproject.org/tpo/tpa/team/-/issues/33037))
+  * systematically followup on and respond to abuse complaints
+  * diagnose and resolve delivery issue (e.g. [yahoo delivery
+    problems](https://gitlab.torproject.org/tpo/tpa/team/-/issues/34134))
+  * provide reliable delivery for users ("my email ends up in spam!")
+  * possible implementations:
+    * split mailing lists out of eugeni
+    * setup submit-01 to deliver people's emails ([issue 30608](https://gitlab.torproject.org/tpo/tpa/team/-/issues/30608))
+    * split schleuder out of eugeni (or retire)
+    * stop using eugeni as a smart host (each host sends its own
+      email, particularly RT and CiviCRM)
+* retire old services:
+  * SVN ([issue 17202](https://gitlab.torproject.org/tpo/tpa/team/-/issues/17202))
+  * fpcentral ([issue 40009](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40009))
+  * gitolite (replaced with GitLab, see above)
+  * gitweb (replaced with GitLab, see above)
+  * jenkins (replaced with GitLab, see above)
+* scale GitLab with ongoing and surely expanding usage
+  * possibly split in multiple server
+  * throw more hardware at it?
+  * monitoring?
+* provide reliable and simple continuous integration services
+  * retire Jenkins
+  * replace with GitLab CI, with Windows, Mac and Linux runners
+* avoid duplicate git hosting infrastructure
+  * retire gitolite, gitweb ([issue 36](https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/36))
+* fix the blog moderation and comment moderation, possible solutions:
+  * migrate to a static website and Discourse
   * fix formatting and improve moderation within Drupal
-* retire more services:
-  * SVN
-  * fpcentral
-  * schleuder?
-  * testnet?
-  * gitolite (to GitLab, see above)
-  * gitweb (to GitLab, see above)
-  * jenkins (to GitLab, see above)
-* stabilise service offering, possible solutions:
-  * retire services (see above)
-  * balance FSN/CHI ganeti clusters
-  * finish transitions and migrations (e.g. GitLab, main website,
-    etc)
-  * document "downtimes of 1 hour or longer", maybe part of the
-    monthly report? "how many 9's?" suggest mitigations when
-    downtimes occur (maybe just a static site made with [cstate](https://github.com/cstate/cstate)?
-    with contingencies for when the static site network goes down, of
-    course) see https://gitlab.torproject.org/tpo/tpa/team/-/issues/40138
-  * above probably requires auditing and reducing noise in Nagios
-    alerts, because alerts fatigue makes it useless for detecting
-    outages right now
-* improve developer experience:
-  * provide development/experimental VMs?
-  * give developers more tools to debug problems (e.g. grafana, stack
-    traces hidden in syslog)
-  * improve interaction between TPA and devs when new services are
-    setup
-
-Also note the following 2020 goals that are not mentioned above and
-might be added:
+* improve communications and monitoring:
+  * document "downtimes of 1 hour or longer", in a status page [issue
+    40138](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40138)
+  * reduce alert fatigue in Nagios
+  * publicize debugging tools (Grafana, user-level logging in systemd
+    services)
+  * encourage communication and ticket creation
+  * move root@ and tpa "noise" to RT ([ticket 31242](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31242)), make a real
+    mailing list for admins so that gaba and non-tech can join
+* plan for hiro's vacation (replacement?)
-* moly retirement
-* solr/search.tpo deployment
-* web metrics (#32996)
-* varnish to nginx conversion (#32462)
+## Nice to have

-## TODO: Need to have
-## TODO: Nice to have
-## TODO: Non-goals
-# TODO: Quarterly breakdown
+* improve sysadmin code base
+  * avoid YOLO commits in Puppet (possibly: server-side linting, CI)
+  * publish our Puppet repository ([ticket 29387](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387))
+  * reduce dependency on Python 2 code (see [short term LDAP plan](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ldap#short-term-merge-with-upstream-port-to-python-3-if-necessary))
+  * reduce dependency on LDAP (move hosts to Puppet? see [mid term
+    LDAP plan](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ldap#mid-term-move-hosts-to-puppet-possibly-replace-ud-ldap-with-simpler-dashboard))
+* retire more old services:
+  * testnet?
+  * schleuder?
+* provide secure, end-to-end authentication of Tor source code
+  ([issue 81](https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/81))
+* finish retiring old hardware (moly, [ticket 29974](https://gitlab.torproject.org/legacy/trac/-/issues/29974))
+* varnish to nginx conversion (#32462)
+* solr/search.tpo deployment (#33106)?
+* web metrics (#32996)?
+* GitLab pages hosting
+* experiment with containers/kubernetes for CI/CD
+
+## Non-goals
+
+* complete email service: not enough time / budget (or delegate + pay Riseup)
+* "provide development/experimental VMs": would be possible through
+  GitLab CD, to be investigated once we have GitLab CI solidly
+  running
+* "improve interaction between TPA and devs when new services are
+  setup": see "improve communications" above, and "experimental
+  VMs". The endgame here is people will be able to deploy their own
+  services through Docker, but this will likely not happen in 2021
+* static mirror network retirement / rearchitecture: we want to test
+  out GitLab pages first and see if it can provide a decent
+  alternative
+* TODO: "finish main website transition", "broken links on
+  website"... should TPA cover for web stuff?
+* TODO: are service admins still a thing? should we cover for things
+  like the metrics team?
+* complete puppetization: old legacy services are not in Puppet. that
+  is fine: we keep maintaining them by hand when relevant, but new
+  services should all be built in Puppet
+* replace Nagios with Prometheus: not a short term goal, no clear
+  benefit. reduce the noise in Nagios instead
+
+# Quarterly breakdown
 ## Q1
@@ -73,6 +116,17 @@ this roadmap is concerned. It should include items we are fairly
 certain to be able to complete within the next few months or
 so. Postponing those could cause problems.
+* email delivery improvements:
+  * handle bounces in CiviCRM ([issue 33037](https://gitlab.torproject.org/tpo/tpa/team/-/issues/33037))
+  * followup on abuse complaints
+  * diagnose and resolve delivery issue (e.g. [yahoo delivery
+    problems](https://gitlab.torproject.org/tpo/tpa/team/-/issues/34134))
+* GitLab CI deployment, plan for Jenkins retirement
+* setup a discourse instance, deprecate blog comments?
+* plan for blog replacement?
+* document "downtimes of 1 hour or longer", in a status page [issue
+  40138](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40138)
 ## Q2
 Second quarter is a little more vague, but should still be
@@ -80,6 +134,17 @@ Second quarter is a little more vague, but should still be
 wait a little longer or that are part of longer projects that will
 take longer to complete.
+* retire old services:
+  * SVN ([issue 17202](https://gitlab.torproject.org/tpo/tpa/team/-/issues/17202))
+  * fpcentral ([issue 40009](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40009))
+  * establish plan for gitolite/gitweb retirement
+* improve sysadmin code base
+  * avoid YOLO commits in Puppet (possibly: server-side linting, CI)
+  * publish our Puppet repository ([ticket 29387](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387))
+  * reduce dependency on Python 2 code (see [short term LDAP plan](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ldap#short-term-merge-with-upstream-port-to-python-3-if-necessary))
+  * reduce dependency on LDAP (move hosts to Puppet? see [mid term
+    LDAP plan](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ldap#mid-term-move-hosts-to-puppet-possibly-replace-ud-ldap-with-simpler-dashboard))
 ## Q3
 From our experience, after three quarters, things get difficult to
@@ -88,13 +153,19 @@ time before this time, which totally changed basic assumptions about
 worker availability and priorities.
 Also, a global pandemic basically tore the world apart, throwing
-everything in the air.
+everything in the air, so obviously plans kind of went out the
+window. Hopefully this won't happen again and the pandemic will
+somewhat subside, but we should plan for the worst.
+* jenkins retirement?
 ## Q4
 Obviously, the fourth quarter is sheer crystal balling at this stage,
 but it should still be an interesting exercise to perform.
+* gitolite/gitweb retirement?
 # 2020 roadmap evaluation
 The following is a review of the 2020 roadmap.