draft a first 2021 roadmap, authored by anarcat
@@ -8,63 +8,106 @@ work in the coming year.
# Overall goals
## Brainstorm
Those goals are based on the user survey performed in December 2020
and are going to be discussed in the TPA team in January 2021. As of
2021-01-19, this is just a draft proposed by @anarcat and not formally
adopted by the team.
The following are conclusions drawn from the survey, below:
As a reminder, the priority suggested by the survey is "service
stabilisation" before "new services". Furthermore, some services are
way more popular than others, so those services should get special
attention. In general, the over-arching goals are therefore:
* email delivery needs to be improved, multiple possible solutions
* split eugeni into lists and forwards
* set up submit-01 to deliver people's emails (#30608)
* stop treating eugeni as a smart host: have CiviCRM and RT and
other machines deliver their own email
* CiviCRM needs to handle its bounces
* follow up on abuse complaints
* continue the GitLab migration:
* set up GitLab CI for everyone, deprecate Jenkins
* migrate away from Gitolite and Gitweb
* fix the blog, possible solutions:
* migrate to static website and Discourse
* stabilisation (particularly email but also GitLab, schleuder, blog)
* better communication (particularly with devs)
## Need to have
* email delivery improvements:
* handle bounces in CiviCRM ([issue 33037](https://gitlab.torproject.org/tpo/tpa/team/-/issues/33037))
* systematically follow up on and respond to abuse complaints
* diagnose and resolve delivery issues (e.g. [Yahoo delivery
problems](https://gitlab.torproject.org/tpo/tpa/team/-/issues/34134))
* provide reliable delivery for users ("my email ends up in spam!")
* possible implementations:
* split mailing lists out of eugeni
* set up submit-01 to deliver people's emails ([issue 30608](https://gitlab.torproject.org/tpo/tpa/team/-/issues/30608))
* split schleuder out of eugeni (or retire)
* stop using eugeni as a smart host (each host sends its own
email, particularly RT and CiviCRM)
* retire old services:
* SVN ([issue 17202](https://gitlab.torproject.org/tpo/tpa/team/-/issues/17202))
* fpcentral ([issue 40009](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40009))
* gitolite (replaced with GitLab, see above)
* gitweb (replaced with GitLab, see above)
* jenkins (replaced with GitLab, see above)
* scale GitLab with ongoing and surely expanding usage
* possibly split into multiple servers
* throw more hardware at it?
* monitoring?
* provide reliable and simple continuous integration services
* retire Jenkins
* replace with GitLab CI, with Windows, Mac and Linux runners
* avoid duplicate git hosting infrastructure
* retire gitolite, gitweb ([issue 36](https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/36))
* fix blog formatting and comment moderation, possible solutions:
* migrate to a static website and Discourse
* fix formatting and improve moderation within Drupal
* retire more services:
* SVN
* fpcentral
* schleuder?
* testnet?
* gitolite (to GitLab, see above)
* gitweb (to GitLab, see above)
* jenkins (to GitLab, see above)
* stabilise service offering, possible solutions:
* retire services (see above)
* balance FSN/CHI ganeti clusters
* finish transitions and migrations (e.g. GitLab, main website,
etc)
* document "downtimes of 1 hour or longer", maybe as part of the
monthly report ("how many 9's?"), and suggest mitigations when
downtimes occur (maybe just a static site made with [cstate](https://github.com/cstate/cstate),
with contingencies for when the static site network goes down, of
course); see https://gitlab.torproject.org/tpo/tpa/team/-/issues/40138
* the above probably requires auditing and reducing noise in Nagios
alerts, because alert fatigue makes them useless for detecting
outages right now
* improve developer experience:
* provide development/experimental VMs?
* give developers more tools to debug problems (e.g. grafana, stack
traces hidden in syslog)
* improve interaction between TPA and devs when new services are
set up
Also note the following 2020 goals that are not mentioned above and
might be added:
* improve communications and monitoring:
* document "downtimes of 1 hour or longer" in a status page ([issue
40138](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40138))
* reduce alert fatigue in Nagios
* publicize debugging tools (Grafana, user-level logging in systemd
services)
* encourage communication and ticket creation
* move root@ and tpa "noise" to RT ([ticket 31242](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31242)), make a real
mailing list for admins so that gaba and non-technical people can join
* plan for hiro's vacation (replacement?)
* moly retirement
* solr/search.tpo deployment
* web metrics (#32996)
* varnish to nginx conversion (#32462)
## Nice to have
## TODO: Need to have
## TODO: Nice to have
## TODO: Non-goals
# TODO: Quarterly breakdown
* improve sysadmin code base
* avoid YOLO commits in Puppet (possibly: server-side linting, CI)
* publish our Puppet repository ([ticket 29387](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387))
* reduce dependency on Python 2 code (see [short term LDAP plan](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ldap#short-term-merge-with-upstream-port-to-python-3-if-necessary))
* reduce dependency on LDAP (move hosts to Puppet? see [mid term
LDAP plan](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ldap#mid-term-move-hosts-to-puppet-possibly-replace-ud-ldap-with-simpler-dashboard))
* retire more old services:
* testnet?
* schleuder?
* provide secure, end-to-end authentication of Tor source code
([issue 81](https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/81))
* finish retiring old hardware (moly, [ticket 29974](https://gitlab.torproject.org/legacy/trac/-/issues/29974))
* varnish to nginx conversion (#32462)
* solr/search.tpo deployment (#33106)?
* web metrics (#32996)?
* GitLab pages hosting
* experiment with containers/kubernetes for CI/CD
## Non-goals
* complete email service: not enough time / budget (or delegate + pay Riseup)
* "provide development/experimental VMs": would be possible through
GitLab CD, to be investigated once we have GitLab CI solidly
running
* "improve interaction between TPA and devs when new services are
set up": see "improve communications" above, and "experimental
VMs". The endgame here is that people will be able to deploy their own
services through Docker, but this will likely not happen in 2021
* static mirror network retirement / rearchitecture: we want to test
out GitLab pages first and see if it can provide a decent
alternative
* TODO: "finish main website transition", "broken links on
website"... should TPA cover for web stuff?
* TODO: are service admins still a thing? should we cover for things
like the metrics team?
* complete puppetization: legacy services are not in Puppet. That
is fine: we keep maintaining them by hand when relevant, but new
services should all be built in Puppet
* replace Nagios with Prometheus: not a short term goal, no clear
benefit. Reduce the noise in Nagios instead
# Quarterly breakdown
## Q1
@@ -73,6 +116,17 @@ this roadmap is concerned. It should include items we are fairly
certain to be able to complete within the next few months or
so. Postponing those could cause problems.
* email delivery improvements:
* handle bounces in CiviCRM ([issue 33037](https://gitlab.torproject.org/tpo/tpa/team/-/issues/33037))
* follow up on abuse complaints
* diagnose and resolve delivery issues (e.g. [Yahoo delivery
problems](https://gitlab.torproject.org/tpo/tpa/team/-/issues/34134))
* GitLab CI deployment, plan for Jenkins retirement
* set up a Discourse instance, deprecate blog comments?
* plan for blog replacement?
* document "downtimes of 1 hour or longer" in a status page ([issue
40138](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40138))
## Q2
Second quarter is a little more vague, but should still be
@@ -80,6 +134,17 @@ Second quarter is a little more vague, but should still be
wait a little longer or that are part of longer projects that will
take longer to complete.
* retire old services:
* SVN ([issue 17202](https://gitlab.torproject.org/tpo/tpa/team/-/issues/17202))
* fpcentral ([issue 40009](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40009))
* establish plan for gitolite/gitweb retirement
* improve sysadmin code base
* avoid YOLO commits in Puppet (possibly: server-side linting, CI)
* publish our Puppet repository ([ticket 29387](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387))
* reduce dependency on Python 2 code (see [short term LDAP plan](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ldap#short-term-merge-with-upstream-port-to-python-3-if-necessary))
* reduce dependency on LDAP (move hosts to Puppet? see [mid term
LDAP plan](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ldap#mid-term-move-hosts-to-puppet-possibly-replace-ud-ldap-with-simpler-dashboard))
## Q3
From our experience, after three quarters, things get difficult to
@@ -88,13 +153,19 @@ time before this time, which totally changed basic assumptions about
worker availability and priorities.
Also, a global pandemic basically tore the world apart, throwing
everything in the air.
everything in the air, so obviously plans kind of went out the
window. Hopefully this won't happen again and the pandemic will
somewhat subside, but we should plan for the worst.
* jenkins retirement?
## Q4
Obviously, the fourth quarter is sheer crystal balling at this stage,
but it should still be an interesting exercise to perform.
* gitolite/gitweb retirement?
# 2020 roadmap evaluation
The following is a review of the 2020 roadmap.