Raw import from Trac using Trac markup language. Authored by Alexander Hansen Færøy.
= About us =
The sysadmin team is responsible for managing machines under the `torproject.org` domain. It does ''not'' operate the Tor network in any form, nor is it responsible for ''all'' services running on `torproject.org`: that is the job of the various service admins responsible for those services.
Most of the documentation of the sysadmin team is in [https://help.torproject.org/tsa/ a different wiki] for now.
[[TOC]]
= Roadmap =
This page documents a possible roadmap for the TPA team for the year 2020.
Items should be [https://en.wikipedia.org/wiki/SMART_criteria SMART], that is:
* specific
* measurable
* achievable
* relevant
* time-bound
Main objectives (need to have):
* decommissioning of old machines (moly in particular)
* move critical services into ganeti
* buster upgrades before LTS
* within budget
Secondary objectives (nice to have):
* new mail service
* conversion of the kvm* fleet to ganeti for higher reliability and availability
* buster upgrade completion before anarcat vacation
Non-objective:
* service admin roadmapping?
* kubernetes cluster deployment?
Assertions:
* new gnt-fsn nodes with current hardware (PX62-NVMe, 118EUR/mth), cost savings possible with the AX line (-20EUR/mth) or by reducing disk space requirements (-39EUR/mth) per node
* cymru actually delivers hardware and is used for moly decom
* gitlab hardware requirements covered by another budget
* we absorb the extra bandwidth costs from the new hardware design (currently 38EUR per month but could rise when new bandwidth usage comes in) - could be shifted to TBB team or at least labeled as such
== TODO ==
* nextcloud roadmap
* identify critical services and realistic improvements #31243 (done)
* (anarcat & gaba) sort out each month by priority (mostly done for feb/march)
* (gaba) add keywords #tpa-roadmap- for each month (doing for february and march to test how this would work) (done)
* (anarcat) create missing tickets for february/march (partially done, missing some from hiro)
* (at tpa meeting) estimate tickets! (1pt = 1 day)
* (gaba) reorganize [https://nc.torproject.net/apps/onlyoffice/7374?filePath=%2FTeams%2FSysadmin%2FBudget%20Sysadmin.xlsx budget file] per month
* (gaba) create a roadmap for gitlab migration
* (gaba) find service admins for gitlab (nobody for trac in [https://trac.torproject.org/projects/tor/wiki/org/operations/services services page]) - gaba to talk with isa and alex and look for service admins (sent a mail to las vegas but nobody replied... I will talk with each team lead)
* have a shell account in the server
* restart/stop service
* upgrade services
* problems with the service
== January ==
* [x] catchup after holidays
* [x] agree internally on a roadmap for 2020
* [x] first phase of installer automation (setup-storage and friends) #31239
* [x] new FSN node in the Ganeti cluster (fsn-node-03) #32937
* [x] textile shutdown and VM relocation, 2 VMs to migrate #31686 (+86EUR)
* [x] enable needrestart fleet-wide (#31957)
* [x] review website build errors (#32996)
* [x] evaluate if discourse can be used as comments platform for the blog (#33105) <-- can we move this further down the road (not february) until gitlab is migrated? -->
* [x] communicate buster upgrade timeline to service admins DONE
* [x] buster upgrade 63% done: 48 buster, 28 stretch machines
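The percentage in the last item follows from the machine counts: 48 of 76 machines on buster is about 63%. A minimal sketch of that arithmetic, assuming the fleet consists only of buster and stretch machines; the function name `upgrade_progress` is illustrative and not part of any TPA tooling:

```python
def upgrade_progress(buster: int, stretch: int) -> int:
    """Return the share of machines running buster as a rounded percentage.

    Assumption: the fleet is counted as buster + stretch machines only.
    """
    total = buster + stretch
    if total == 0:
        raise ValueError("no machines to count")
    return round(100 * buster / total)


# Counts taken from the January status line.
print(upgrade_progress(48, 28))
```

The same function can be applied to the monthly targets further down, which are stated as round figures (70%, 80%, and so on), so a computed value may differ from the nominal target by a point.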
== February ==
capacity around 15 days (counting 2.5 days per week for anarcat and 5 days per month for hiro)
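The capacity figure above can be reproduced with a short calculation. This is a minimal sketch that assumes a four-week month and the "1pt = 1 day" convention from the TODO list; the constant names, function names, and the sample estimates are hypothetical.

```python
ANARCAT_DAYS_PER_WEEK = 2.5   # from the capacity note
HIRO_DAYS_PER_MONTH = 5       # from the capacity note
WEEKS_PER_MONTH = 4           # assumption: a month is counted as four weeks


def monthly_capacity(weeks: int = WEEKS_PER_MONTH) -> float:
    """Return the team capacity for one month, in days (1 point = 1 day)."""
    return ANARCAT_DAYS_PER_WEEK * weeks + HIRO_DAYS_PER_MONTH


def fits_capacity(estimates: list[float], weeks: int = WEEKS_PER_MONTH) -> bool:
    """Check whether the summed ticket estimates fit in the month's capacity."""
    return sum(estimates) <= monthly_capacity(weeks)


print(monthly_capacity())
# Hypothetical ticket estimates, in points, for illustration only.
print(fits_capacity([3, 5, 2]))
```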
[[TicketQuery(keywords~=tpa-roadmap-february,format=progress)]]
* 2020 roadmap officially adopted - done
* second phase of installer automation #31239 (esp. puppet automation, e.g. #32901, #32914) - done
* new gnt-fsn node (fsn-node-04) -118EUR=+40EUR (#33081) - done
* storm shutdown #32390 - done
* unifolium decom (after storm), 5 VMs to migrate, #33085 +72EUR=+158EUR - not completed
* buster upgrade 70% done: 53 buster (+5), 23 stretch (-5) - done: 54 buster (+6), 22 stretch (-6), 1 jessie
* migrate gitlab-01 to a new VM (gitlab-02) and use the omnibus package instead of ansible (#32949) - done
* migrate CRM machines to gnt and test with Giant Rabbit #32198 (priority) - not done
* automate upgrades: enable unattended-upgrades fleet-wide (#31957) - not done
* anti-censorship monitoring (external prometheus setup assistance) #31159 - not done
[[TicketQuery(keywords~=tpa-roadmap-february,format=table,order=priority,changetime,desc=false,col=id|summary|status|points|actualpoints|priority|severity|changetime|sponsor,group=owner,max=100)]]
== March ==
capacity around 15 days (counting 2.5 days per week for anarcat and 5 days per month for hiro)
[[TicketQuery(keywords~=tpa-roadmap-march,format=progress)]]
High possibility of overload here (two major decoms and many machine setups). Possible to push moly/cymru work to April?
* 2021 budget proposal?
* possible gnt-cymru cluster setup (~6 machines) #29397
* moly decom #29974, 5 VMs to migrate
* kvm3 decom, 7 VMs to migrate (inc. crm-int and crm-ext), #33082 +72EUR=+112EUR
* new gnt-fsn node (fsn-node-05) #33083 -118EUR=-6EUR
* eugeni VM migration to gnt-fsn #32803
* buster upgrade 80% done: 61 buster (+8), 15 stretch (-8)
* solr deployment (#33106)
* anti-censorship monitoring (external prometheus setup assistance) #31159
* nc.riseup.net cleanup #32391
* SVN shutdown? #17202
[[TicketQuery(keywords~=tpa-roadmap-march,format=table,order=priority,changetime,desc=false,col=id|summary|status|points|actualpoints|priority|severity|changetime|sponsor,group=owner,max=100)]]
== April ==
[[TicketQuery(keywords~=tpa-roadmap-april,format=progress)]]
* kvm4 decom, 9 VMs to migrate #32802 (w/o eugeni), +121EUR=+115EUR
* new gnt-fsn node (fsn-node-06) -118EUR=-3EUR
* buster upgrade 90% done: 68 buster (+7), 8 stretch (-7)
* solr configuration
[[TicketQuery(keywords~=tpa-roadmap-april,format=table,order=priority,changetime,desc=false,col=id|summary|status|points|actualpoints|priority|severity|changetime|sponsor,group=owner,max=100)]]
== May ==
[[TicketQuery(keywords~=tpa-roadmap-may,format=progress)]]
* kvm5 decom, 9 VMs to migrate #33084, +111EUR=+108EUR
* new gnt-fsn node (fsn-node-07) -118EUR=-10EUR
* buster upgrade 100% done: 76 buster (+8), 0 stretch (-8)
* current planned completion date of Buster upgrades
* start ramping down work, training and documentation
* solr text updates and maintenance
[[TicketQuery(keywords~=tpa-roadmap-may,format=table,order=priority,changetime,desc=false,col=id|summary|status|points|actualpoints|priority|severity|changetime|sponsor,group=owner,max=100)]]
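The EUR figures attached to the January to May items form a running balance: each decommission adds its freed monthly cost, and each new gnt-fsn node subtracts 118EUR. The sketch below recomputes that balance as a consistency check. It assumes that positive numbers are monthly savings and negative numbers are new monthly costs, and that within February the unifolium decom is counted before fsn-node-04 (the order in which the stated totals add up). The list layout and variable names are illustrative.

```python
from itertools import accumulate

# (month, item, monthly change in EUR), figures copied from the roadmap items.
# Assumption: positive = savings from a decom, negative = cost of a new node.
changes = [
    ("january", "textile shutdown", 86),
    ("february", "unifolium decom", 72),
    ("february", "fsn-node-04", -118),
    ("march", "kvm3 decom", 72),
    ("march", "fsn-node-05", -118),
    ("april", "kvm4 decom", 121),
    ("april", "fsn-node-06", -118),
    ("may", "kvm5 decom", 111),
    ("may", "fsn-node-07", -118),
]

# Running balance after each item.
running = list(accumulate(delta for _, _, delta in changes))

for (month, item, delta), balance in zip(changes, running):
    print(f"{month:<9} {item:<18} {delta:+5d}EUR = {balance:+5d}EUR")
```

Each printed balance should match the figure after the `=` sign in the corresponding roadmap item, ending at -10EUR after fsn-node-07.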
== June ==
[[TicketQuery(keywords~=tpa-roadmap-june,format=progress)]]
* Debian jessie LTS EOL, chiwui forcibly shut down #29399
* finish ramp-down, final bugfixing and training before vacation
* search.tp.o soft launch
[[TicketQuery(keywords~=tpa-roadmap-june,format=table,order=priority,changetime,desc=false,col=id|summary|status|points|actualpoints|priority|severity|changetime|sponsor,group=owner,max=100)]]
== July ==
[[TicketQuery(keywords~=tpa-roadmap-july,format=progress)]]
* Debian stretch EOL, final deadline for buster upgrades
* anarcat vacation
* tor meeting?
* hiro tentative vacations
[[TicketQuery(keywords~=tpa-roadmap-july,format=table,order=priority,changetime,desc=false,col=id|summary|status|points|actualpoints|priority|severity|changetime|sponsor,group=owner,max=100)]]
== August ==
[[TicketQuery(keywords~=tpa-roadmap-august,format=progress)]]
* anarcat vacation
* web metrics R&D (investigate a platform for web metrics) (#32996)
[[TicketQuery(keywords~=tpa-roadmap-august,format=table,order=priority,changetime,desc=false,col=id|summary|status|points|actualpoints|priority|severity|changetime|sponsor,group=owner,max=100)]]
== September ==
[[TicketQuery(keywords~=tpa-roadmap-september,format=progress)]]
* plan contingencies for christmas holidays
* catchup following vacation
* web metrics deployment
[[TicketQuery(keywords~=tpa-roadmap-september,format=table,order=priority,changetime,desc=false,col=id|summary|status|points|actualpoints|priority|severity|changetime|sponsor,group=owner,max=100)]]
== October ==
[[TicketQuery(keywords~=tpa-roadmap-october,format=progress)]]
* puppet work (finish prometheus module development, puppet environments, trocla, Hiera, publish code #29387)
* varnish to nginx conversion #32462
* web metrics soft launch (in time for eoy campaign)
* submit service R&D #30608
[[TicketQuery(keywords~=tpa-roadmap-october,format=table,order=priority,changetime,desc=false,col=id|summary|status|points|actualpoints|priority|severity|changetime|sponsor,group=owner,max=100)]]
== November ==
[[TicketQuery(keywords~=tpa-roadmap-november,format=progress)]]
* first submit service prototype? #30608
[[TicketQuery(keywords~=tpa-roadmap-november,format=table,order=priority,changetime,desc=false,col=id|summary|status|points|actualpoints|priority|severity|changetime|sponsor,group=owner,max=100)]]
== December ==
[[TicketQuery(keywords~=tpa-roadmap-december,format=progress)]]
* stabilisation & bugfixing
* 2021 roadmapping
* one or two week xmas holiday
* CCC?
[[TicketQuery(keywords~=tpa-roadmap-december,format=table,order=priority,changetime,desc=false,col=id|summary|status|points|actualpoints|priority|severity|changetime|sponsor,group=owner,max=100)]]
== 2021 preview ==
Objectives:
* complete puppetization
* experiment with containers/kubernetes?
* close and merge more services
* replace nagios with prometheus? #29864
* new hire?
Monthly goals:
* january: roadmap approval
* march/april: anarcat vacation