TPA uses [Puppet](https://puppet.com/) to manage all servers it operates. It handles most of the configuration management of the base operating system and some services. It is *not* designed to handle ad hoc tasks, for which we favor the use of [fabric](howto/fabric).

[[_TOC_]]

# Tutorial

This page is long! This first section aims to get you started with a simple task quickly.

## Adding an IP address to the global allow list

In this tutorial, we will add an IP address to the global allow list, on all firewalls on all machines. This is a big deal! It will allow that IP address to access the SSH servers on all boxes and more. This should be a **static** IP address on a trusted network.

If you have never used Puppet before or are nervous at all about making such a change, it is a good idea to have a more experienced sysadmin nearby to help you. They can also confirm this tutorial is what is actually needed.

1. To make any change on the Puppet server, you will first need to clone the git repository:

        git clone pauli.torproject.org:/srv/puppet.torproject.org/git/tor-puppet

   This only needs to be done once.

2. The firewall rules are defined in the `ferm` module, which lives in `modules/ferm`. The file you specifically need to change is `modules/ferm/templates/defs.conf.erb`, so open that in your editor of choice:

        $EDITOR modules/ferm/templates/defs.conf.erb

3. The code you are looking for is `ADMIN_IPS`. Add a `@def` for your IP address and add the new macro to the `ADMIN_IPS` macro. When you exit your editor, git should show you a diff that looks something like this:

        --- a/modules/ferm/templates/defs.conf.erb
        +++ b/modules/ferm/templates/defs.conf.erb
        @@ -77,7 +77,10 @@ def $TPO_NET = (<%= networks.join(' ') %>);
         @def $linus = ();
         @def $linus = ($linus 193.10.5.2/32); # kcmp@adbc
         @def $linus = ($linus 2001:6b0:8::2/128); # kcmp@adbc
        -@def $ADMIN_IPS = ($weasel $linus);
        +@def $anarcat = ();
        +@def $anarcat = ($anarcat 203.0.113.1/32); # home IP
        +@def $anarcat = ($anarcat 2001:DB8::DEAD/128 2001:DB8:F00F::/56); # home IPv6
        +@def $ADMIN_IPS = ($weasel $linus $anarcat);

         @def $BASE_SSH_ALLOWED = ();

4. Then you can commit this and *push*:

        git commit -m'add my home address to the allow list' && git push

5. Then you should log in to one of the hosts and make sure the code applies correctly:

        ssh -tt perdulce.torproject.org sudo puppet agent -t

Puppet shows colorful messages. If nothing is red and it returns correctly, you are done. If that doesn't work, go back to step 2. If you are still stuck, ask for help from a colleague in the Tor sysadmin team.

If this works, congratulations, you have made your first change across the entire Puppet infrastructure! You might want to look at the rest of the documentation to learn more about how to do different tasks and how things are set up. A key "How to" we recommend is the `Progressive deployment` section below, which will teach you how to make a change like the above while making sure you don't break anything, even if it affects a lot of machines.

# How to guides

## Modifying an existing configuration

For new deployments, this is *NOT* the preferred method. For example, if you are deploying new software that is not already in use in our infrastructure, do *not* follow this guide and instead follow the `Adding a new module` guide below.

If you are touching an *existing* configuration, things are much simpler: you simply go to the module where the code already exists and make changes. You `git commit` and `git push` the code, then immediately run `puppet agent -t` on the affected node.
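In practice, the loop looks something like this sketch (the file and host names here are purely illustrative, pick the ones relevant to your change):

    $EDITOR modules/ferm/manifests/init.pp   # hypothetical file, edit the one you need
    git commit -a -m'describe the change'
    git push
    ssh -tt perdulce.torproject.org sudo puppet agent -t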
Look at the `File layout` section above to find the right piece of code to modify. If you are making changes that potentially affect more than one host, you should also definitely look at the `Progressive deployment` section below.

## Adding a new module

This is a broad topic, but let's take the Prometheus monitoring system as an example, which followed the [role/profile/module][] pattern.

First, the [Prometheus modules on the Puppet forge][] were evaluated for quality and popularity. There was a clear winner there: the [Prometheus module][] from [Vox Pupuli][] had hundreds of thousands more downloads than the [next option][], which was deprecated.

[next option]: https://forge.puppet.com/brutus777/prometheus
[Vox Pupuli]: https://voxpupuli.org/
[Prometheus module]: https://forge.puppet.com/puppet/prometheus
[Prometheus modules on the Puppet forge]: https://forge.puppet.com/modules?q=prometheus

Next, the module was added to the Puppetfile (in `3rdparty/Puppetfile`):

    mod 'puppet-prometheus', '6.4.0'

... and librarian-puppet was run:

    librarian-puppet install

This fetched a lot of code from the Puppet forge: the stdlib, archive and system modules were all installed or updated. All those modules were audited manually, by reading each file and looking for obvious security flaws or back doors. Then the code was committed into git:

    git add 3rdparty
    git commit -m'install prometheus module after audit'

Then the module was configured in a profile, in `modules/profile/manifests/prometheus/server.pp`:

    class profile::prometheus::server {
      class { 'prometheus::server':
        # follow prom2 defaults
        localstorage      => '/var/lib/prometheus/metrics2',
        storage_retention => '15d',
      }
    }

The above contains our local configuration for the upstream `prometheus::server` class installed in the `3rdparty` directory. In particular, it sets a retention period and a different path for the metrics, so that they follow the new Prometheus 2.x defaults.

Then this profile was added to a *role*, in `modules/roles/manifests/monitoring.pp`:

    # the monitoring server
    class roles::monitoring {
      include profile::prometheus::server
    }

Notice how the role does not refer to any implementation detail, like the fact that the monitoring server uses Prometheus. It looks like a trivial, even useless, class, but it can actually grow to include *multiple* profiles.

Then that role is added to the Hiera configuration of the monitoring server, in `hiera/nodes/hetzner-nbg1-01.torproject.org.yaml`:

    classes:
      - roles::monitoring

And Puppet was run on the host, with:

    puppet agent --enable ; puppet agent -t --noop ; puppet agent --disable "testing prometheus deployment"

This led to some problems, as the upstream module doesn't support installing from Debian packages. Support for Debian was added to the code in `3rdparty/modules/prometheus`, and committed into git:

    emacs 3rdparty/modules/prometheus/manifests/*.pp # magic happens
    git commit -m'implement all the missing stuff' 3rdparty
    git push

And the above Puppet command line was run again, repeating that loop until things were good.

If you need to deploy the code to multiple hosts, see the `Progressive deployment` section below. To contribute changes back upstream (and you should do so), see the section right below.

## Contributing changes back upstream

For simple changes, the above workflow works well, but eventually it is preferable to actually fork the upstream repository and operate on our fork until the changes are merged upstream.
First, the modified module is moved out of the way:

    mv 3rdparty/modules/prometheus{,.orig}

The module is then forked on GitHub or wherever it is hosted, and then added to the Puppetfile:

    mod 'puppet-prometheus',
        :git => 'https://github.com/anarcat/puppet-prometheus.git',
        :branch => 'deploy'

Then Librarian is run again to fetch that code:

    librarian-puppet install

Because Librarian is a little dumb, it might check out your module in "detached head" mode, in which case you will want to fix the checkout:

    cd 3rdparty/modules/prometheus
    git checkout deploy
    git reset --hard origin/deploy
    git pull

Note that the `deploy` branch here is a merge of all the different branches proposed upstream in different pull requests, but it could also be the `master` branch or a single branch if only a single pull request was sent.

Since you now have a clone of the upstream repository, you can push and pull normally with upstream. When you make a change, however, you need to commit (and push) the change *both* in the sub-repository and the main repository:

    cd 3rdparty/modules/prometheus
    $EDITOR manifests/init.pp # more magic stuff
    git commit -m'change the frobatz to a argblu'
    git push
    cd ..
    git commit -m'change the frobatz to a argblu'
    git push

Often, I make commits directly in our main Puppet repository, without pushing to the third-party fork, until I am happy with the code, and then I craft a nice pretty commit that can be pushed upstream, reversing that process:

    $EDITOR 3rdparty/prometheus/manifests/init.pp # dirty magic stuff
    git commit -m'change the frobatz to a quuxblah'
    git push
    # see if that works, generally not
    git commit -m'rah. wanted a quuxblutz'
    git push
    # now we are good, update our pull request
    cd 3rdparty/modules/prometheus
    git commit -m'change the frobatz to a quuxblutz'
    git push

It's annoying to double-commit things, but I haven't found a better way to do this just yet. This problem is further discussed in [ticket #29387][].

Also note that when you update code like this, the `Puppetfile` does not change, but the `Puppetfile.lock` file *does* change. The `GIT.sha` parameter needs to be updated. This can be done by hand, but since that is error-prone, you might want to simply run this to update modules:

    librarian-puppet update

This will *also* update dependencies, so make sure you audit those changes before committing and pushing.

## Running tests

Ideally, Puppet modules have a test suite. This is done with [rspec-puppet](https://rspec-puppet.com/) and [rspec-puppet-facts](https://github.com/mcanevet/rspec-puppet-facts). This is not very well documented upstream, but it's apparently part of the [Puppet Development Kit](https://puppet.com/docs/pdk/1.x/pdk.html) (PDK). Anyways: assuming tests exist, you will want to run some tests before pushing your code upstream, or at least upstream might ask you for this before accepting your changes. Here's how to get set up:

    sudo apt install ruby-rspec-puppet ruby-puppetlabs-spec-helper ruby-bundler
    bundle install --path vendor/bundle

This installs some basic libraries system-wide (Ruby bundler and the rspec stuff). Unfortunately, the required Ruby code is rarely all present in Debian and you still need to install extra gems. In this case we set them up within the `vendor/bundle` directory to isolate them from the global search path.
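If a module has no tests yet, a minimal spec can be a good starting point. As an illustrative sketch only (the class name is hypothetical and the layout depends on the module), a spec using rspec-puppet and rspec-puppet-facts might look like this:

    # spec/classes/server_spec.rb -- illustrative sketch only
    require 'spec_helper'

    describe 'profile::prometheus::server' do
      on_supported_os.each do |os, os_facts|
        context "on #{os}" do
          let(:facts) { os_facts }

          # the catalog should compile cleanly and pull in the upstream class
          it { is_expected.to compile.with_all_deps }
          it { is_expected.to contain_class('prometheus::server') }
        end
      end
    end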
Finally, to run the tests, you need to wrap your invocation with `bundle exec`, like so:

    bundle exec rake test

## Listing all hosts under puppet

This will list all active hosts known to the Puppet master:

    ssh -t pauli.torproject.org 'sudo -u postgres psql puppetdb -P pager=off -A -t -c "SELECT c.certname FROM certnames c WHERE c.deactivated IS NULL"'

The following will list all hosts under Puppet and their `virtual` value:

    ssh -t pauli.torproject.org "sudo -u postgres psql puppetdb -P pager=off -F',' -A -t -c \"SELECT c.certname, value_string FROM factsets fs INNER JOIN facts f ON f.factset_id = fs.id INNER JOIN fact_values fv ON fv.id = f.fact_value_id INNER JOIN fact_paths fp ON fp.id = f.fact_path_id INNER JOIN certnames c ON c.certname = fs.certname WHERE fp.name = 'virtual' AND c.deactivated IS NULL\"" | tee hosts.csv

The resulting file is a Comma-Separated Values (CSV) file which can be used for other purposes later.

Possible values of the `virtual` field can be obtained with a similar query:

    ssh -t pauli.torproject.org "sudo -u postgres psql puppetdb -P pager=off -A -t -c \"SELECT DISTINCT value_string FROM factsets fs INNER JOIN facts f ON f.factset_id = fs.id INNER JOIN fact_values fv ON fv.id = f.fact_value_id INNER JOIN fact_paths fp ON fp.id = f.fact_path_id WHERE fp.name = 'virtual';\""

The currently known values are: `kvm`, `physical`, and `xenu`.

As a bonus, this query will show the number of hosts running each Debian release:

    SELECT COUNT(c.certname), value_string
    FROM factsets fs
      INNER JOIN facts f ON f.factset_id = fs.id
      INNER JOIN fact_values fv ON fv.id = f.fact_value_id
      INNER JOIN fact_paths fp ON fp.id = f.fact_path_id
      INNER JOIN certnames c ON c.certname = fs.certname
    WHERE fp.name = 'lsbdistcodename' AND c.deactivated IS NULL
    GROUP BY value_string;

### Other ways of extracting a host list

* Using the [PuppetDB API][]:

        curl -s -G http://localhost:8080/pdb/query/v4/facts | jq -r ".[].certname"

  The [fact API][] is quite extensive and allows for very complex queries.
  For example, this shows all hosts with the `apache2` fact set to `true`:

        curl -s -G http://localhost:8080/pdb/query/v4/facts --data-urlencode 'query=["and", ["=", "name", "apache2"], ["=", "value", true]]' | jq -r ".[].certname"

  This will list all hosts sorted by their report date, older first, followed by the timestamp, space-separated:

        curl -s -G http://localhost:8080/pdb/query/v4/nodes | jq -r 'sort_by(.report_timestamp) | .[] | "\(.certname) \(.report_timestamp)"' | column -s\  -t

  This will list all hosts with the `roles::static_mirror` class:

        curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=inventory[certname] { resources { type = "Class" and title = "Roles::Static_mirror" }} ' | jq .[].certname

  This will show all hosts running Debian buster:

        curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=nodes { facts { name = "lsbdistcodename" and value = "buster" }}' | jq .[].certname

* Using [howto/cumin](howto/cumin)

* Using LDAP:

        HOSTS=$(ssh alberti.torproject.org 'ldapsearch -h db.torproject.org -x -ZZ -b dc=torproject,dc=org -LLL "hostname=*.torproject.org" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort')
        for i in `echo $HOSTS`; do mkdir hosts/x-$i 2>/dev/null || continue; echo $i; ssh $i ' ...'; done

  The `mkdir` is there so that I can run the same command in many terminal windows and each host only gets visited once.

[PuppetDB API]: https://puppet.com/docs/puppetdb/4.3/api/index.html
[fact API]: https://puppet.com/docs/puppetdb/4.3/api/query/v4/facts.html

## Batch jobs on all hosts

With that trick, a job can be run on all hosts with [parallel-ssh][]; for example, to check the `uptime`:

    cut -d, -f1 hosts.csv | parallel-ssh -i -h /dev/stdin uptime

This would do the same, but only on physical servers:

    grep 'physical$' hosts.csv | cut -d, -f1 | parallel-ssh -i -h /dev/stdin uptime

This would fetch the `/etc/motd` on all machines:

    cut -d, -f1 hosts.csv | parallel-slurp -h /dev/stdin -L motd /etc/motd motd

To run batch commands through `sudo` that require a password, you will need to fool both `sudo` and ssh a little more:

    cut -d, -f1 hosts.csv | parallel-ssh -P -I -i -x -tt -h /dev/stdin -o pvs sudo pvs

You should then type your password, followed by Control-D. Warning: this will show your password on your terminal and probably in the logs as well.

Batch jobs can also be run on all Puppet hosts with Cumin:

    ssh -N -L8080:localhost:8080 pauli.torproject.org &
    cumin '*' uptime

See [howto/cumin](howto/cumin) for more examples.

[parallel-ssh]: https://parallel-ssh.org/

## Progressive deployment

If you are making a major change to the infrastructure, you may want to deploy it progressively. A good way to do so is to include the new class manually in the node configuration, say in `hiera/nodes/$fqdn.yaml`:

    classes:
      - my_new_class

Then you can check the effect of the class on the host with the `--noop` mode. Make sure you disable Puppet so that automatic runs do not actually execute the code:

    puppet agent --disable "testing my_new_class deployment"

Then the new manifest can be simulated with this command:

    puppet agent --enable ; puppet agent -t --noop ; puppet agent --disable "testing my_new_class deployment"

Examine the output and, once you are satisfied, you can re-enable the agent and actually run the manifest with:

    puppet agent --enable ; puppet agent -t

If the change is *inside* an existing class, that change can be enclosed in a class parameter and that parameter can be passed as an argument from Hiera.
This is how the transition to a managed `/etc/apt/sources.list` file was done:

1. First, a parameter was added to the class that would remove the file, defaulting to `false`:

        class torproject_org(
          Boolean $manage_sources_list = false,
        ) {
          if $manage_sources_list {
            # the above repositories overlap with most default sources.list
            file { '/etc/apt/sources.list':
              ensure => absent,
            }
          }
        }

2. Then that parameter was enabled on one host, say in `hiera/nodes/brulloi.torproject.org.yaml`:

        torproject_org::manage_sources_list: true

3. Puppet was run on that host using the simulation mode:

        puppet agent --enable ; puppet agent -t --noop ; puppet agent --disable "testing my_new_class deployment"

4. When satisfied, the real operation was done:

        puppet agent --enable ; puppet agent -t

5. Then this was added to two other hosts, and Puppet was run there.

6. Finally, all hosts were checked to see if the file was still present and had any content, with [howto/cumin](howto/cumin) (see above for an alternative way of running a command on all hosts):

        cumin '*' 'du /etc/apt/sources.list'

7. Since it was missing everywhere, the parameter was set to `true` by default and the custom configuration removed from the three test nodes.

8. Then Puppet was run by hand everywhere, using Cumin, with a batch of 5 hosts at a time:

        cumin -o txt -b 5 '*' 'puppet agent -t'

   Because Puppet returns a non-zero value when changes are made, this will stop as soon as any one host in a batch of 5 actually makes a change. You can then examine the output and see if the change is legitimate, or abort the configuration change.

## Debugging things

When a Puppet manifest is not behaving as it should, the first step is to run it by hand on the host:

    puppet agent -t

If that doesn't yield enough information, you can see pretty much everything that Puppet does with the `--debug` flag. This will, for example, include the commands run by `Exec` resources' `onlyif` parameters and allow you to see why they do not work correctly (a common problem):

    puppet agent -t --debug

Finally, some errors show up only on the Puppet server: look in `/var/log/daemon.log` there for errors that will only show up there.

Connecting to the PuppetDB database itself can sometimes be easier than trying to operate the API. There you can inspect the entire thing as a normal SQL database; use this to connect:

    sudo -u postgres psql puppetdb

Exported resources can sometimes do surprising things. It is useful to look at the actual PuppetDB to figure out which tags exported resources have. For example, this query lists all exported resources with `troodi` in the name:

    SELECT certname_id,type,title,file,line,tags FROM catalog_resources WHERE exported = 't' AND title LIKE '%troodi%';

Keep in mind that there are [automatic tags](https://puppet.com/docs/puppet/6.4/lang_tags.html) in exported resources which can complicate things.

## Password management

If you need to set a password in a manifest, there are special functions to handle this. We do not want to store passwords directly in Puppet source code, for various reasons: it is hard to erase because code is stored in git, but also, ultimately, we want to publish that source code publicly.

We have two mechanisms for this now: an HKDF to generate passwords by hashing a common secret, and Trocla, which generates random passwords and stores the hash or, if necessary, the clear text in a YAML file. The HKDF function is deprecated and should be [replaced by Trocla][trocla-migration] eventually.
[trocla-migration]: https://bugs.torproject.org/30009

### hkdf

NOTE: this procedure is DEPRECATED and Trocla should be used instead, see the [trocla migration ticket][trocla-migration] for details.

Old passwords in Puppet are managed through a [Key Derivation Function][] (KDF), more specifically a [hash-based KDF][] that takes a secret stored on the Puppet master (in `/etc/puppet/secret`), concatenates it with a unique token picked by the caller, and generates a secret unique to that token. An example:

[hash-based KDF]: https://en.wikipedia.org/wiki/HKDF
[Key Derivation Function]: https://en.wikipedia.org/wiki/Key_derivation_function

    $secret = hkdf('/etc/puppet/secret', "dip-${::hostname}-base-secret")

This generates a unique password for the given token. The password is then used, in clear text, by the Puppet client as appropriate.

The function is an implementation of [RFC5869][], a [SHA256][]-based HKDF taken from an earlier version of [John Downey's Rubygems implementation][].

[John Downey's Rubygems implementation]: https://rubygems.org/gems/hkdf
[RFC5869]: https://tools.ietf.org/html/rfc5869
[SHA256]: https://en.wikipedia.org/wiki/SHA-2

### Trocla

[Trocla][] is another password-management solution that takes a different approach. With Trocla, each password is generated on the fly from a secure entropy source ([Ruby's SecureRandom module][]) and stored inside a state file (in `/var/lib/trocla/trocla_data.yml`, configured in `/etc/puppet/troclarc.yaml`) on the Puppet master.

Trocla can return "hashed" versions of the passwords, so that the plain-text password is never visible from the client. The plain text can still be stored on the Puppet master, or it can be deleted once it has been transmitted to the user or another password manager. This makes it possible to have Trocla not keep any secret at all.

[Ruby's SecureRandom module]: https://ruby-doc.org/stdlib-1.9.3/libdoc/securerandom/rdoc/SecureRandom.html
[Trocla]: https://github.com/duritong/trocla

This piece of code will generate a [bcrypt][]-hashed password for the Grafana admin, for example:

    $grafana_admin_password = trocla('grafana_admin_password', 'bcrypt')

The plain text for that password will never leave the Puppet master. It will still be stored on the Puppet master, and you can see the value with:

    trocla get grafana_admin_password plain

... on the command line.

[bcrypt]: https://en.wikipedia.org/wiki/Bcrypt

A password can also be set with this command:

    trocla set grafana_guest_password plain

Note that this might *erase* other formats for this password, although those will get regenerated as needed. Also note that `trocla get` will fail if the particular password or format requested does not exist. For example, say you generate a plain-text password and then request the `bcrypt` version:

    trocla create test plain
    trocla get test bcrypt

This will return the empty string instead of the hashed version. Instead, use `trocla create` to generate that password. In general, it's safe to use `trocla create` as it will reuse existing passwords. That is actually how the `trocla()` function behaves in Puppet as well.

## Getting facts from other hosts

TODO: expand.
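In the meantime, here is a sketch of the general approach, assuming the `puppetdbquery` module (which provides the `query_nodes()` function mentioned in the notes below) is available on the Puppet server; the class, fact, and file names are purely illustrative:

    # illustrative sketch only: collect the IPv6 address of every host that
    # has the roles::monitoring class, e.g. to allow them through a firewall
    $ipfact = 'networking.ip6'
    $monitoring_ips = query_nodes('Class[Roles::Monitoring]', $ipfact)

    # the resulting array can then be used in a template or resource
    file { '/etc/monitoring-servers.txt':
      content => join($monitoring_ips, "\n"),
    }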
These notes from an IRC discussion show how the `query_nodes()` function can be used for this:

```
02:37:52 <bastelfreak> anarcat: query_nodes('Class[Profiles::Kafkabroker]') gets you all FQDNs from all nodes with that class in the catalog
02:38:52 <bastelfreak> anarcat: for all ips: $ipfact = 'networking.interfaces.enp0s5.ip6' \n query_nodes('Class[Profiles::Cephmon]', $ipfact)
02:39:24 <bastelfreak> anarcat: and something like this if you want ips from all nodes except the current ones (e.g. for firewalling): $elknodeips = query_nodes("Class[Profiles::Elasticsearch] and ${ipfact} != '${ipv6}'", $ipfact)
09:09:34 <bastelfreak> you can pass any factname
09:09:40 <bastelfreak> but please don't use legacy facts
09:09:44 <bastelfreak> use networking.ip
09:09:51 <bastelfreak> or networking.ip6 !
```

## Revoking and generating a new certificate for a host

Problems with the revocation procedures were discussed in [33587][] and [33446][].

[33587]: https://bugs.torproject.org/33587
[33446]: https://gitlab.torproject.org/legacy/trac/-/issues/33446#note_2349434

1. Clean the certificate on the master:

        puppet cert clean host.torproject.org

2. Clean the certificate on the client:

        find /var/lib/puppet/ssl -name host.torproject.org.pem -delete

3. Then run the bootstrap script on the client from `tsa-misc/installer/puppet-bootstrap-client` and get a new checksum.

4. Run `tpa-puppet-sign-client` on the master and pass the checksum.

5. Run `puppet agent -t` to get Puppet running on the client again.

## Pager playbook

### catalog run: PuppetDB warning: did not update since...

If you see an error like:

    Check last node runs from PuppetDB WARNING - cupani.torproject.org did not update since 2020-05-11T04:38:54.512Z

It may eventually also be accompanied by the Puppet server reporting the same problem:

    Subject: ** PROBLEM Service Alert: pauli/puppet - all catalog runs is WARNING **
    [...]
    Check last node runs from PuppetDB WARNING - cupani.torproject.org did not update since 2020-05-11T04:38:54.512Z

One of the following is happening, in decreasing order of likelihood:

1. the node's Puppet manifest has an error of some sort that makes it impossible to run the catalog
2. the node is down and has failed to report since the last time specified
3. the Puppet **server** is down and **all** nodes will fail to report in the same way (in which case a lot more warnings will show up, and other warnings about the server will come in)

The first situation will usually happen after someone pushed a commit introducing the error. We try to keep all manifests compiling all the time and such errors should be immediately fixed. Look at the history of the Puppet source tree and try to identify the faulty commit. Reverting such a commit is acceptable to restore the service.

The second situation can happen if a node is in maintenance for an extended duration. Normally, the node will recover when it goes back online. If a node is to be permanently retired, it should be removed from Puppet, using the [host retirement procedures][retire-a-host].

Finally, if the main Puppet **server** is down, it should definitely be brought back up. See disaster recovery, below.

In any case, running the Puppet agent on the affected node should give more information:

    ssh NODE puppet agent -t

## Disaster recovery

<!-- what to do if all goes to hell. e.g. restore from backups? -->
<!-- rebuild from scratch? not necessarily those procedures (e.g. see -->
<!-- "Installation" below but some pointers. -->

TODO.

# Reference

This documents generally how things are set up.

## Installation

TODO. It is not yet clear how the Puppetmaster was set up or how to build a new one.
The interactions with other tools like Nagios and LDAP especially need to be documented.

## SLA

No formal SLA is defined. Puppet runs from a fairly slow `cron` job so it doesn't have to be highly available right now. This could change in the future if we rely more on it for deployments.

## Design

<!-- how this is built -->
<!-- should reuse and expand on the "proposed solution", it's a -->
<!-- "as-built" documented, whereas the "Proposed solution" is an -->
<!-- "architectural" document, which the final result might differ -->
<!-- from, sometimes significantly -->
<!-- a good guide to "audit" an existing project's design: -->
<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->

### Before it all starts

- `puppet.tpo` is currently being run on `pauli.tpo`
- This is where the tor-puppet git repository lives
- The repository has hooks to populate `/etc/puppet` with its contents, most notably the modules directory.
- All paths in this document are relative to the root of this repository.

### File layout

- `3rdparty/modules` includes modules that are shared publicly and do not contain any TPO-specific configuration. There is a `Puppetfile` there that documents where each module comes from and that can be maintained with [r10k][] or [librarian][].

  [librarian]: https://librarian-puppet.com/
  [r10k]: https://github.com/puppetlabs/r10k/

- `modules` includes roles, profiles, and classes that make up the bulk of our configuration.

- In there, the `roles` class (`modules/roles/manifests/init.pp`) maps services to roles, using the `$nodeinfo` variable.

- The `torproject_org` module (`modules/torproject_org/manifests/init.pp`) performs basic host initialisation, like configuring Debian mirrors and APT sources, installing a base set of packages, configuring puppet and timezone, setting up a bunch of configuration files and running `ud-replicate`.

- In there, `local.yaml` (`modules/torproject_org/misc/local.yaml`) defines services and lists which host(s) supply each service. `local.yaml` is read by the `roles` class above for setting up the `$localinfo` and `$nodeinfo` variables. It also defines the `$roles` parameter and defines `ferm` macros.

- There is also the `hoster.yaml` file (`modules/torproject_org/misc/hoster.yaml`) which defines hosting providers and specifies things like which network blocks they use, and whether they have a DNS resolver or a Debian mirror. `hoster.yaml` is read by

  - the `nodeinfo()` function (`modules/puppetmaster/lib/puppet/parser/functions/nodeinfo.rb`), used for setting up the `$nodeinfo` variable
  - `ferm`'s `defs.conf` template (`modules/ferm/templates/defs.conf.erb`)

- The root of definitions and execution in Puppet is found in the `manifests/site.pp` file, but this file is now mostly empty, in favor of Hiera.

Note that the above is the current state of the file hierarchy. As part of the transition to Hiera, a lot of the above architecture will change in favor of the more standard [role/profile/module][] pattern. See [ticket #29387][] for an in-depth discussion.

[role/profile/module]: https://puppet.com/docs/pe/2017.2/r_n_p_intro.html
[ticket #29387]: https://bugs.torproject.org/29387

### Custom facts

`modules/torproject_org/lib/facter/software.rb` defines our custom facts, making it possible to get an answer to questions like "Is this host running `apache2`?" by simply looking at a Puppet variable.
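For illustration, a custom fact of that kind could look like the following minimal sketch (this is not the actual content of `software.rb`, just an example of the mechanism):

    # hypothetical example of a custom fact, not the real software.rb
    Facter.add('apache2') do
      setcode do
        # report whether the apache2 binary is present on the host
        File.exist?('/usr/sbin/apache2')
      end
    end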
### Style guide

Puppet manifests should generally follow the [Puppet style guide][]. This can easily be done with [Flycheck][] in Emacs, [vim-puppet][], or a similar plugin in your favorite text editor.

Many files do not *currently* follow the style guide, as they *predate* the creation of said guide. Files should *not* be completely reformatted unless there's a good reason. For example, if a conditional covering a large part of a file is removed and the file needs to be re-indented, it's a good opportunity to fix style in the file. The same goes if a file is split in two components or for some other reason completely rewritten. Otherwise the style already in use in the file should be followed.

[Puppet style guide]: https://puppet.com/docs/puppet/4.8/style_guide.html
[Flycheck]: http://flycheck.org/
[vim-puppet]: https://github.com/rodjek/vim-puppet

### Hiera

[Hiera][] is a "key/value lookup tool for configuration data" which Puppet uses to look up values for class parameters and node configuration in general.

We are in the process of transitioning over to this mechanism from our previous custom YAML lookup system. This documents the way we currently use Hiera.

[Hiera]: https://puppet.com/docs/hiera/3.2/

#### Classes definitions

Each host declares which classes it should include through a `classes` parameter. For example, this is what configures a Prometheus server:

    classes:
      - roles::monitoring

Roles should be *abstract* and *not* implementation specific. Each role includes a set of profiles which *are* implementation specific. For example, the `monitoring` role includes `profile::prometheus::server` and `profile::grafana`.

Do *not* include profiles directly from Hiera. As a temporary exception to this rule, old modules can be included as we transition from the `has_role` mechanism to Hiera, but eventually those should be ported to shared modules from the Puppet forge, with our glue built into a profile on top of the third-party module. The role `roles::monitoring` follows that pattern correctly.

#### Node configuration

On top of the host configuration, some node-specific configuration can be performed from Hiera. This should be avoided as much as possible, but sometimes there is just no other way. A good example was the `build-arm-*` nodes which included the following configuration:

    bacula::client::ensure: "absent"

This disables backups on those machines, which are normally configured everywhere. This is done because they are behind a firewall and therefore not reachable, an unusual condition in the network. Another example is `nutans`, which sits behind a NAT so it doesn't know its own IP address. To export proper firewall rules, the allow address has been overridden as such:

    bind::secondary::allow_address: 89.45.235.22

Those types of parameters are normally guessed automatically inside the modules' classes, but they can be overridden from Hiera.

Note: eventually *all* host configuration will be done here, but there are currently still some configurations hardcoded in individual modules. For example, the Bacula director is hardcoded in the `bacula` base class (in `modules/bacula/manifests/init.pp`). That should be moved into a class parameter, probably in `common.yaml`.

### Cron and scheduling

The Puppet agent is *not* running as a daemon, it's running through good old `cron`. Puppet runs on each node every four hours, although with a random 2h jitter, so the actual frequency is somewhere between 4 and 6 hours. This configuration is in `/etc/cron.d/puppet-crontab` and deployed by Puppet itself, currently as part of the `torproject_org` module.
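For reference, such a cron entry could look roughly like this sketch (the real file is generated by Puppet; the exact schedule and agent options here are assumptions):

    # illustrative sketch only -- not the deployed /etc/cron.d/puppet-crontab
    # m h dom mon dow user command
    0 */4 * * * root puppet agent --onetime --no-daemonize --splay --splaylimit 2h --logdest syslog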
## Issues

There is no issue tracker specifically for this project; [file][File] or [search][] for issues in the [team issue tracker][search] component.

[File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues

## Monitoring and testing

Puppet is hooked into Nagios in two ways:

* one job runs on the Puppetmaster and checks PuppetDB for reports. This is done with a [patched](https://github.com/evgeni/check_puppetdb_nodes/pull/14) version of the [check_puppetdb_nodes](https://github.com/evgeni/check_puppetdb_nodes) Nagios check, now packaged inside the `tor-nagios-checks` Debian package

* another job runs on each Puppet node and will therefore work even if the Puppetmaster dies for some reason. This is done with the [check_puppet_agent](https://github.com/aswen/nagios-plugins/blob/master/check_puppet_agent) Nagios check, now also packaged inside the `tor-nagios-checks` Debian package

This was [implemented in March 2019](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29676). An alternative implementation [using Prometheus](https://forge.puppet.com/puppet/prometheus_reporter) was considered, but [Prometheus still hasn't replaced Nagios](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864) at the time of writing.

There are no validation checks and *a priori* no peer review of code: code is directly pushed to the Puppet server without validation. Work is being done to [implement automated checks](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31226) but that is only being deployed on some clients for now.

# Discussion

This section goes more in depth into how Puppet is set up, why it was set up the way it was, and how it could be improved.

## Overview

Our Puppet setup dates back to 2011, according to the git history, and was probably based on the [Debian System Administrator's Puppet codebase](https://salsa.debian.org/dsa-team/mirror/dsa-puppet), which dates back to 2009.

## Goals

The general goal of Puppet is to provide basic automation across the architecture, so that software installation and configuration, file distribution, user and some service management is done from a central location, managed in a git repository. This approach is often called [Infrastructure as code](https://en.wikipedia.org/wiki/Infrastructure_as_Code).

### Must have

TODO.

### Nice to have

TODO.

### Non-Goals

TODO.

## Approvals required

TPA should approve policy changes as per [tpa-rfc-1](/policy/tpa-rfc-1-policy).

## Proposed Solution

N/A.

## Cost

N/A.

## Alternatives considered

Ansible was considered for managing [GitLab](gitlab) for a while, but this was eventually abandoned in favor of using Puppet and the "Omnibus" package. For ad hoc jobs, [fabric](fabric) is being used.