TPA uses [Puppet](https://puppet.com/) to manage all servers it operates. It handles most of the configuration management of the base operating system and some services. It is *not* designed to handle ad-hoc tasks, for which we favor the use of [fabric](howto/fabric).

[[_TOC_]]

# Tutorial

This page is long! This first section hopes to get you running with a simple task quickly.

## Adding an IP address to the global allow list

In this tutorial, we will add an IP address to the global allow list, on all firewalls on all machines. This is a big deal! It will allow that IP address to access the SSH servers on all boxes and more. This should be a **static** IP address on a trusted network.

If you have never used Puppet before or are nervous at all about making such a change, it is a good idea to have a more experienced sysadmin nearby to help you. They can also confirm this tutorial is what is actually needed.

 1. To make any change on the Puppet server, you will first need to clone the git repository:

        git clone pauli.torproject.org:/srv/puppet.torproject.org/git/tor-puppet

    This only needs to be done once.

 2. The firewall rules are defined in the `ferm` module, which lives in `modules/ferm`. The file you specifically need to change is `modules/ferm/templates/defs.conf.erb`, so open that in your editor of choice:

        $EDITOR modules/ferm/templates/defs.conf.erb

 3. The code you are looking for is `ADMIN_IPS`. Add a `@def` for your IP address and add the new macro to the `ADMIN_IPS` macro. When you exit your editor, git should show you a diff that looks something like this:

        --- a/modules/ferm/templates/defs.conf.erb
        +++ b/modules/ferm/templates/defs.conf.erb
        @@ -77,7 +77,10 @@ def $TPO_NET = (<%= networks.join(' ') %>);
         @def $linus = ();
         @def $linus = ($linus 193.10.5.2/32); # kcmp@adbc
         @def $linus = ($linus 2001:6b0:8::2/128); # kcmp@adbc
        -@def $ADMIN_IPS = ($weasel $linus);
        +@def $anarcat = ();
        +@def $anarcat = ($anarcat 203.0.113.1/32); # home IP
        +@def $anarcat = ($anarcat 2001:DB8::DEAD/128 2001:DB8:F00F::/56); # home IPv6
        +@def $ADMIN_IPS = ($weasel $linus $anarcat);
         
         @def $BASE_SSH_ALLOWED = ();

 4. Then you can commit this and *push*:

        git commit -m'add my home address to the allow list' && git push

 5. Then you should log in to one of the hosts and make sure the code applies correctly:

        ssh -tt perdulce.torproject.org sudo puppet agent -t

Puppet shows colorful messages. If nothing is red and it returns correctly, you are done. If that doesn't work, go back to step 2. If you are still stuck, ask for help from a colleague in the Tor sysadmin team.

If this works, congratulations, you have made your first change across the entire Puppet infrastructure! You might want to look at the rest of the documentation to learn more about how to do different tasks and how things are set up. A key "How to" we recommend is the `Progressive deployment` section below, which will teach you how to make a change like the above while making sure you don't break anything, even if the change affects a lot of machines.

# How-to

## Modifying an existing configuration

For new deployments, this is *NOT* the preferred method. For example, if you are deploying new software that is not already in use in our infrastructure, do *not* follow this guide and instead follow the `Adding a new module` guide below.

If you are touching an *existing* configuration, however, things are much simpler: you go to the module where the code already exists and make changes.
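For example, a complete edit-and-deploy cycle could look like this (a sketch; the file path and hostname are illustrative):

    cd tor-puppet
    $EDITOR modules/ferm/templates/defs.conf.erb  # make your change
    git commit -a -m'explain why the change is needed'
    git push
    ssh -tt some-node.torproject.org sudo puppet agent -t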
In short: you `git commit` and `git push` the code, then immediately run `puppet agent -t` on the affected node.

Look at the `File layout` section below to find the right piece of code to modify. If you are making changes that potentially affect more than one host, you should also definitely look at the `Progressive deployment` section below.

## Adding a new module

This is a broad topic, but let's take the Prometheus monitoring system as an example, which followed the [role/profile/module][] pattern.

First, the [Prometheus modules on the Puppet forge][] were evaluated for quality and popularity. There was a clear winner there: the [Prometheus module][] from [Vox Populi][] had hundreds of thousands more downloads than the [next option][], which was deprecated.

[next option]: https://forge.puppet.com/brutus777/prometheus
[Vox Populi]: https://voxpupuli.org/
[Prometheus module]: https://forge.puppet.com/puppet/prometheus
[Prometheus modules on the Puppet forge]: https://forge.puppet.com/modules?q=prometheus

Next, the module was added to the Puppetfile (in `3rdparty/Puppetfile`):

    mod 'puppet-prometheus', '6.4.0'

... and librarian was run:

    librarian-puppet install

This fetched a lot of code from the Puppet forge: the stdlib, archive and system modules were all installed or updated. All those modules were audited manually, by reading each file and looking for obvious security flaws or back doors. Then the code was committed into git:

    git add 3rdparty
    git commit -m'install prometheus module after audit'

Then the module was configured in a profile, in `modules/profile/manifests/prometheus/server.pp`:

    class profile::prometheus::server {
      class { 'prometheus::server':
        # follow prom2 defaults
        localstorage      => '/var/lib/prometheus/metrics2',
        storage_retention => '15d',
      }
    }

The above contains our local configuration for the upstream `prometheus::server` class installed in the `3rdparty` directory. In particular, it sets a retention period and a different path for the metrics, so that they follow the new Prometheus 2.x defaults.

Then this profile was added to a *role*, in `modules/roles/manifests/monitoring.pp`:

    # the monitoring server
    class roles::monitoring {
      include profile::prometheus::server
    }

Notice how the role does not refer to any implementation detail, like the fact that the monitoring server uses Prometheus. It looks like a trivial, useless class, but it can actually grow to include *multiple* profiles.

Then that role was added to the Hiera configuration of the monitoring server, in `hiera/nodes/hetzner-nbg1-01.torproject.org.yaml`:

    classes:
      - roles::monitoring

And Puppet was run on the host, with:

    puppet agent --enable ; puppet agent -t --noop ; puppet agent --disable "testing prometheus deployment"

This led to some problems, as the upstream module doesn't support installing from Debian packages. Support for Debian was added to the code in `3rdparty/modules/prometheus`, and committed into git:

    emacs 3rdparty/modules/prometheus/manifests/*.pp # magic happens
    git commit -m'implement all the missing stuff' 3rdparty
    git push

And the above Puppet command line was run again, continuing that loop until things were good.

If you need to deploy the code to multiple hosts, see the `Progressive deployment` section below. To contribute changes back upstream (and you should do so), see the section right below.

## Contributing changes back upstream

For simple changes, the above workflow works well, but eventually it is preferable to actually fork the upstream repository and operate on our fork until the changes are merged upstream.
First, the modified module is moved out of the way:

    mv 3rdparty/modules/prometheus{,.orig}

The module is then forked on GitHub or wherever it is hosted, and added to the Puppetfile:

    mod 'puppet-prometheus',
        :git => 'https://github.com/anarcat/puppet-prometheus.git',
        :branch => 'deploy'

Then Librarian is run again to fetch that code:

    librarian-puppet install

Because Librarian is a little dumb, it might check out your module in "detached HEAD" mode, in which case you will want to fix the checkout:

    cd 3rdparty/modules/prometheus
    git checkout deploy
    git reset --hard origin/deploy
    git pull

Note that the `deploy` branch here is a merge of all the different branches proposed upstream in different pull requests, but it could also be the `master` branch or a single branch if only a single pull request was sent.

Since you now have a clone of the upstream repository, you can push and pull normally with upstream. When you make a change, however, you need to commit (and push) the change *both* in the sub-repository and the main repository:

    cd 3rdparty/modules/prometheus
    $EDITOR manifests/init.pp # more magic stuff
    git commit -m'change the frobatz to a argblu'
    git push
    cd ..
    git commit -m'change the frobatz to a argblu'
    git push

Often, I make commits directly in our main Puppet repository, without pushing to the third-party fork, until I am happy with the code, and then I craft a nice pretty commit that can be pushed upstream, reversing that process:

    $EDITOR 3rdparty/prometheus/manifests/init.pp # dirty magic stuff
    git commit -m'change the frobatz to a quuxblah'
    git push
    # see if that works, generally not
    git commit -m'rah. wanted a quuxblutz'
    git push
    # now we are good, update our pull request
    cd 3rdparty/modules/prometheus
    git commit -m'change the frobatz to a quuxblutz'
    git push

It's annoying to double-commit things, but I haven't found a better way to do so just yet. This problem is further discussed in [ticket #29387][].

Also note that when you update code like this, the `Puppetfile` does not change, but the `Puppetfile.lock` file *does* change. The `GIT.sha` parameter needs to be updated. This can be done by hand, but since that is error-prone, you might want to simply run this to update modules:

    librarian-puppet update

This will *also* update dependencies, so make sure you audit those changes before committing and pushing.

## Running tests

Ideally, Puppet modules have a test suite. This is done with [rspec-puppet](https://rspec-puppet.com/) and [rspec-puppet-facts](https://github.com/mcanevet/rspec-puppet-facts). This is not very well documented upstream, but it's apparently part of the [Puppet Development Kit](https://puppet.com/docs/pdk/1.x/pdk.html) (PDK). Anyway: assuming tests exist, you will want to run some tests before pushing your code upstream, or at least upstream might ask you for this before accepting your changes. Here's how to get set up:

    sudo apt install ruby-rspec-puppet ruby-puppetlabs-spec-helper ruby-bundler
    bundle install --path vendor/bundle

This installs some basic libraries system-wide (Ruby bundler and the rspec stuff). Unfortunately, the required Ruby code is rarely all present in Debian and you still need to install extra gems. In this case we set them up within the `vendor/bundle` directory to isolate them from the global search path.

Finally, to run the tests, you need to wrap your invocation with `bundle exec`, like so:

    bundle exec rake test

## Validating Puppet code

You SHOULD run validation checks locally before pushing your manifests; this can be automated with a git pre-commit hook.
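At a minimum, you can run the parser and style checks by hand (a sketch; `puppet parser validate` ships with Puppet, and `puppet-lint` is packaged separately in Debian):

    # check a manifest for syntax errors
    puppet parser validate modules/profile/manifests/prometheus/server.pp
    # check it against the Puppet style guide
    puppet-lint modules/profile/manifests/prometheus/server.pp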
To set up a pre-commit hook that runs such checks, you should clone this repository:

    git clone https://github.com/anarcat/puppet-git-hooks

... and deploy it as a pre-commit hook:

    ln -s $PWD/puppet-git-hooks tor-puppet/.git/hooks/pre-commit

A server-side validation hook hasn't been enabled yet because our manifests would sometimes fail and the hook was found to be somewhat slow. That is being worked on in [issue 31226][].

[issue 31226]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31226

## Listing all hosts under Puppet

This will list all active hosts known to the Puppet master:

    ssh -t pauli.torproject.org 'sudo -u postgres psql puppetdb -P pager=off -A -t -c "SELECT c.certname FROM certnames c WHERE c.deactivated IS NULL"'

The following will list all hosts under Puppet and their `virtual` value:

    ssh -t pauli.torproject.org "sudo -u postgres psql puppetdb -P pager=off -F',' -A -t -c \"SELECT c.certname, value_string FROM factsets fs INNER JOIN facts f ON f.factset_id = fs.id INNER JOIN fact_values fv ON fv.id = f.fact_value_id INNER JOIN fact_paths fp ON fp.id = f.fact_path_id INNER JOIN certnames c ON c.certname = fs.certname WHERE fp.name = 'virtual' AND c.deactivated IS NULL\"" | tee hosts.csv

The resulting file is a comma-separated values (CSV) file which can be used for other purposes later.

Possible values of the `virtual` field can be obtained with a similar query:

    ssh -t pauli.torproject.org "sudo -u postgres psql puppetdb -P pager=off -A -t -c \"SELECT DISTINCT value_string FROM factsets fs INNER JOIN facts f ON f.factset_id = fs.id INNER JOIN fact_values fv ON fv.id = f.fact_value_id INNER JOIN fact_paths fp ON fp.id = f.fact_path_id WHERE fp.name = 'virtual';\""

The currently known values are: `kvm`, `physical`, and `xenu`.

As a bonus, this query will show the number of hosts running each release:

    SELECT COUNT(c.certname), value_string FROM factsets fs INNER JOIN facts f ON f.factset_id = fs.id INNER JOIN fact_values fv ON fv.id = f.fact_value_id INNER JOIN fact_paths fp ON fp.id = f.fact_path_id INNER JOIN certnames c ON c.certname = fs.certname WHERE fp.name = 'lsbdistcodename' AND c.deactivated IS NULL GROUP BY value_string;

### Other ways of extracting a host list

 * Using the [PuppetDB API][]:

        curl -s -G http://localhost:8080/pdb/query/v4/facts | jq -r ".[].certname"

   The [fact API][] is quite extensive and allows for very complex queries.
   For example, this shows all hosts with the `apache2` fact set to `true`:

        curl -s -G http://localhost:8080/pdb/query/v4/facts --data-urlencode 'query=["and", ["=", "name", "apache2"], ["=", "value", true]]' | jq -r ".[].certname"

   This will list all hosts sorted by their report date, oldest first, followed by the timestamp, space-separated:

        curl -s -G http://localhost:8080/pdb/query/v4/nodes | jq -r 'sort_by(.report_timestamp) | .[] | "\(.certname) \(.report_timestamp)"' | column -s\ -t

   This will list all hosts with the `roles::static_mirror` class:

        curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=inventory[certname] { resources { type = "Class" and title = "Roles::Static_mirror" }} ' | jq .[].certname

   This will show all hosts running Debian buster:

        curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=nodes { facts { name = "lsbdistcodename" and value = "buster" }}' | jq .[].certname

 * Using [howto/cumin](howto/cumin)

 * Using LDAP:

        HOSTS=$(ssh alberti.torproject.org 'ldapsearch -h db.torproject.org -x -ZZ -b dc=torproject,dc=org -LLL "hostname=*.torproject.org" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort')
        for i in `echo $HOSTS`; do mkdir hosts/x-$i 2>/dev/null || continue; echo $i; ssh $i ' ...'; done

   The `mkdir` is there so that the same command can be run in many terminal windows while making sure each host is visited only once.

[PuppetDB API]: https://puppet.com/docs/puppetdb/4.3/api/index.html
[fact API]: https://puppet.com/docs/puppetdb/4.3/api/query/v4/facts.html

## Running Puppet everywhere

There are many ways to [run a command on all hosts (see next section)][], but the TL;DR is to basically use [cumin](howto/cumin) and run this command:

[run a command on all hosts (see next section)]: #batch-jobs-on-all-hosts

    cumin -o txt -b 5 '*' 'puppet agent -t'

But before doing this, consider doing a [progressive deployment](#progressive-deployment) instead.

## Batch jobs on all hosts

With the CSV file created above, a job can be run on all hosts with [parallel-ssh][]; for example, to check the `uptime`:

    cut -d, -f1 hosts.csv | parallel-ssh -i -h /dev/stdin uptime

This would do the same, but only on physical servers:

    grep 'physical$' hosts.csv | cut -d, -f1 | parallel-ssh -i -h /dev/stdin uptime

This would fetch the `/etc/motd` on all machines:

    cut -d, -f1 hosts.csv | parallel-slurp -h /dev/stdin -L motd /etc/motd motd

To run batch commands through `sudo` that require a password, you will need to fool both `sudo` and ssh a little more:

    cut -d, -f1 hosts.csv | parallel-ssh -P -I -i -x -tt -h /dev/stdin -o pvs sudo pvs

You should then type your password, then Control-d. Warning: this will show your password on your terminal and probably in the logs as well.

Batch jobs can also be run on all Puppet hosts with Cumin:

    ssh -N -L8080:localhost:8080 pauli.torproject.org &
    cumin '*' uptime

See [howto/cumin](howto/cumin) for more examples.

[parallel-ssh]: https://parallel-ssh.org/

## Progressive deployment

If you are making a major change to the infrastructure, you may want to deploy it progressively. A good way to do so is to include the new class manually in the node configuration, say in `hiera/nodes/$fqdn.yaml`:

    classes:
      - my_new_class

Then you can check the effect of the class on the host with the `--noop` mode.
Make sure you disable Puppet so that automatic runs do not actually execute the code:

    puppet agent --disable "testing my_new_class deployment"

Then the new manifest can be simulated with this command:

    puppet agent --enable ; puppet agent -t --noop ; puppet agent --disable "testing my_new_class deployment"

Examine the output and, once you are satisfied, you can re-enable the agent and actually run the manifest with:

    puppet agent --enable ; puppet agent -t

If the change is *inside* an existing class, that change can be enclosed in a class parameter and that parameter can be passed as an argument from Hiera. This is how the transition to a managed `/etc/apt/sources.list` file was done:

 1. First, a parameter was added to the class that would remove the file, defaulting to `false`:

        class torproject_org(
          Boolean $manage_sources_list = false,
        ) {
          if $manage_sources_list {
            # the above repositories overlap with most default sources.list
            file { '/etc/apt/sources.list':
              ensure => absent,
            }
          }
        }

 2. Then that parameter was enabled on one host, say in `hiera/nodes/brulloi.torproject.org.yaml`:

        torproject_org::manage_sources_list: true

 3. Puppet was run on that host using the simulation mode:

        puppet agent --enable ; puppet agent -t --noop ; puppet agent --disable "testing my_new_class deployment"

 4. When satisfied, the real operation was done:

        puppet agent --enable ; puppet agent -t

 5. Then this was added to two other hosts, and Puppet was run there as well.

 6. Finally, all hosts were checked to see if the file was present anywhere and had any content, with [howto/cumin](howto/cumin) (see above for alternative ways of running a command on all hosts):

        cumin '*' 'du /etc/apt/sources.list'

 7. Since it was missing everywhere, the parameter was set to `true` by default and the custom configuration was removed from the three test nodes.

 8. Then Puppet was run by hand everywhere, using Cumin, with a batch of 5 hosts at a time:

        cumin -o txt -b 5 '*' 'puppet agent -t'

    Because Puppet returns a non-zero value when changes are made, this will stop whenever any host in a batch of 5 actually makes a change. You can then examine the output, decide whether the change is legitimate, and either continue or abort the deployment.

## Troubleshooting

### Running Puppet by hand and logging

When a Puppet manifest is not behaving as it should, the first step is to run it by hand on the host:

    puppet agent -t

If that doesn't yield enough information, you can see pretty much everything that Puppet does with the `--debug` flag. This will, for example, include the `onlyif` commands of `Exec` resources and allow you to see why they do not work correctly (a common problem):

    puppet agent -t --debug

Finally, some errors show up only on the Puppet server: look in `/var/log/daemon.log` there for those.

### Finding exported resources with SQL queries

Connecting to the PuppetDB database itself can sometimes be easier than trying to operate the API. There you can inspect the entire thing as a normal SQL database; use this to connect:

    sudo -u postgres psql puppetdb

Exported resources sometimes do surprising things. It is useful to look at the actual PuppetDB to figure out which tags exported resources have.
For example, this query lists all exported resources with `troodi` in the name:

    SELECT certname_id,type,title,file,line,tags FROM catalog_resources WHERE exported = 't' AND title LIKE '%troodi%';

Keep in mind that there are [automatic tags](https://puppet.com/docs/puppet/6.4/lang_tags.html) in exported resources which can complicate things.

### Finding exported resources with PuppetDB

This query will look for exported resources with the `type` `Backupninja::Server::Account` (which can be a class, define, or builtin resource) and a `title` (the "name" of the resource as defined in the manifests) of `backup-blah@backup.koumbit.net`:

    curl -s -X POST http://localhost:8080/pdb/query/v4 \
      -H 'Content-Type:application/json' \
      -d '{"query": "resources { type = \"Backupninja::Server::Account\" and title = \"backup-blah@backup.koumbit.net\" }"}' \
      | jq . | less -SR

TODO: update the above query to match resources actually in use at TPO. That example is from the koumbit.org folks.

## Password management

If you need to set a password in a manifest, there are special functions to handle this. We do not want to store passwords directly in Puppet source code, for various reasons: it is hard to erase because code is stored in git, but also, ultimately, we want to publish that source code publicly.

We use [Trocla][] for this purpose, which generates random passwords and stores the hash or, if necessary, the clear-text in a YAML file. With Trocla, each password is generated on the fly from a secure entropy source ([Ruby's SecureRandom module][]) and stored inside a state file (in `/var/lib/trocla/trocla_data.yml`, configured in `/etc/puppet/troclarc.yaml`) on the Puppet master.

Trocla can return "hashed" versions of the passwords, so that the plain-text password is never visible from the client. The plain text can still be stored on the Puppet master, or it can be deleted once it's been transmitted to the user or another password manager. This makes it possible to have Trocla not keep any secret at all.

[Ruby's SecureRandom module]: https://ruby-doc.org/stdlib-1.9.3/libdoc/securerandom/rdoc/SecureRandom.html
[Trocla]: https://github.com/duritong/trocla

This piece of code will generate a [bcrypt][]-hashed password for the Grafana admin, for example:

    $grafana_admin_password = trocla('grafana_admin_password', 'bcrypt')

The plain text for that password will never leave the Puppet master. It will still be stored on the Puppet master, and you can see the value with:

    trocla get grafana_admin_password plain

... on the command line.

[bcrypt]: https://en.wikipedia.org/wiki/Bcrypt

A password can also be set with this command:

    trocla set grafana_guest_password plain

Note that this might *erase* other formats for this password, although those will get regenerated as needed. Also note that `trocla get` will fail if the particular password or format requested does not exist. For example, say you generate a plain-text password and then request the `bcrypt` version:

    trocla create test plain
    trocla get test bcrypt

This will return the empty string instead of the hashed version. Instead, use `trocla create` to generate that password. In general, it's safe to use `trocla create`, as it will reuse existing passwords. That's actually how the `trocla()` function behaves in Puppet as well.

TODO: Trocla can provide passwords to classes transparently, without having to do function calls inside Puppet manifests. For example, this code:

    class profile::grafana {
      $password = trocla('profile::grafana::password', 'plain')
      # ...
    }

Could simply be expressed as:

    class profile::grafana(String $password) {
      # ...
    }

But this requires a few changes:

 1. Trocla needs to be included in Hiera
 2. We need roles to be more clearly defined in Hiera, and to use Hiera as an ENC, so that we can do (for example) per-role passwords, which is not currently possible

## Getting information from other nodes

A common pattern in Puppet is to deploy resources on a given host with information from another host. For example, you might want to grant access to host A from host B. And while you can hardcode host B's IP address in host A's manifest, it's not good practice: if host B's IP address changes, you need to change the manifest, and that practice makes it difficult to introduce host C into the pool...

So we need ways of having a node use information from other nodes in our Puppet manifests. There are five methods in our Puppet source code at the time of writing:

 * Exported resources
 * PuppetDB lookups
 * Puppet Query Language (PQL)
 * LDAP lookups
 * Hiera lookups

This section walks through how each method works, outlining the advantages and disadvantages of each.

### Exported resources

Our Puppet configuration supports [exported resources](https://puppet.com/docs/puppet/latest/lang_exported.html), a key component of complex Puppet deployments. Exported resources allow one host to define a configuration that will be *exported* to the Puppet server and then *realized* on another host. We commonly use this to punch holes in the firewall between nodes. For example, this manifest in the `roles::puppetmaster` class:

    @@ferm::rule::simple { "roles::puppetmaster-${::fqdn}":
      tag         => 'roles::puppetmaster',
      description => 'Allow Puppetmaster access to LDAP',
      port        => ['ldap', 'ldaps'],
      saddr       => $base::public_addresses,
    }

... exports a firewall rule that will, later, allow the Puppet server to access the LDAP server (hence the `port => ['ldap', 'ldaps']` line). This rule doesn't take effect on the host applying the `roles::puppetmaster` class, but only on the LDAP server, through this rather exotic syntax:

    Ferm::Rule::Simple <<| tag == 'roles::puppetmaster' |>>

This tells the LDAP server to apply whatever rule was exported with the `@@` syntax and the specified `tag`. Any Puppet resource can be exported and realized that way.

Note that there are security implications with collecting exported resources: it delegates part of a node's resource specification to another node. So, in the above scenario, the Puppet master could decide to open *other* ports on the LDAP server (say, the SSH port), because it exports the port number and the LDAP server just blindly applies the directive. A more secure specification would explicitly specify the sensitive information, like so:

    Ferm::Rule::Simple <<| tag == 'roles::puppetmaster' |>> {
      port => ['ldap'],
    }

But then a compromised server could still send a different `saddr`, and there's nothing the LDAP server could do here: it cannot override the address, because that is exactly the information we need from the other server...

### PuppetDB lookups

A common pattern in Puppet is to extract information from host A and use it on host B. The above "exported resources" pattern can do this for files, commands and many more resources, but sometimes we just want a tiny bit of information to embed in a configuration file. This could, in theory, be done with an exported [concat](https://forge.puppet.com/puppetlabs/concat) resource, but this can become prohibitively complicated for something as simple as an allowed IP address in a configuration file.
For this we use the [puppetdbquery module](https://github.com/dalen/puppet-puppetdbquery), which allows us to do elegant queries against PuppetDB. For example, this will extract the IP addresses of all nodes with the `roles::gitlab` class applied:

    $allow_ipv4 = query_nodes('Class[roles::gitlab]', 'networking.ip')
    $allow_ipv6 = query_nodes('Class[roles::gitlab]', 'networking.ip6')

This code, in `profile::kgb_bot`, propagates those addresses into a template through the `allow_addresses` variable, which gets expanded like this:

    <% if $allow_addresses { -%>
    <% $allow_addresses.each |String $address| { -%>
    allow <%= $address %>;
    <% } -%>
    deny all;
    <% } -%>

Note that there is a potential security issue with that approach. The same way that exported resources trust the exporter, we trust that the node exported the right fact. So it's in theory possible that a compromised Puppet node exports an evil IP address in the above example, granting access to an attacker instead of the proper node. If that is a concern, consider using LDAP or Hiera lookups instead.

Also note that this will eventually fail when the node goes down: after a while, resources are expired from the PuppetDB server and the above query will return an empty list. This seems reasonable: we do want to eventually revoke access to nodes that go away, but it's still something to keep in mind.

Keep in mind that the `networking.ip` fact, in the above example, might be incorrect in the case of a host that's behind NAT. In that case, you should use LDAP or Hiera lookups.

Note that this could also be implemented with a `concat` exported resource, but it would be much harder: you would need a special case for when no resource is exported (to avoid adding the `deny`) and take into account that other configurations might also be needed in the file. It would have the same security and expiry issues anyway.

### Puppet query language

Note that there's also a way to do those queries without a Forge module, through the [Puppet query language](https://puppet.com/docs/puppetdb/5.2/api/query/tutorial-pql.html) and the `puppetdb_query` function. The problem with that approach is that the function is not very well documented and the query syntax is somewhat obtuse. For example, this is what I came up with to do the equivalent of the `query_nodes` call above:

    $allow_ipv4 = puppetdb_query(
      ['from', 'facts',
        ['and',
          ['=', 'name', 'networking.ip'],
          ['in', 'certname',
            ['extract', 'certname',
              ['select_resources',
                ['and',
                  ['=', 'type', 'Class'],
                  ['=', 'title', 'roles::gitlab']]]]]]])

It seems like I did something wrong, because that returned an empty array. I could not figure out how to debug this, and apparently I needed more functions (like `map` and `filter`) to get what I wanted (see [this gist](https://gist.github.com/bastelfreak/b9620fa1892ebcc659c442b115db34f9)). I gave up at that point: the `puppetdbquery` abstraction is much cleaner and more usable.

If you are merely looking for a hostname, however, PQL might be a little more manageable.
For example, this is how the `roles::onionoo_frontend` class finds its backends to set up the [IPsec](ipsec) network:

    $query = 'nodes[certname] { resources { type = "Class" and title = "Roles::Onionoo_backend" } }'
    $peer_names = sort(puppetdb_query($query).map |$value| { $value["certname"] })
    $peer_names.each |$peer_name| {
      $network_tag = [$::fqdn, $peer_name].sort().join('::')
      ipsec::network { "ipsec::${network_tag}":
        peer_networks => $base::public_addresses
      }
    }

### LDAP lookups

Our Puppet server is hooked up to the LDAP server and has information about the hosts defined there. Information about the node running the manifest is available in the global `$nodeinfo` variable, but there is also an `$allnodeinfo` parameter with information about every host known in LDAP.

A simple example of how to use the `$nodeinfo` variable is how the `base::public_address` and `base::public_address6` parameters -- which represent the IPv4 and IPv6 public addresses of a node -- are initialized in the `base` class:

    class base(
      Stdlib::IP::Address $public_address = filter_ipv4(getfromhash($nodeinfo, 'ldap', 'ipHostNumber'))[0],
      Optional[Stdlib::IP::Address] $public_address6 = filter_ipv6(getfromhash($nodeinfo, 'ldap', 'ipHostNumber'))[0],
    ) {
      $public_addresses = [ $public_address, $public_address6 ].filter |$addr| { $addr != undef }
    }

This loads the `ipHostNumber` field from the `$nodeinfo` variable, and uses the `filter_ipv4` or `filter_ipv6` functions to extract the IPv4 or IPv6 addresses respectively.

A good example of the `$allnodeinfo` parameter is how the `roles::onionoo_frontend` class finds the IP addresses of its backends. After having loaded the host list from PuppetDB, it then uses the parameter to extract the IP address:

    $backends = $peer_names.map |$name| {
      [
        $name,
        $allnodeinfo[$name]['ipHostNumber'].filter |$a| { $a =~ Stdlib::IP::Address::V4 }[0]
      ]
    }.convert_to(Hash)

Such a lookup is considered more secure than going through PuppetDB, as LDAP is a trusted data source. It is also our source of truth for this data, at the time of writing.

### Hiera lookups

For more security-sensitive data, we should use a trusted data source to extract information about hosts. We do this through Hiera lookups, with the [lookup](https://puppet.com/docs/puppet/latest/function.html#lookup) function. A good example is how we populate the SSH public keys on all hosts, for the admin user. In the `profile::ssh` class, we do the following:

    $keys = lookup('profile::admins::keys', Data, 'hash')

This looks up the `profile::admins::keys` field in Hiera, which is a trusted source because it is under the control of the Puppet git repository. It refers to the following data structure in `hiera/common.yaml`:

    profile::admins::keys:
      anarcat:
        type: "ssh-rsa"
        pubkey: "AAAAB3[...]"

The key point with Hiera is that it's a "hierarchical" data structure, so each host can have its own override. So in theory, the above keys could be overridden per host. Similarly, the IP address information for each host could be stored in Hiera instead of LDAP. But in practice, we do not currently do this, and the per-host information is limited.

## Revoking and generating a new certificate for a host

Problems with the revocation procedures were discussed in [33587][] and [33446][].

[33587]: https://bugs.torproject.org/33587
[33446]: https://gitlab.torproject.org/legacy/trac/-/issues/33446#note_2349434

 1. Clean the certificate on the master:

        puppet cert clean host.torproject.org
 2. Clean the certificate on the client:

        find /var/lib/puppet/ssl -name host.torproject.org.pem -delete

 3. Then run the bootstrap script on the client from `tsa-misc/installer/puppet-bootstrap-client` and get a new checksum.

 4. Run `tpa-puppet-sign-client` on the master and pass it the checksum.

 5. Run `puppet agent -t` to get Puppet running on the client again.

## Pager playbook

### catalog run: PuppetDB warning: did not update since \[...\]

If you see an error like:

    Check last node runs from PuppetDB WARNING - cupani.torproject.org did not update since 2020-05-11T04:38:54.512Z

It may eventually be accompanied by the Puppet server reporting the same problem:

    Subject: ** PROBLEM Service Alert: pauli/puppet - all catalog runs is WARNING **
    [...]
    Check last node runs from PuppetDB WARNING - cupani.torproject.org did not update since 2020-05-11T04:38:54.512Z

One of the following is happening, in decreasing order of likelihood:

 1. the node's Puppet manifest has an error of some sort that makes it impossible to run the catalog
 2. the node is down and has failed to report since the last time specified
 3. the Puppet **server** is down and **all** nodes will fail to report in the same way (in which case a lot more warnings will show up, and other warnings about the server will come in)

The first situation will usually happen after someone pushed a commit introducing the error. We try to keep all manifests compiling all the time, and such errors should be fixed immediately. Look at the history of the Puppet source tree and try to identify the faulty commit. Reverting such a commit is acceptable to restore the service.

The second situation can happen if a node is in maintenance for an extended duration. Normally, the node will recover when it goes back online. If a node is to be permanently retired, it should be removed from Puppet, using the [host retirement procedures](howto/retire-a-host).

Finally, if the main Puppet **server** is down, it should definitely be brought back up. See disaster recovery, below.

In any case, running the Puppet agent on the affected node should give more information:

    ssh NODE puppet agent -t

### Problems pushing to the Puppet server

Normally, when you push new commits to the Puppet server, a hook runs and updates the working copy. But sometimes this fails with an error like:

    remote: error: unable to unlink old 'modules/ipsec/misc/config.yaml': Permission denied.

The problem, in such cases, is that the files in the `/etc/puppet/` checkout are not writable by your user. It could also happen that the repository itself (in `/srv/puppet.torproject.org/git/tor-puppet`) has permission issues.

This problem is described in [issue 29663][] and is due to someone before you not pushing properly. To fix the permissions, try:

    sudo chown -R root:adm /etc/puppet
    sudo chown :puppet /etc/puppet/secret
    sudo chmod -R g+rw /etc/puppet
    sudo chmod g-w /etc/puppet/secret

[issue 29663]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29663

A similar recipe could be applied to the git repository, as needed. Hopefully this will be resolved when we start deploying with a role account instead.

## Disaster recovery

Ideally, the main Puppet server would be deployable from Puppet bootstrap code and the [main installer](new-machine). But in practice, much of its configuration was done manually over the years and it MUST be restored from [backups](backup) in case of failure. This probably includes a restore of the [PostgreSQL](postgresql) database backing the PuppetDB server as well.
It's *possible* this step *could* be skipped in an emergency, because most of the information in PuppetDB is a cache of exported resources, reports and facts. But it could also break hosts and make converging the infrastructure impossible, as there might be dependency loops in exported resources.

In particular, the Puppet server needs access to the LDAP server, and that access is configured in Puppet. So if the Puppet server needs to be rebuilt from scratch, it will need to be manually allowed access to the LDAP server to compile its manifest. So it is strongly encouraged to restore the PuppetDB server database as well in case of disaster.

This also applies in case of an IP address change of the Puppet server, in which case access to the LDAP server needs to be manually granted before the configuration can run and converge. This is a known bootstrapping issue with the Puppet server and is further discussed in the [design section](#ldap-integration).

# Reference

This documents generally how things are set up.

## Installation

Setting up a new Puppet server from scratch is not supported or, to be more accurate, would be somewhat difficult. The server expects various external services to populate it with data, in particular:

 * it [fetches data from LDAP](#ldap-integration)
 * [Nagios generates the NRPE configuration](#nagios-integration)
 * the [letsencrypt repository manages the TLS certificates](#lets-encrypt-tls-certificates)

The auto-ca component is also deployed manually, and so are the git hooks, repositories and permissions. This needs to be documented, automated and improved. Ideally, it should be possible to install a new Puppet server from scratch using nothing but a Puppet bootstrap manifest; see [issue 30770][] and [issue 29387][], along with the [discussion about those improvements in this page](#proposed-solution), for details.

[issue 30770]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30770

## SLA

No formal SLA is defined. Puppet runs on a fairly slow `cron` job, so it doesn't have to be highly available right now. This could change in the future if we rely more on it for deployments.

## Design

The Puppet server and PuppetDB currently live on `pauli`. That server was set up in 2011 by weasel. It follows the configuration of the Debian Sysadmin (DSA) Puppet server, which has its source code available in the [dsa-puppet repository](https://salsa.debian.org/dsa-team/mirror/dsa-puppet/).

The service is maintained by TPA and manages *all* TPA-operated machines. Ideally, all services are managed by Puppet, but historically, only basic services were configured through Puppet, leaving service admins responsible for deploying their services on top of it. That tendency has shifted recently (~2020) with the deployment of the [GitLab](gitlab) service through Puppet, for example.

The source code to the Puppet manifests (see below for a glossary) is managed through git on a repository hosted directly on the Puppet server. Agents are deployed as part of the [install process](new-machine), and talk to the central server using a Puppet-specific certificate authority (CA).

As mentioned in the [installation section](#installation), the Puppet server assumes a few components (namely [LDAP](ldap), [Nagios](nagios), [Let's Encrypt](tls) and auto-ca) feed information into it. This is also detailed in the sections below. In particular, Puppet acts as a duplicate "source of truth" for some information about servers.
For example, LDAP has a "purpose" field describing what a server is for, but Puppet also has the concept of a role, attributed through Hiera (see [issue 30273][]). A similar problem exists with IP addresses and user access control, in general.

[issue 30273]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30273

Puppet is generally considered stable, but the code base is somewhat showing its age and has accumulated some technical debt. For example, much of the Puppet code deployed is specific to Tor (and DSA, to a certain extent) and is therefore maintained by only a handful of people. It would be preferable to migrate to third-party, externally maintained modules (e.g. [systemd](https://gitlab.torproject.org/tpo/tpa/team/-/issues/33449), but also many others, see [issue 29387][] for details). A similar problem exists with custom Ruby code implemented for various functions, which is being replaced with Hiera ([issue 30020][]). The Puppet infrastructure is kept up to date with the latest versions in Debian, but it will require some work to port to Puppet 6, as the current deployment system ("puppetmaster") has been removed in that new release (see [issue 33588][]).

[issue 33588]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/33588

### Glossary

This is a subset of the [Puppet glossary](https://puppet.com/docs/puppet/latest/glossary.html) to quickly get you started with the vocabulary used in this document.

 * **Puppet node**: a machine (virtual or physical) running Puppet
 * **Manifest**: Puppet source code
 * **Catalog**: a compiled set of Puppet source code which gets applied on a **node** by a **Puppet agent**
 * **Puppet agents**: the Puppet program that runs on all nodes to apply manifests
 * **Puppet server**: the server which all **agents** connect to to fetch their **catalog**, also known as a **Puppet master** in older Puppet versions (pre-6)
 * **Facts**: information collected by Puppet agents on nodes, and exported to the Puppet server
 * **Reports**: log of changes done on nodes recorded by the Puppet server
 * **[PuppetDB](https://puppet.com/docs/puppetdb/) server**: an application server on top of a PostgreSQL database providing an [API](https://puppet.com/docs/puppetdb/5.2/api/index.html) to query various resources like node names, facts, reports and so on

### File layout

The Puppet server and PuppetDB server run on `pauli.torproject.org`. That is where the main git repository (`tor-puppet`) lives, in `/srv/puppet.torproject.org/git/tor-puppet`. That repository has hooks to populate `/etc/puppet`, which is the live checkout from which the Puppet server compiles its catalogs.

All paths below are relative to the root of that git repository.

 - `3rdparty/modules` includes modules that are shared publicly and do not contain any TPO-specific configuration. There is a `Puppetfile` there that documents where each module comes from and that can be maintained with [r10k][] or [librarian][].

   [librarian]: https://librarian-puppet.com/
   [r10k]: https://github.com/puppetlabs/r10k/

 - `modules` includes roles, profiles, and classes that make up the bulk of our configuration.

 - each node is assigned a "role" through Hiera, in `hiera/nodes/$FQDN.yaml`.

   To be more accurate, Hiera assigns a Puppet class to each node, although each node should have only one special-purpose class, a "role"; see [issue 40030][] for progress on that transition.
[issue 40030]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40030

 - The `torproject_org` module (`modules/torproject_org/manifests/init.pp`) performs basic host initialisation, like configuring Debian mirrors and APT sources, installing a base set of packages, configuring Puppet and the timezone, setting up a bunch of configuration files and running `ud-replicate`.

 - There is also the `hoster.yaml` file (`modules/torproject_org/misc/hoster.yaml`) which defines hosting providers and specifies things like which network blocks they use, and whether they have a DNS resolver or a Debian mirror. `hoster.yaml` is read by:

   - the `nodeinfo()` function (`modules/puppetmaster/lib/puppet/parser/functions/nodeinfo.rb`), used for setting up the `$nodeinfo` variable
   - `ferm`'s `defs.conf` template (`modules/ferm/templates/defs.conf.erb`)

 - The root of definitions and execution in Puppet is found in the `manifests/site.pp` file, but this file is now mostly empty, in favor of Hiera.

Note that the above is the current state of the file hierarchy. As part of the Hiera transition ([issue 30020][]), a lot of the above architecture will change in favor of the more standard [role/profile/module][] pattern.

Note that this layout might also change in the future with the introduction of a role account ([issue 29663][]) and when/if the repository is made public (which requires changing the layout). See [ticket #29387][] for an in-depth discussion.

[issue 29387]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387
[role/profile/module]: https://puppet.com/docs/pe/2017.2/r_n_p_intro.html
[ticket #29387]: https://bugs.torproject.org/29387
[issue 30020]: https://bugs.torproject.org/30020

### Installed packages facts

The `modules/torproject_org/lib/facter/software.rb` file defines our custom facts, making it possible to get answers to questions like "Is this host running `apache2`?" by simply looking at a Puppet variable. Those facts are deprecated: we should install packages through Puppet instead of installing them manually on hosts.

### Style guide

Puppet manifests should generally follow the [Puppet style guide][]. This can be easily done with [Flycheck][] in Emacs, [vim-puppet][], or a similar plugin in your favorite text editor.

Many files do not *currently* follow the style guide, as they *predate* the creation of said guide. Files should *not* be completely reformatted unless there's a good reason. For example, if a conditional covering a large part of a file is removed and the file needs to be re-indented, it's a good opportunity to fix style in the file. Same if a file is split in two components or for some other reason completely rewritten. Otherwise the style already in use in the file should be followed.

[Puppet style guide]: https://puppet.com/docs/puppet/4.8/style_guide.html
[Flycheck]: http://flycheck.org/
[vim-puppet]: https://github.com/rodjek/vim-puppet

### Hiera

[Hiera][] is a "key/value lookup tool for configuration data" which Puppet uses to look up values for class parameters and node configuration in general.

We are in the process of transitioning over to this mechanism from our previous custom YAML lookup system. This documents the way we currently use Hiera.

[Hiera]: https://puppet.com/docs/hiera/

#### Classes definitions

Each host declares which class it should include through a `classes` parameter. For example, this is what configures a Prometheus server:

    classes:
      - roles::monitoring

Roles should be *abstract* and *not* implementation specific.
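Concretely, a role class could look like this sketch (mirroring the `roles::monitoring` example from the `Adding a new module` section above, with the profiles mentioned below):

    # modules/roles/manifests/monitoring.pp
    class roles::monitoring {
      include profile::prometheus::server
      include profile::grafana
    }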
Each role includes a set of profiles which *are* implementation specific. For example, the `monitoring` role includes `profile::prometheus::server` and `profile::grafana`.

Do *not* include profiles directly from Hiera. As a temporary exception to this rule, old modules can be included as we transition from the `has_role` mechanism to Hiera, but eventually those should be ported to shared modules from the Puppet forge, with our glue built into a profile on top of the third-party module. The role `roles::monitoring` follows that pattern correctly. See [issue 40030][] for progress on that work.

#### Node configuration

On top of the host configuration, some node-specific configuration can be performed from Hiera. This should be avoided as much as possible, but sometimes there is just no other way. A good example was the `build-arm-*` nodes, which included the following configuration:

    bacula::client::ensure: "absent"

This disables backups on those machines, which are normally configured everywhere. This was done because those machines are behind a firewall and therefore not reachable, an unusual condition in the network.

Another example is `nutans`, which sits behind a NAT, so it doesn't know its own IP address. To export proper firewall rules, the allow address has been overridden as such:

    bind::secondary::allow_address: 89.45.235.22

Those types of parameters are normally guessed automatically inside modules' classes, but they can be overridden from Hiera.

Note: eventually *all* host configuration will be done here, but there are currently still some configurations hardcoded in individual modules. For example, the Bacula director is hardcoded in the `bacula` base class (in `modules/bacula/manifests/init.pp`). That should be moved into a class parameter, probably in `common.yaml`.

### Cron and scheduling

The Puppet agent is *not* running as a daemon, it's running through good old `cron`. Puppet runs on each node every four hours, although with a random two-hour jitter, so the actual frequency is somewhere between four and six hours. This configuration is in `/etc/cron.d/puppet-crontab` and deployed by Puppet itself, currently as part of the `torproject_org` module.
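Such an entry could look something like this (an illustrative sketch only; the actual file is generated by Puppet and may differ):

    # /etc/cron.d/puppet-crontab (illustrative)
    # run every four hours, delayed by up to two hours of random splay
    0 */4 * * * root puppet agent --onetime --no-daemonize --splay --splaylimit 2h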
### LDAP integration

The Puppet server is configured to talk to the LDAP server through a few custom functions defined in `modules/puppetmaster/lib/puppet/parser/functions`. The main plumbing function is called `ldapinfo()` and connects to the LDAP server through `db.torproject.org` over TLS on port 636. It takes a hostname as an argument and will load all hosts matching that pattern under the `ou=hosts,dc=torproject,dc=org` subtree. If the specified hostname is the `*` wildcard, the result will be a hash of `host => hash` entries; otherwise only the `hash` describing the provided host will be returned.

The `nodeinfo()` function uses that function to populate the globally available `$nodeinfo` hash, or, more specifically, the `$nodeinfo['ldap']` component. It also loads the `$nodeinfo['hoster']` value from the `whohosts()` function. That function, in turn, tries to match the IP address of the host against the "hosters" defined in the `hoster.yaml` file.

The `allnodeinfo()` function does a similar task as `nodeinfo()`, except that it loads *all* nodes from LDAP, into a single hash. It does *not* include the "hoster" and is therefore equivalent to calling `nodeinfo()` on each host and extracting only the `ldap` member hash (although it is not implemented that way).

Puppet does not require any special credentials to access the LDAP server. It accesses the LDAP database anonymously, although there is a firewall rule (defined in Puppet) that grants it access to the LDAP server.

There is a bootstrapping problem there: if one were to rebuild the Puppet server, it would actually fail to compile its catalog, because it would not be able to connect to the LDAP server, unless the LDAP server has been manually configured to let the Puppet server through.

NOTE: much (if not all?) of this is being moved into Hiera, in particular the YAML files. See [issue 30020](https://trac.torproject.org/projects/tor/ticket/30020) for details. Moving the host information into Hiera would resolve the bootstrapping issues, but it would in turn require some more work to resolve questions like how users get granted access to individual hosts, which is currently managed by `ud-ldap`. We cannot, therefore, simply move host information from LDAP into Hiera without creating a duplicate source of truth, unless we rebuild or tweak the user distribution system.

See also the [LDAP design document](ldap#Design) for more information about how LDAP works.

### Nagios integration

Nagios (which is really Icinga, but let's call it Nagios because that's what it's called everywhere in the source) is hooked into Puppet through an external sync system. Our [Nagios deployment](nagios) operates through git hooks which run a special `Makefile` that compiles and deploys the Icinga configuration, but also compiles the client-side NRPE configuration.

The NRPE configuration is generated on the Nagios server and then pushed to the Puppet server with `rsync` over SSH, using a public key distributed by Puppet from the `roles::puppetmaster` class. That key has a restricted `command` field which limits access to the Puppet manifest, in this single file:

    /etc/puppet/modules/nagios/files/tor-nagios/generated/nrpe_tor.cfg

This file then gets distributed to all nodes through the `nagios::client` class using a simple `File` resource. So when a Nagios check is added or changed, Puppet needs to run on all the affected hosts for the check to take effect, on top of, of course, adding the check to the Nagios git repository.

### Let's Encrypt TLS certificates

Public TLS certificates, as issued by Let's Encrypt, are distributed by Puppet. Those certificates are generated by the "letsencrypt" git repository (see the [TLS documentation](tls) for details on that workflow). The relevant part, as far as Puppet is concerned, is that certificates magically end up in the following directory when a certificate is issued or (automatically) renewed:

    /srv/puppet.torproject.org/from-letsencrypt

See also the [TLS deployment docs](tls#lets-encrypt-workflow) for how that directory gets populated.

Normally, those files would not be available from the Puppet manifests, but the `ssl` Puppet module uses a special trick whereby those files are read by Puppet `.erb` templates. For example, this is how `.crt` files get generated on the Puppet master, in `modules/ssl/templates/crt.erb`:

    <%=
      fn = "/srv/puppet.torproject.org/from-letsencrypt/#{@name}.crt"
      out = File.read(fn)
      out
    %>

Similar templates exist for the other files.

Those certificates should not be confused with the "auto-ca" TLS certificates in use internally, which are deployed directly in `/etc/puppet/modules/ssl/files/`; see below.

### Internal auto-ca TLS certificates

The Puppet server also manages an internal CA which we informally call "auto-ca".
Those certificates are internal in that they are used to authenticate nodes to each other, not to the public. They are used, for example, to encrypt connections between mail servers (in Postfix) and [backup servers](backup) (in Bacula).

The auto-ca deploys those certificates directly inside the Puppet server checkout, in `/etc/puppet/modules/ssl/files/certs/` and `.../clientcerts/`. Details of that system are available in the [TLS documentation](tls#internal-auto-ca).

## Issues

There is no issue tracker specifically for this project; [File][] or [search][] for issues in the [team issue tracker][search].

[File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues

## Monitoring and testing

Puppet is hooked into Nagios in two ways:

 * one job runs on the Puppetmaster and checks PuppetDB for reports. This was done with a [patched](https://github.com/evgeni/check_puppetdb_nodes/pull/14) version of the [check_puppetdb_nodes](https://github.com/evgeni/check_puppetdb_nodes/) Nagios check, now packaged inside the `tor-nagios-checks` Debian package
 * another job runs on each Puppet node and will therefore work even if the Puppetmaster dies for some reason. This is done with the [check_puppet_agent](https://github.com/aswen/nagios-plugins/blob/master/check_puppet_agent) Nagios check, now also packaged inside the `tor-nagios-checks` Debian package

This was [implemented in March 2019](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29676). An alternative implementation [using Prometheus](https://forge.puppet.com/puppet/prometheus_reporter) was considered, but [Prometheus still hasn't replaced Nagios](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864) at the time of writing.

There are no validation checks and *a priori* no peer review of code: code is pushed directly to the Puppet server without validation. Work is being done to [implement automated checks](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31226) but that is only being deployed on some clients for now.

## Logs and metrics

PuppetDB itself holds performance information about the Puppet agent runs, which are called "reports". Those reports contain information about changes operated on each server, how long the agent runs take, and so on. Those metrics could be made more visible by using a dashboard, but that has not been implemented yet (see [issue 31969][]).

[issue 31969]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31969
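Those reports can also be inspected by hand through the PuppetDB API, following the same pattern as the host-listing examples above (a sketch; the hostname is illustrative, and the query runs against the PuppetDB port, for example over the SSH tunnel shown earlier):

    curl -s -G http://localhost:8080/pdb/query/v4/reports \
        --data-urlencode 'query=["=", "certname", "perdulce.torproject.org"]' \
        | jq -r '.[] | "\(.end_time) \(.status)"'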
The Puppet server, Puppet agents and PuppetDB also keep logs of their operations. The latter keeps its logs in `/var/log/puppetdb/` for a maximum of 90 days or 1GB, whichever comes first (configured in `/etc/puppetdb/request-logging.xml` and `/etc/puppetdb/logback.xml`). The other logs are sent to `syslog` and usually end up in `daemon.log`.

Puppet should hold minimal personally identifiable information, like user names, user public keys and project names.

## Other documentation

 * [Latest Puppet docs](https://puppet.com/docs/puppet/latest/puppet_index.html) - might be too new, see also the [Puppet 5.5 docs](https://puppet.com/docs/puppet/5.5/puppet_index.html)
 * [Function reference](https://puppet.com/docs/puppet/latest/function.html)
 * [Type reference](https://puppet.com/docs/puppet/latest/type.html)
 * [Mapping between versions of Puppet Enterprise, Facter, Hiera, Agent, etc.](https://puppet.com/docs/pe/2019.0/component_versions_in_recent_pe_releases.html)

# Discussion

This section goes more in depth into how Puppet is set up, why it was set up the way it was, and how it could be improved.

## Overview

Our Puppet setup dates back to 2011, according to the git history, and was probably based on the [Debian System Administrators' Puppet codebase](https://salsa.debian.org/dsa-team/mirror/dsa-puppet), which dates back to 2009.

## Goals

The general goal of Puppet is to provide basic automation across the architecture, so that software installation and configuration, file distribution, user and some service management is done from a central location, managed in a git repository. This approach is often called [Infrastructure as code](https://en.wikipedia.org/wiki/Infrastructure_as_Code).

This section also documents possible improvements to our Puppet configuration that we are considering.

### Must have

 * **secure**: only sysadmins should have access to push configuration, whatever happens. This includes deploying only audited and verified Puppet code into production.
 * **code review**: changes on servers should be verifiable by our peers, through a git commit log
 * **fix permissions issues**: the deployment system should allow all admins to push code to the Puppet server without having to constantly fix permissions (e.g. through a [role account](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29663))
 * **secrets handling**: there are some secrets in Puppet. Those should remain secret.

We mostly have this now, although there are concerns about permissions sometimes being wrong, which a role account could fix.

### Nice to have

Those are mostly issues with the current architecture we'd like to fix:

 * **Continuous Integration**: before deployment, code should be vetted by a peer and, ideally, automatically checked for errors and tested
 * **single source of truth**: when we add/remove nodes, we should not have to talk to multiple services (see also the [install automation ticket](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31239) and the [new-machine discussion](new-machine#discussion))
 * **collaboration** with other sysadmins outside of TPA, for which we would need to...
 * ... **publicize our code** (see [ticket 29387](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387))
 * **no manual changes**: every change on every server should be committed to version control somewhere
 * **bare-metal recovery**: it should be possible to recover a service's *configuration* from a bare Debian install with Puppet (and with data from the [backup](backup) service, of course...)
 * **one commit only**: we shouldn't have to commit "twice" to get changes propagated (once in a submodule, once in the parent module, for example)

### Non-Goals

 * **ad hoc changes** to the infrastructure. One-off jobs should be handled by [fabric](fabric), Cumin, or straight SSH.

## Approvals required

TPA should approve policy changes as per [tpa-rfc-1](/policy/tpa-rfc-1-policy).

## Proposed Solution

To improve on the above "Goals", I would suggest the following configuration. TL;DR:
1. Use a control repository
2. Get rid of 3rdparty
3. Deploy with g10k
4. Authenticate code with checksums
5. Deploy to branch-specific environments
6. Rename the default branch "production"
7. Push directly to the Puppet server
8. Use a role account
9. Use local test environments
10. Develop a test suite
11. Hook into CI
12. OpenPGP verification and web hook

Steps 1-8 could be implemented without too much difficulty and should be a mid-term objective. Steps 9 to 12 require significantly more work and could be implemented once the new infrastructure stabilizes. What follows is an explanation and justification of each step.

### Use a control repository

The base of the infrastructure is a [control-repo](https://puppet.com/docs/pe/latest/control_repo.html) ([example](https://github.com/puppetlabs/control-repo), [another more complex example](https://github.com/example42/psick)) which chain-loads all the other modules. This implies turning all our "modules" into "profiles" and moving "real" modules (which are fit for public consumption) "outside", into public repositories (see also [issue 29387: publish our puppet repository](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387)).

Note that the control repository *could* also be public: we could simply keep the private data inside of Hiera or some other private repository.

The control repository concept is specific to the proprietary version of Puppet (Puppet Enterprise or PE) but its logic should be usable with the open source Puppet release as well.

### Get rid of 3rdparty

The control repo's core configuration file is the `Puppetfile`. We already use a Puppetfile, but only to manage modules inside of the `3rdparty` directory. Now it would manage *all* modules; more specifically, `3rdparty` would become the default `modules` directory, which would, incidentally, encourage us to upstream our modules and publish them to the world.

Our current `modules` directory would move into `site-modules`, which is the designated location for "roles, profiles, and custom modules". This has been suggested before in [issue 29387: publish our puppet repository](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387) and is important for the `Puppetfile` to do its job.

### Deploy with g10k

It seems clear that everyone is converging on the use of a `Puppetfile` to deploy code. There are still monorepos out there, but they do make our life harder, especially when we need to operate on non-custom modules.

Instead, we should converge towards *not* tracking upstream modules in our git repository. Modules managed by the `Puppetfile` would *not* be managed in our git monorepo and, instead, would be deployed by `r10k` or `g10k` (most likely the latter because of its support for checksums).

Note that neither `r10k` nor `g10k` resolves dependencies in a `Puppetfile`. We therefore also need a tool to verify the file correctly lists all required modules.
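To make this more concrete, here is a sketch of what the control repository's `Puppetfile` could look like under this scheme. The module names, versions, repository URL, and commit id below are purely illustrative, not a final list:

    # third-party modules from the Puppet Forge, pinned to a release
    mod 'puppetlabs/stdlib', '6.3.0'
    mod 'puppet/prometheus', '6.4.0'

    # one of our own published modules, pinned to an explicit commit
    mod 'ssh',
      :git => 'https://gitlab.torproject.org/tpo/tpa/puppet-ssh.git',  # hypothetical URL
      :ref => '3f1c8e7'                                                # hypothetical commit id

Pinning git modules to an explicit commit is also what makes the checksum-based authentication discussed below workable.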
The following solutions need to be validated but could address the dependency verification issue:

* [generate-puppetfile](https://github.com/rnelson0/puppet-generate-puppetfile): take a `Puppetfile` and walk the dependency tree, generating a new `Puppetfile` (see also [this introduction to the project](https://rnelson0.com/2015/11/06/introducing-generate-puppetfile-or-creating-a-ruby-program-to-update-your-puppetfile-and-fixtures-yml/))
* [Puppetfile-updater](https://github.com/camptocamp/puppetfile-updater): read the `Puppetfile` and fetch new releases
* [ra10ke](https://github.com/voxpupuli/ra10ke): a bunch of Rake tasks to validate a `Puppetfile`
  * `r10k:syntax`: syntax check, see also `r10k puppetfile check`
  * `r10k:dependencies`: check for out of date dependencies
  * `r10k:solve_dependencies`: check for **missing** dependencies
  * `r10k:install`: wrapper around `r10k` to install with some caveats
  * `r10k:validate`: make sure modules are accessible
  * `r10k:duplicates`: look for duplicate declarations
* [lp2r10k](https://github.com/dharmabruce/lp2r10k/): convert a "librarian" `Puppetfile` (missing dependencies) into a "r10k" `Puppetfile` (with dependencies)

Note that this list comes from the [updating your Puppetfile](https://github.com/puppetlabs/r10k/blob/master/doc/updating-your-puppetfile.mkd#automatic-updates) documentation in the r10k project, which is also relevant here.

### Authenticate code with checksums

This part is the main problem with moving away from a monorepo. By using a monorepo, we can audit the code we push into production. But if we offload this to `r10k`, it can download code from wherever the `Puppetfile` says, effectively shifting our trust path from OpenSSH to HTTPS, the Puppet Forge, git, and whatever remote gets added to the `Puppetfile`.

There is no obvious solution for this right now, surprisingly. Here are two possible alternatives:

1. [g10k](https://github.com/xorpaul/g10k/) supports using a `:sha256sum` parameter to checksum modules, but that only works for Forge modules. Maybe we could pair this with using an explicit `sha1` reference for git repositories, ensuring those are checksummed as well. The downside of that approach is that it leaves checked-out git repositories in a "detached head" state.
2. `r10k` has a [pending pull request](https://github.com/puppetlabs/r10k/pull/823) to add a `filter_command` directive which could run after a git checkout has been performed. It could presumably be used to verify OpenPGP signatures on git commits, although this would work only on modules we sign commits on (and therefore not third-party ones).

It seems the best approach would be to use g10k for now, with checksums on both git commits and Forge modules. A validation hook running *before* g10k *could* validate that all `mod` lines have a checksum of some sort...

Note that this approach does *NOT* solve the "double-commit" problem identified in the Goals. It is believed that only a "monorepo" would fix that problem, and that approach comes in direct conflict with the "collaboration" requirement. We chose the latter.

This could be implemented as a patch to `ra10ke`.

### Deploy to branch-specific environments

A key feature of r10k (and, of course, g10k) is that they are capable of deploying code to new environments depending on the branch we're working on. We would enable that feature to allow testing some large changes to critical code paths without affecting all servers.
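For example, testing a risky change could look like this; the branch, file, and host names below are hypothetical, but both r10k and g10k map each branch of the control repository to a Puppet environment of the same name:

    # on a clone of the control repository
    git checkout -b risky-change
    $EDITOR site-modules/profile/manifests/ssh.pp
    git commit -a -m'tighten ssh configuration' && git push origin risky-change

    # then run the agent against the new environment on a single test box
    ssh -tt test-01.torproject.org \
        sudo puppet agent -t --environment risky-change

Once the result looks good, the branch is merged into the default branch and the rest of the fleet picks up the change on its next Puppet run.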
### Rename the default branch "production"

In accordance with Puppet's best practices, the control repository's default branch would be called "production" and not "master".

Also: Black Lives Matter.

### Push directly to the Puppet server

Because we are worried about the GitLab attack surface, we could still keep pushing to the Puppet server for now. The control repository could be mirrored to GitLab using a deploy key. All other repositories would be published on GitLab anyway, and there the attack surface would not matter because of the checksums in the control repository.

### Use a role account

To avoid permission issues, use a role account (say `git`) to accept pushes and enforce git hooks.

### Use local test environments

It should eventually be possible to test changes locally before pushing to production. This would involve radically simplifying the Puppet server configuration and probably either getting rid of the LDAP integration or at least making it optional so that changes can be tested without it. This would involve "puppetizing" the Puppet server configuration so that a Puppet server and test agent(s) could be bootstrapped automatically. Operators would run "smoke tests" (running Puppet by hand and looking at the result) to make sure their code works before pushing to production.

### Develop a test suite

The next step is to start working on a test suite for services, at least for new deployments, so that code can be tested without running things by hand. Plenty of Puppet modules have such a test suite, generally using [rspec-puppet](https://rspec-puppet.com/) and [rspec-puppet-facts](https://github.com/mcanevet/rspec-puppet-facts), and we already have a few modules in `3rdparty` that have such tests. The idea would be to have those tests on a per-role or per-profile basis (a minimal sketch of such a test is included at the end of this section). The Foreman people have published [their test infrastructure](https://github.com/theforeman/foreman-infra/tree/master/puppet) which could be useful as inspiration for our purposes here.

### Hook into continuous integration

Once tests are functional, the last step is to move the control repository into GitLab directly and start running CI against the Puppet code base. This would probably not happen until GitLab CI is deployed, and would require lots of work to get there, but would eventually be worth it. The GitLab CI would be advisory: an operator would first push to a topic branch there to confirm tests pass, but would still push directly to the Puppet server for production.

Note that we are working on (client-side) validation hooks for now, see [issue 31226][].

[issue 31226]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31226

### OpenPGP verification and web hook

To stop pushing directly to the Puppet server, we could implement OpenPGP verification on the control repository. If a hook checks that commits are signed by a trusted party, it does not matter where the code is hosted.

A good reference for OpenPGP verification is [this guix article](https://guix.gnu.org/blog/2020/securing-updates/) which covers a few scenarios and establishes a pretty solid verification workflow. There's also a larger project-wide discussion in [GitLab](howto/gitlab) [issue 81](https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/81).

We could use the [webhook](https://github.com/voxpupuli/puppet_webhook) system to have GitLab notify the Puppet server to pull code.

## Cost

N/A.
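Here is the minimal test sketch promised in the "Develop a test suite" section above. It assumes a standard rspec-puppet layout (a `spec_helper`, module metadata for `on_supported_os` from rspec-puppet-facts) and uses the monitoring role as an example; none of this code exists yet:

    # spec/classes/monitoring_spec.rb
    require 'spec_helper'

    describe 'roles::monitoring' do
      on_supported_os.each do |os, os_facts|
        context "on #{os}" do
          let(:facts) { os_facts }

          # the catalog should at least compile, with all dependencies
          it { is_expected.to compile.with_all_deps }

          # and the role should pull in the profile it is made of
          it { is_expected.to contain_class('profile::prometheus::server') }
        end
      end
    end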
## Alternatives considered

Ansible was considered for managing [GitLab](gitlab) for a while, but this was eventually abandoned in favor of using Puppet and the "Omnibus" package. For ad hoc jobs, [fabric](fabric) is being used.

For code management, I have done a more extensive review of possible alternatives. [This talk](https://www.youtube.com/watch?v=RdIyStATgFE) is a good introduction to git submodules, librarian, and r10k. Based on that talk and [these slides](https://arlimus.github.io/slides/librarian.and.r10k/), I've made the following observations:

### monorepo

This is our current approach: all code is committed in one monolithic repository. This effectively makes it impossible to share code outside of the repository with anyone else, because there is private data inside, but also because it doesn't follow the standard role/profile/modules separation that makes collaboration possible at all. To work around that, I designed a workflow where we locally clone subrepos as needed, but this is clunky as it requires committing every change twice: once in the subrepo, once in the parent.

Our giant monorepo also mixes all changes together, which can be a pro *and* a con: on the one hand it's easy to see and audit all changes at once, but on the other hand, it can be overwhelming and confusing.

But it does allow us to integrate with librarian right now and is a good stopgap solution. A better solution would need to solve the "double-commit" problem and still allow us to have smaller repositories that we can collaborate on outside of our main tree.

### submodules

The talk partially covers how `git submodules` work and how hard they are to deal with. I say partially because submodules are even harder to deal with than the examples she gives. She shows how submodules are hard to add and remove, because the metadata is stored in multiple locations (`.gitmodules`, `.git/config`, `.git/modules/` and the submodule repository itself). She also mentions that submodules don't know about dependencies, and it's likely you will break your setup if you forget one step. (See [this post](https://web.archive.org/web/20171101202911/http://somethingsinistral.net/blog/git-submodules-are-probably-not-the-answer/) for more examples.)

In my experience, the biggest annoyance with submodules is the "double-commit" problem: you need to make commits in the submodule, then *redo* the commits in the parent repository to chase the head of that submodule. This does not improve on our current situation, which is that we need to do those two commits anyway in our giant monorepo.

One advantage of submodules is that they're mostly standard: everyone knows about them, even if they're not familiar with the details, and that knowledge is reusable outside of Puppet.

### librarian

Librarian is written in Ruby. It's built on top of [another library called librarian](https://github.com/applicationsonline/librarian) that is used by Ruby's [bundler](https://gembundler.com/). At the time of the talk, it was "pretty active" but unfortunately, librarian now seems to be [abandoned](https://github.com/voxpupuli/librarian-puppet/issues/48), so we might be forced to use r10k in the future, which has a quite different workflow.

One problem with librarian right now is that `librarian update` clears any existing git subrepo and re-clones it from scratch. If you have temporary branches that were not pushed remotely, all of those are lost forever. That's really bad and annoying!
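A sketch of that failure mode, with a hypothetical module name:

    cd 3rdparty/modules/foo        # a git-sourced module managed by librarian
    git checkout -b wip            # a local experiment, never pushed anywhere
    $EDITOR manifests/init.pp
    git commit -a -m'try something'
    cd -
    librarian-puppet update        # re-clones the module from scratch;
                                   # the local "wip" branch is lost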
This is by design: librarian "takes over your modules directory", as she explains in the talk, and everything comes from the Puppetfile.

Librarian does resolve dependencies recursively and stores the decided versions in a lockfile, which allows us to "see" what will happen when we update from a Puppetfile. But there's no cryptographic chain of trust between the repository where the Puppetfile is and the modules that are checked out. Unless the module is checked out from git (which isn't the default), only version range specifiers constrain which code is checked out, which gives a huge surface area for arbitrary code injection in the entire Puppet infrastructure (e.g. a MITM attack, a Forge compromise, or a hostile upstream).

### r10k

r10k was written because librarian was too slow for large deployments. But it covers more than just managing code: it also manages environments and is designed to run on the Puppet master. It doesn't have dependency resolution or a `Puppetfile.lock`, however. See [this ticket](https://github.com/puppetlabs/r10k/issues/38), closed in favor of [that one](https://tickets.puppetlabs.com/browse/RK-3).

r10k is more complex and very opinionated: it requires lots of configuration, including its own YAML file, hooks into the Puppetmaster, and can [take a while to deploy](http://garylarizza.com/blog/2014/02/18/puppet-workflow-part-3/). r10k is still in [active development](https://github.com/puppetlabs/r10k/releases) and is supported by Puppetlabs, so there's [official documentation](https://puppet.com/docs/pe/2019.1/r10k.html) in the Puppet documentation. It is often used in conjunction with librarian for dependency resolution.

One cool feature is that r10k allows you to create dynamic environments based on branch names. All you need is a single repo with a Puppetfile and r10k handles the rest. The problem, of course, is that you need to trust it's going to do the right thing. There's the security issue, but there's also the problem of resolving dependencies, and you *do* end up double-committing in the end if you use branches in sub-repositories. But maybe that is unavoidable.

(Note that there are ways of resolving dependencies with external tools, like [generate-puppetfile](https://github.com/rnelson0/puppet-generate-puppetfile) ([introduction](https://rnelson0.com/2015/11/06/introducing-generate-puppetfile-or-creating-a-ruby-program-to-update-your-puppetfile-and-fixtures-yml/)), [this hack that reformats librarian output](https://github.com/dharmabruce/lp2r10k/blob/master/lp2r10k), or [those rake tasks](https://github.com/voxpupuli/ra10ke). There's also a [Go rewrite called g10k](https://github.com/xorpaul/g10k) that is much faster, but with similar limitations.)

### git subtree

[This article](https://web.archive.org/web/20171107082413/http://somethingsinistral.net/blog/scaling-puppet-environment-deployment/) briefly mentions git subtrees from the point of view of Puppet management. It outlines how it's cool that the history of the subtree gets merged as-is into the parent repo, which gives us the best of both worlds (an individual, per-module history along with a global view in the parent repo). However, it makes rebasing in subtrees impossible, as that breaks the parent merge. You also end up with some of the disadvantages of the monorepo, in that all the code is actually committed in the parent repo, and you *do* have to commit twice as well.

### subrepo

The [git-subrepo](https://github.com/ingydotnet/git-subrepo) project describes itself as "an improvement from `git-submodule` and `git-subtree`".
It is a mix between a monorepo and a submodule system, with the subrepo metadata stored in a `.gitrepo` file. It is somewhat less well known than the other alternatives, presumably because it is newer. It is entirely written in `bash`, which I find somewhat scary. It is [not packaged in Debian yet](http://bugs.debian.org/911397) but might be soon.

It works around the "double-commit issue" by having a special `git subrepo commit` command that "does the right thing". That, in general, is its major flaw: it reproduces many git commands like `init`, `push`, and `pull` as subcommands, so you need to remember which command to run. To quote the (rather terse) manual:

> All the subrepo commands use names of actual Git commands and try to
> do operations that are similar to their Git counterparts. They also
> attempt to give similar output in an attempt to make the subrepo
> usage intuitive to experienced Git users.
>
> Please note that the commands are not exact equivalents, and do not
> take all the same arguments

Still, its feature set is impressive and it could be the perfect mix between the "submodules" and "subtree" approaches, keeping a monorepo while avoiding the double-commit issue.

### myrepos

[myrepos](https://myrepos.branchable.com/) is one of many solutions to manage multiple git repositories. It has been used in the past at my old workplace (Koumbit.org) to manage and check out multiple git repositories. Like a Puppetfile without locks, it doesn't enforce cryptographic integrity between the master repositories and the subrepositories: all it does is define remotes and their locations. Like r10k, it doesn't handle dependencies and will require extra setup, although it's much lighter than r10k.

Its main disadvantage is that it isn't well known and might seem esoteric to people. It also has weird failure modes, but it could be used in parallel with a monorepo. For example, it might allow us to set up specific remotes in subdirectories of the monorepo automatically.

### Summary table

| Approach   | Pros                       | Cons                                      | Summary                           |
|------------|----------------------------|-------------------------------------------|-----------------------------------|
| Monorepo   | Simple                     | Double-commit                             | Status quo                        |
| Submodules | Well-known                 | Hard to use, double-commit                | Not great                         |
| Librarian  | Dep resolution client-side | Unmaintained, bad integration with git    | Not sufficient on its own         |
| r10k       | Standard                   | Hard to deploy, opinionated               | To evaluate further               |
| Subtree    | "Best of both worlds"      | Still get double-commit, rebase problems  | Not sure it's worth it            |
| Subrepo    | Subtree + optional         | Unusual, new commands to learn            | To evaluate further               |
| myrepos    | Flexible                   | Esoteric                                  | Might be useful with our monorepo |

### Best practices survey

I made a survey of the community (mostly the [shared puppet modules](https://gitlab.com/shared-puppet-modules-group/) and [Voxpupuli](https://voxpupuli.org/) groups) to find out what the best current practices are.

Koumbit uses foreman/puppet, but pinned at version 10.1 because it is the last one supporting "passenger" (the puppetmaster deployment method currently available in Debian, deprecated and dropped from Puppet 6). They [patched it](https://redmine.koumbit.net/projects/theforeman-puppet/repository/revisions/5b1b0b42f2d7d7b01eacde6584d3) to support `puppetlabs/apache < 6`. They push to a bare repo on the puppet master, then they have validation hooks (the inspiration for our own hook implementation, see [issue 31226][]), and a hook deploys the code to the right branch.
They were using r10k but stopped, because they had issues when r10k would fail to deploy code atomically, leaving the puppetmaster (and all nodes!) in an unusable state. This would happen when their git servers were down and no locally cached copy was available. They also implemented branch cleanup on deletion (although that could have been done some other way). That issue was apparently reported against r10k but never got a response. They now use librarian-puppet in their custom hook. Note that it's possible r10k does not actually have that issue, because they found the issue they filed and it was... [against librarian](https://github.com/voxpupuli/librarian-puppet/issues/73)!

Some people in #voxpupuli seem to use the Puppetlabs Debian packages and therefore puppetserver, r10k, and puppetboard. Their [Monolithic master](https://voxpupuli.org/docs/monolithic/) architecture uses an external git repository, which pings the puppetmaster through a [webhook](https://github.com/voxpupuli/puppet_webhook) which deploys a [control-repo](https://puppet.com/docs/pe/latest/control_repo.html) ([example](https://github.com/puppetlabs/control-repo)) and calls r10k to deploy the code. They also use [foreman](https://www.theforeman.org/) as a node classifier. That procedure uses the following modules:

* [puppet/puppetserver](https://forge.puppet.com/puppet/puppetserver)
* [puppetlabs/puppet_agent](https://forge.puppet.com/puppetlabs/puppet_agent)
* [puppetlabs/puppetdb](https://forge.puppet.com/puppetlabs/puppetdb)
* [puppetlabs/puppet_metrics_dashboard](https://forge.puppet.com/puppetlabs/puppet_metrics_dashboard)
* [voxpupuli/puppet_webhook](https://github.com/voxpupuli/puppet_webhook)
* [r10k](https://github.com/puppetlabs/r10k) or [g10k](https://github.com/xorpaul/g10k)
* [Foreman](https://www.theforeman.org/)

They also have a [master of masters](https://voxpupuli.org/docs/master_agent/) architecture for scaling to larger setups. That said, for scaling I have found [this article](https://puppet.com/blog/scaling-open-source-puppet/) more interesting.

So, in short, it seems people are converging towards r10k with a web hook. To validate git repositories, they mirror the repositories to a private git host.
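For reference, the deployment step such a webhook triggers typically boils down to a single r10k invocation on the Puppet server, along these lines (a sketch; the exact flags vary with the r10k version and configuration):

    # re-deploy the environment matching the pushed branch,
    # including the modules listed in its Puppetfile
    r10k deploy environment production --puppetfile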