puppet.md

TPA uses [Puppet](https://puppet.com/) to manage all servers it operates. It handles
most of the configuration management of the base operating system and
some services. It is *not* designed to handle ad hoc tasks, for which
we favor the use of [fabric](howto/fabric).

[[_TOC_]]

# Tutorial

This page is long! This first section hopes to get
you running with a simple task quickly.

## Adding an IP address to the global allow list

In this tutorial, we will add an IP address to the global allow list,
on all firewalls on all machines. This is a big deal! It will allow
that IP address to access the SSH servers on all boxes and more. This
should be an **static** IP address on a trusted network.

If you have never used Puppet before or are nervous at all
about making such a change, it is a good idea to have a more
experienced sysadmin nearby to help you. They can
also confirm this tutorial is what is actually needed.

 1. To any change on the Puppet server, you will first need to clone
    the git repository:

        git clone pauli.torproject.org:/srv/puppet.torproject.org/git/tor-puppet

    This needs to be only done once.

 2. The firewall rules are defined in the `ferm` module, which lives
    in `modules/ferm`. The file you specifically need to change is
    `modules/ferm/templates/defs.conf.erb`, so open that in your
    editor of choice:

        $EDITOR modules/ferm/templates/defs.conf.erb

 3. The code you are looking for is `ADMIN_IPS`. Add a `@def` for your
    IP address and add the new macro to the `ADMIN_IPS` macro. When
    you exit your editor, git should show you a diff that looks
    something like this:

        --- a/modules/ferm/templates/defs.conf.erb
        +++ b/modules/ferm/templates/defs.conf.erb
        @@ -77,7 +77,10 @@ def $TPO_NET = (<%= networks.join(' ') %>);
         @def $linus   = ();
         @def $linus   = ($linus 193.10.5.2/32); # kcmp@adbc
         @def $linus   = ($linus 2001:6b0:8::2/128); # kcmp@adbc
        -@def $ADMIN_IPS = ($weasel $linus);
        +@def $anarcat = ();
        +@def $anarcat = ($anarcat 203.0.113.1/32); # home IP
        +@def $anarcat = ($anarcat 2001:DB8::DEAD/128 2001:DB8:F00F::/56); # home IPv6
        +@def $ADMIN_IPS = ($weasel $linus $anarcat);


         @def $BASE_SSH_ALLOWED = ();

 4. Then you can commit this and *push*:

        git commit -m'add my home address to the allow list' && git push

 5. Then you should login to one of the hosts and make sure the code
    applies correctly:

        ssh -tt perdulce.torproject.org sudo puppet agent -t

Puppet shows colorful messages. If nothing is red and it returns
correctly, you are done. If that doesn't work, go back to step 2. If
that doesn't work, ask for help from your colleague in the Tor
sysadmin team.

If this works, congratulations, you have made your first change across
the entire Puppet infrastructure! You might want to look at the rest
of the documentation to learn more about how to do different tasks and
how things are setup. A key "How to" we recommend is the `Progressive
deployment` section below, which will teach you how to make a change
like the above while making sure you don't break anything even if it
affects a lot of machines.

# How to guides

## Modifying an existing configuration

For new deployments, this is *NOT* the preferred method. For example,
if you are deploying new software that is not already in use in our
infrastructure, do *not* follow this guide and instead follow the
`Adding a new module` guide below.

If you are touching an *existing* configuration, things are much
simpler however: you simply go to the module where the code already
exists and make changes. You `git commit` and `git push` the code,
then immediately run `puppet agent -t` on the affected node.

Look at the `File layout` section above to find the right piece of
code to modify. If you are making changes that potentially affect more
than one host, you should also definitely look at the `Progressive
deployment` section below.

## Adding a new module

This is a broad topic, but let's take the Prometheus monitoring system
as an example which followed the [role/profile/module][]
pattern.

First, the [Prometheus modules on the Puppet forge][] were evaluated
for quality and popularity. There was a clear winner there: the
[Prometheus module][] from [Vox Populi][] had hundreds of thousands
more downloads than the [next option][], which was deprecated.

[next option]: https://forge.puppet.com/brutus777/prometheus
[Vox Populi]: https://voxpupuli.org/
[Prometheus module]: https://forge.puppet.com/puppet/prometheus
[Prometheus modules on the Puppet forge]: https://forge.puppet.com/modules?q=prometheus

Next, the module was added to the Puppetfile (in
`3rdparty/Puppetfile`):

    mod 'puppet-prometheus', '6.4.0'

... and librarian was ran:

    librarian-puppet install

This fetched a lot of code from the Puppet forge: the stdlib, archive
and system modules were all installed or updated. All those modules
were audited manually, by reading each file and looking for obvious
security flaws or back doors. Then the code was committed into git:

    git add 3rdparty
    git commit -m'install prometheus module after audit'

Then the module was configured in a profile, in `modules/profile/manifests/prometheus/server.pp`:

    class profile::prometheus::server {
      class {
        'prometheus::server':
          # follow prom2 defaults
          localstorage        => '/var/lib/prometheus/metrics2',
          storage_retention   => '15d',
      }
    }

The above contains our local configuration for the upstream
`prometheus::server` class installed in the `3rdparty` directory. In
particular, it sets a retention period and a different path for the
metrics, so that they follow the new Prometheus 2.x defaults.

Then this profile was added to a *role*, in
`modules/roles/manifests/monitoring.pp`:

    # the monitoring server
    class roles::monitoring {
      include profile::prometheus::server
    }

Notice how the role does not refer to any implementation detail, like
that the monitoring server uses Prometheus. It looks like a trivial,
useless, class but it can actually grow to include *multiple*
profiles.

Then that role is added to the Hiera configuration of the monitoring
server, in `hiera/nodes/hetzner-nbg1-01.torproject.org.yaml`:

    classes:
      - roles::monitoring

And Puppet was ran on the host, with:

    puppet --enable ; puppet agent -t --noop ; puppet --disable "testing prometheus deployment"

This led to some problems as the upstream module doesn't support
installing from Debian packages. Support for Debian was added to the
code in `3rdparty/modules/prometheus`, and committed into git:

    emacs 3rdparty/modules/prometheus/manifests/*.pp # magic happens
    git commit -m'implement all the missing stuff' 3rdparty
    git push

And the above puppet command-line was ran again, continuing that loop
until things were good.

If you need to deploy the code to multiple hosts, see the `Progressive
deployment` section below. To contribute changes back upstream (and
you should do so), see the section right below.

## Contributing changes back upstream

For simple changes, the above workflow works well, but eventually it
is preferable to actually fork the upstream repository and operate on our
fork until the changes are merged upstream.

First, the modified module is moved out of the way:

    mv 3rdparty/modules/prometheus{,.orig}

The module is then forked on GitHub or wherever it is hosted, and then
added to the Puppetfile:

    mod 'puppet-prometheus',
        :git => 'https://github.com/anarcat/puppet-prometheus.git',
        :branch => 'deploy'

Then Librarian is ran again to fetch that code:

    librarian-puppet install

Because Librarian is a little dumb, it might checkout your module in
"detached head" mode, in which case you will want to fix the checkout:

    cd 3rdparty/modules/prometheus
    git checkout deploy
    git reset --hard origin/deploy
    git pull

Note that the `deploy` branch here is a merge of all the different
branches proposed upstream in different pull requests, but it could
also be the `master` branch or a single branch if only a single pull
request was sent.

Since you now have a clone of the upstream repository, you can push
and pull normally with upstream. When you make a change, however, you
need to commit (and push) the change *both* in the sub-repository and the
main repository:

    cd 3rdparty/modules/prometheus
    $EDITOR manifests/init.pp # more magic stuff
    git commit -m'change the frobatz to a argblu'
    git push
    cd ..
    git commit -m'change the frobatz to a argblu'
    git push

Often, I make commits directly in our main Puppet repository, without
pushing to the third party fork, until I am happy with the code, and
then I craft a nice pretty commit that can be pushed upstream,
reversing that process:

    $EDITOR 3rdparty/prometheus/manifests/init.pp # dirty magic stuff
    git commit -m'change the frobatz to a quuxblah'
    git push
    # see if that works, generally not
    git commit -m'rah. wanted a quuxblutz'
    git push
    # now we are good, update our pull request
    cd 3rdparty/modules/prometheus
    git commit -m'change the frobatz to a quuxblutz'
    git push

It's annoying to double-commit things, but I haven't found a best way
to do so just yet. This problem is further discussed in [ticket #29387][].

Also note that when you update code like this, the `Puppetfile` does
not change, but the `Puppetfile.lock` file *does* change. The `GIT.sha`
parameter needs to be updated. This can be done by hand, but since
that is error-prone, you might want to simply run this to update
modules:

    librarian-puppet update

This will *also* update dependencies so make sure you audit those
changes before committing and pushing.

## Running tests

Ideally, Puppet modules have a test suite. This is done with
[rspec-puppet](https://rspec-puppet.com/) and [rspec-puppet-facts](https://github.com/mcanevet/rspec-puppet-facts). This is not very well
documented upstream, but it's apparently part of the [Puppet
Development Kit](https://puppet.com/docs/pdk/1.x/pdk.html) (PDK). Anyways: assuming tests exists, you will
want to run some tests before pushing your code upstream, or at least
upstream might ask you for this before accepting your changes. Here's
how to get setup:

    sudo apt install ruby-rspec-puppet ruby-puppetlabs-spec-helper ruby-bundler
    bundle install --path vendor/bundle

This installs some basic libraries, system-wide (Ruby bundler and the
rspec stuff). Unfortunately, required Ruby code is rarely all present
in Debian and you still need to install extra gems. In this case we
set it up within the `vendor/bundle` directory to isolate them from
the global search path.

Finally, to run the tests, you need to wrap your invocation with
`bundle exec`, like so:

    bundle exec rake test

## Listing all hosts under puppet

This will list all active hosts known to the Puppet master:

    ssh -t pauli.torproject.org 'sudo -u postgres psql puppetdb -P pager=off -A -t -c "SELECT c.certname FROM certnames c WHERE c.deactivated IS NULL"'

The following will list all hosts under Puppet and their `virtual`
value:

    ssh -t pauli.torproject.org "sudo -u postgres psql puppetdb -P pager=off -F',' -A -t -c \"SELECT c.certname, value_string FROM factsets fs INNER JOIN facts f ON f.factset_id = fs.id INNER JOIN fact_values fv ON fv.id = f.fact_value_id INNER JOIN fact_paths fp ON fp.id = f.fact_path_id INNER JOIN certnames c ON c.certname = fs.certname WHERE fp.name = 'virtual' AND c.deactivated IS NULL\""  | tee hosts.csv

The resulting file is a Comma-Separated Value (CSV) file which can be
used for other purposes later.

Possible values of the `virtual` field can be obtain with a similar
query:

    ssh -t pauli.torproject.org "sudo -u postgres psql puppetdb -P pager=off -A -t -c \"SELECT DISTINCT value_string FROM factsets fs INNER JOIN facts f ON f.factset_id = fs.id INNER JOIN fact_values fv ON fv.id = f.fact_value_id INNER JOIN fact_paths fp ON fp.id = f.fact_path_id WHERE fp.name = 'virtual';\""

The currently known values are: `kvm`, `physical`, and `xenu`.

As a bonus, this query will show the number of hosts running each release:

    SELECT COUNT(c.certname), value_string FROM factsets fs INNER JOIN facts f ON f.factset_id = fs.id INNER JOIN fact_values fv ON fv.id = f.fact_value_id INNER JOIN fact_paths fp ON fp.id = f.fact_path_id INNER JOIN certnames c ON c.certname = fs.certname WHERE fp.name = 'lsbdistcodename' AND c.deactivated IS NULL GROUP BY value_string;

### Other ways of extracting a host list

 * Using the [PuppetDB API][]:

        curl -s -G http://localhost:8080/pdb/query/v4/facts  | jq -r ".[].certname"

   The [fact API][] is quite extensive and allows for very complex
   queries. For example, this shows all hosts with the `apache2` fact
   set to `true`:

        curl -s -G http://localhost:8080/pdb/query/v4/facts --data-urlencode 'query=["and", ["=", "name", "apache2"], ["=", "value", true]]' | jq -r ".[].certname"

   This will list all hosts sorted by their report date, older first,
   followed by the timestamp, space-separated:

        curl -s -G http://localhost:8080/pdb/query/v4/nodes  | jq -r 'sort_by(.report_timestamp) | .[] | "\(.certname) \(.report_timestamp)"' | column -s\  -t

   This will list all hosts with the `roles::static_mirror` class:

        curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=inventory[certname] { resources { type = "Class" and title = "Roles::Static_mirror" }} ' | jq .[].certname

   This will show all hosts running Debian buster:

        curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=nodes { facts { name = "lsbdistcodename" and value = "buster" }}' | jq .[].certname

 * Using [howto/cumin](howto/cumin)

 * Using LDAP:

        HOSTS=$(ssh alberti.torproject.org 'ldapsearch -h db.torproject.org -x -ZZ -b dc=torproject,dc=org -LLL "hostname=*.torproject.org" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort')
        for i in `echo $HOSTS`; do mkdir hosts/x-$i 2>/dev/null || continue; echo $i; ssh $i ' ...'; done

    the `mkdir` is so that I can run the same command in many terminal
    windows and each host gets only one once

 [PuppetDB API]: https://puppet.com/docs/puppetdb/4.3/api/index.html
 [fact API]: https://puppet.com/docs/puppetdb/4.3/api/query/v4/facts.html

## Batch jobs on all hosts

With that trick, a job can be ran on all hosts with
[parallel-ssh][], for example, check the `uptime`:

    cut -d, -f1 hosts.hsv | parallel-ssh -i -h /dev/stdin uptime

This would do the same, but only on physical servers:

    grep 'physical$' hosts.hsv | cut -d -f1 | parallel-ssh -i -h /dev/stdin uptime

This would fetch the `/etc/motd` on all machines:

    cut -d -f1 hosts.csv | parallel-slurp -h /dev/stdin -L motd /etc/motd motd

To run batch commands through `sudo` that requires a password, you will need to fool both `sudo` and ssh a little more:

    cut -d -f1 hosts.csv | parallel-ssh -P -I -i -x -tt -h /dev/stdin -o pvs sudo pvs

You should then type your password then Control-d. Warning: this will
show your password on your terminal and probably in the logs as well.

Batch jobs can also be ran on all Puppet hosts with Cumin:

    ssh -N -L8080:localhost:8080 pauli.torproject.org &
    cumin '*' uptime

See [howto/cumin](howto/cumin) for more examples.

 [parallel-ssh]: https://parallel-ssh.org/

## Progressive deployment

If you are making a major change to the infrastructure, you may want
to deploy it progressively. A good way to do so is to include the new
class manually in the node configuration, say in
`hiera/nodes/$fqdn.yaml`:

    classes:
      - my_new_class

Then you can check the effect of the class on the host with the
`--noop` mode. Make sure you disable Puppet so that automatic runs do
not actually execute the code, with:

    puppet agent --disable "testing my_new_class deployment"

Then the new manifest can be simulated with this command:

    puppet agent --enable ; puppet agent -t --noop ; puppet agent --disable "testing my_new_class deployment"

Examine the output and, once you are satisfied, you can re-enable the
agent and actually run the manifest with:

    puppet agent --enable ; puppet agent -t

If the change is *inside* an existing class, that change can be
enclosed in a class parameter and that parameter can be passed as an
argument from Hiera. This is how the transition to a managed
`/etc/apt/sources.list` file was done:

 1. first, a parameter was added to the class that would remove the
    file, defaulting to `false`:

        class torproject_org(
          Boolean $manage_sources_list = false,
        ) {
          if $manage_sources_list {
            # the above repositories overlap with most default sources.list
            file {
              '/etc/apt/sources.list':
                ensure => absent,
            }
          }
        }

 2. then that parameter was enabled on one host, say in
    `hiera/nodes/brulloi.torproject.org.yaml`:

        torproject_org::manage_sources_list: true

 3. Puppet was run on that host using the simulation mode:

        puppet agent --enable ; puppet agent -t --noop ; puppet agent --disable "testing my_new_class deployment"

 4. when satisfied, the real operation was done:

        puppet agent --enable ; puppet agent -t --noop

 5. then this was added to two other hosts, and Puppet was ran there

 6. finally, all hosts were checked to see if the file was present on
    hosts and had any content, with [howto/cumin](howto/cumin) (see above for
    alternative way of running a command on all hosts):

        cumin '*' 'du /etc/apt/sources.list'

 7. since it was missing everywhere, the parameter was set to `true`
    by default and the custom configuration removed from the three
    test nodes

 8. then Puppet was ran by hand everywhere, using Cumin, with a batch
    of 5 hosts at a time:

        cumin -o txt -b 5 '*' 'puppet agent -t'

    because Puppet returns a non-zero value when changes are made,
    this will above when any one host in a batch of 5 will actually
    operate a change. You can then examine the output and see if the
    change is legitimate or abort the configuration change.

## Debugging things

When a Puppet manifest is not behaving as it should, the first step is
to run it by hand on the host:

    puppet agent -t

If that doesn't yield enough information, you can see pretty much
everything that Puppet does with the `--debug` flag. This will, for
example, include `Exec` resources `onlyif` commands and allow you to
see why they do not work correctly (a common problem):

    puppet agent -t --debug

Finally, some errors show up only on the Puppet server: you can look in
`/var/log/daemon.log` there for errors that will only show up there.

Connecting to the PuppetDB database itself can sometimes be easier
than trying to operate the API. There you can inspect the entire thing
as a normal SQL database, use this to connect:

    sudo -u postgres psql puppetdb

It's possible exported resources do surprising things sometimes. It is
useful to look at the actual PuppetDB to figure out which tags
exported resources have. For example, this query lists all exported
resources with `troodi` in the name:

    SELECT certname_id,type,title,file,line,tags FROM catalog_resources WHERE exported = 't' AND title LIKE '%troodi%';

Keep in mind that there are [automatic tags](https://puppet.com/docs/puppet/6.4/lang_tags.html) in exported resources
which can complicate things.

## Password management

If you need to set a password in a manifest, there are special
functions to handle this. We do not want to store passwords directly
in Puppet source code, for various reasons: it is hard to erase
because code is stored in git, but also, ultimately, we want to
publish that source code publicly.

We have two mechanisms on how to do this now: a HKDF to generate
passwords by hashing a common secret, and Trocla, which generates
random passwords and stores the hash or, if necessary, the clear-text
in a YAML file.. The HKDF function is deprecated and should be
[replaced by Trocla][trocla-migration] eventually.

[trocla-migration]: https://bugs.torproject.org/30009

### hkdf

NOTE: this procedure is DEPRECATED and Trocla should be used instead,
see the [trocla migration ticket][trocla-migration] for details.

Old passwords in Puppet are managed through a [Key Derivation
Function][] (KDF), more specifically a [hash-based KDF][] that takes a
secret stored on the Puppet master (in `/etc/puppet/secret`)
concatenates this with a unique token picked by the caller, and
generates a secret unique to that token. An example:

[hash-based KDF]: https://en.wikipedia.org/wiki/HKDF
[Key Derivation Function]: https://en.wikipedia.org/wiki/Key_derivation_function

    $secret = hkdf('/etc/puppet/secret', "dip-${::hostname}-base-secret")

This generates a unique passwords for the given token. The password is
then used, in clear text, by the puppet client as appropriate.

The function is an implementation of [RFC5869][], a [SHA256][]-based
HKDF taken from an earlier version of [John Downey's Rubygems
implementation][].

[John Downey's Rubygems implementation]: https://rubygems.org/gems/hkdf
[RFC5869]: https://tools.ietf.org/html/rfc5869
[SHA256]: https://en.wikipedia.org/wiki/SHA-2

### Trocla

[Trocla][] is another password-management solution that takes another
approach. With Trocla, each password is generated on the fly from a
secure entropy source ([Ruby's SecureRandom module][]) and stored
inside a state file (in `/var/lib/trocla/trocla_data.yml`, configured
`/etc/puppet/troclarc.yaml`) on the Puppet master.

Trocla can return "hashed" versions of the passwords, so that the
plain text password is never visible from the client. The plain text
can still be stored on the Puppet master, or it can be deleted once
it's been transmitted to the user or another password manager. This
makes it possible to have Trocla not keep any secret at all.

[Ruby's SecureRandom module]: https://ruby-doc.org/stdlib-1.9.3/libdoc/securerandom/rdoc/SecureRandom.html
[Trocla]: https://github.com/duritong/trocla

This piece of code will generate a [bcrypt][]-hashed password for the
Grafana admin, for example:

    $grafana_admin_password = trocla('grafana_admin_password', 'bcrypt')

The plain-text for that password will never leave the Puppet master. it
will still be stored on the Puppet master, and you can see the value
with:

    trocla get grafana_admin_password plain

... on the command-line.

[bcrypt]: https://en.wikipedia.org/wiki/Bcrypt

A password can also be set with this command:

    trocla set grafana_guest_password plain

Note that this might *erase* other formats for this password, although
those will get regenerated as needed.

Also note that `trocla get` will fail if the particular password or
format requested does not exist. For example, say you generate a
plain-text password with and then get the `bcrypt` version:

    trocla create test plain
    trocla get test bcrypt

This will return the empty string instead of the hashed
version. Instead, use `trocla create` to generate that password. In
general, it's safe to use `trocla create` as it will reuse existing
password. It's actually how the `trocla()` function behaves in Puppet
as well.


## Getting information from other nodes

### Exported resources

Our Puppet configuration supports [exported resources](https://puppet.com/docs/puppet/latest/lang_exported.html), a key
component of complex Puppet deployments. Exported resources allow one
host to define a configuration that will be *exported* to the Puppet
server and then *realized* on another host.

We commonly use this to punch holes in the firewall between nodes. For
example, this manifest in the `roles::puppetmaster` class:

    @@ferm::rule::simple { "roles::puppetmaster-${::fqdn}":
        tag         => 'roles::puppetmaster',
        description => 'Allow Puppetmaster access to LDAP',
        port        => ['ldap', 'ldaps'],
        saddr       => $base::public_addresses,
      }

... exports a firewall rule that will, later, allow the Puppet server
to access the LDAP server (hence the `port => ['ldap', 'ldaps']`
line). This rule doesn't take effect on the host applying the
`roles::puppetmaster` class, but only on the LDAP server, through this
rather exotic syntax:

    Ferm::Rule::Simple <<| tag == 'roles::puppetmaster' |>>

This tells the LDAP server to apply whatever rule was exported with
the `@@` syntax and the specified `tag`. Any Puppet resource can be
exported and realized that way.

Note that there are security implications with collecting exported
resources: it delegates the resource specification of a node to
another. So, in the above scenario, the Puppet master could decide to
open *other* ports on the LDAP server (say, the SSH port), because it
exports the port number and the LDAP server just blindly applies the
directive. A more secure specification would explicitly specify the
sensitive information, like so:

    Ferm::Rule::Simple <<| tag == 'roles::puppetmaster' |>> {
        port => ['ldap'],
    }

But then a compromised server could send a different `saddr` and
there's nothing the LDAP server could do here: it cannot override the
address because it's exactly the information we need from the other
server...

### PuppetDB lookups

A common pattern in Puppet is to extract information from host A and
use it on host B. The above "exported resources" pattern can do this
for files, commands and many more resources, but sometimes we just
want a tiny bit of information to embed in a configuration file. This
could, in theory, be done with an exported [concat](https://forge.puppet.com/puppetlabs/concat) resource, but
this can become prohibitively complicated for something as simple as
an allowed IP address in a configuration file.

For this we use the [puppetdbquery module](https://github.com/dalen/puppet-puppetdbquery), which allows us to do
elegant queries against PuppetDB. For example, this will extract the
IP addresses of all nodes with the `roles::gitlab` class applied:

    $allow_ipv4 = query_nodes('Class[roles::gitlab]', 'networking.ip')
    $allow_ipv6 = query_nodes('Class[roles::gitlab]', 'networking.ip6')

This code, in `profile::kgb_bot`, propagates those variables into a
template through the `allowed_addresses` variable, which gets expanded
like this:

    <% if $allow_addresses { -%>
    <% $allow_addresses.each |String $address| { -%>
        allow <%= $address %>;
    <% } -%>
        deny all;
    <% } -%>

Note that there is a potential security issue with that approach. The
same way that exported resources trust the exporter, we trust that the
node exported the right fact. So it's in theory possible that a
compromised Puppet node exports an evil IP address in the above
example, granting access to an attacker instead of the proper node.

Also note that this will eventually fail when the node goes down:
after a while, resources are expired from the PuppetDB server and the
above query will return an empty list. This seems reasonable: we do
want to eventually revoke access to nodes that go away, but it's still
something to keep in mind.

Note that this could also be implemented with a `concat` exported
resource, but much harder because you would need some special case
when no resource is exported (to avoid adding the `deny`) and take
into account that other configuratinos might also be needed in the
file. It would have the same security and expiry issues anyways.

### Puppet query language

Note that there's also a way to do those queries without a Forge
module, through the [Puppet query language](https://puppet.com/docs/puppetdb/5.2/api/query/tutorial-pql.html) and the
`puppetdb_query` function. The problem with that approach is that the
function is not very well documented and the query syntax is somewhat
obtuse. For example, this is what I came up with to do the equivalent
of the `query_nodes` call, above:

    $allow_ipv4 = puppetdb_query(
      ['from', 'facts',
        ['and',
          ['=', 'name', 'networking.ip'],
          ['in', 'certname',
            ['extract', 'certname',
              ['select_resources',
                ['and',
                  ['=', 'type', 'Class'],
                  ['=', 'title', 'roles::gitlab']]]]]]])

It seems like I did something wrong, because that returned an empty
array. I could not figure out how to debug this, and apparently I
neded more functions (like `map` and `filter`) to get what I wanted
(see [this gist](https://gist.github.com/bastelfreak/b9620fa1892ebcc659c442b115db34f9)). I gave up at that point: the `puppetdbquery`
abstraction is much cleaner and usable.

If you are merely looking for a hostname, however, PQL might be a
little more manageable. For example, this is how the
`roles::onionoo_frontend` class finds its backends to setup the
[IPsec](ipsec) network:

    $query = 'nodes[certname] { resources { type = "Class" and title = "Roles::Onionoo_backend" } }'
    $peer_names = sort(puppetdb_query($query).map |$value| { $value["certname"] })
    $peer_names.each |$peer_name| {
      $network_tag = [$::fqdn, $peer_name].sort().join('::')
      ipsec::network { "ipsec::${network_tag}":
        peer_networks => $base::public_addresses
      }
    }

### LDAP lookups

Our Puppet server is hooked up to the LDAP server and has information
about the hosts defined there. Information about the node running the
manifest is available in the global `$nodeinfo` variable, but there is
also a `$allnodeinfo` parameter with information about every host
known in LDAP.

A simple example of how to use the `$nodeinfo` variable is how the
`base::public_address` and `base::public_address6` parameters -- which
represent the IPv4 and IPv6 public address of a node -- are
initialized in the `base` class:

    class base(
      Stdlib::IP::Address $public_address            = filter_ipv4(getfromhash($nodeinfo, 'ldap', 'ipHostNumber'))[0],
      Optional[Stdlib::IP::Address] $public_address6 = filter_ipv6(getfromhash($nodeinfo, 'ldap', 'ipHostNumber'))[0],
    ) {
      $public_addresses = [ $public_address, $public_address6 ].filter |$addr| { $addr != undef }
    }

This loads the `ipHostNumber` field from the `$nodeinfo` variable, and
uses the `filter_ipv4` or `filter_ipv6` functions to extract the IPv4
or IPv6 addresses respectively.

A good example of the `$allnodeinfo` parameter is how the
`roles::onionoo_frontend` class finds the IP addresses of its
backend. After having loaded the host list from PuppetDB, it then uses
the parameter to extract the IP address:

    $backends = $peer_names.map |$name| {
        [
          $name,
          $allnodeinfo[$name]['ipHostNumber'].filter |$a| { $a =~ Stdlib::IP::Address::V4 }[0]
        ] }.convert_to(Hash)

Such a lookup is considered more secure than going through PuppetDB as
LDAP is a trusted data source. It is also our source of truth for this
data, at the time of writing.

### Hiera lookups

For more security-sensitive data, we should use a trusted data source
to extract information about hosts. We do this through Hiera lookups,
with the [lookup](https://puppet.com/docs/puppet/latest/function.html#lookup) function. A good example is how we populate the
SSH public keys on all hosts, for the admin user. In the
`profile::ssh` class, we do the following:

    $keys = lookup('profile::admins::keys', Data, 'hash')

This will lookup the `profile::admin::keys` field in Hiera, which is a
trusted source because under the control of the Puppet git repo. This
refers to the following data structure in `hiera/common.yaml`:

    profile::admins::keys:
      anarcat:
        type: "ssh-rsa"
        pubkey: "AAAAB3[...]"

The key point with Hiera is that it's a "hierarchical" data structure,
so each host can have its own override. So in theory, the above keys
could be overriden per host. Similarly, the IP address information for
each host could be stored in Hiera instead of LDAP. But in practice,
we do not currently do this and the per-host information is limited.

## Revoking and generating a new certificate for a host

Revocation procedures problems were discussed in [33587][] and [33446][].

[33587]: https://bugs.torproject.org/33587
[33446]: https://gitlab.torproject.org/legacy/trac/-/issues/33446#note_2349434

 1. Clean the certificate on the master

        puppet cert clean host.torproject.org

 2. Clean the certificate on the client:

        find /var/lib/puppet/ssl -name host.torproject.org.pem -delete

 3. Then run the bootstrap script on the client from
    `tsa-misc/installer/puppet-bootstrap-client ` and get a new checksum

 4. Run `tpa-puppet-sign-client` on the master and pass the checksum

 5. Run `puppet agent -t` to have puppet running on the client again.

## Pager playbook

### catalog run: PuppetDB warning: did not update since...

If you see an error like:

    Check last node runs from PuppetDB WARNING - cupani.torproject.org did not update since 2020-05-11T04:38:54.512Z

It can also be eventually accompanied with the puppet server reporting
the same problem:

    Subject: ** PROBLEM Service Alert: pauli/puppet - all catalog runs is WARNING **
    [...]
    Check last node runs from PuppetDB WARNING - cupani.torproject.org did not update since 2020-05-11T04:38:54.512Z

One of the following is happening, in decreasing likeliness:

 1. the node's Puppet manifest has an error of some sort that makes it
    impossible to run the catalog
 2. the node is down and has failed to report since the last time
    specified
 3. the Puppet **server** is down and **all** nodes will fail to
    report in the same way (in which case a lot more warnings will
    show up, and other warnings about the server will come in)

The first situation will usually happen after someone pushed a commit
introducing the error. We try to keep all manifests compiling all the
time and such errors should be immediately fixed. Look at the history
of the Puppet source tree and try to identify the faulty
commit. Reverting such a commit is acceptable to restore the service.

The second situation can happen if a node is in maintenance for an
extended duration. Normally, the node will recover when it goes back
online. If a node is to be permanently retired, it should be removed
from Puppet, using the [host retirement procedures][retire-a-host].

Finally, if the main Puppet **server** is down, it should definitely
be brought back up. See disaster recovery, below.

In any case, running the Puppet agent on the affected node should give
more information:

    ssh NODE puppet agent -t

### Problems pushing to the Puppet server

Normally, when you push new commits to the Puppet server, a hook runs
and updates the working copy. But sometimes this fails with an error
like:

    remote: error: unable to unlink old 'modules/ipsec/misc/config.yaml': Permission denied.

The problem, in such cases, is that the files in the `/etc/puppet/`
checkout are not writable by your user. It could also happen that the
repository itself (in `/srv/puppet.torproject.org/git/tor-puppet`)
could have permission issues.

This problem is described in [issue 29663][] and is due to someone
not pushing properly before you. To fix the permissions, try:

    sudo chown -R root:adm /etc/puppet
    sudo chown :puppet /etc/puppet/secret
    sudo chmod -R g+rw /etc/puppet
    sudo chmod g-w /etc/puppet/secret

[issue 29663]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29663

A similar recipe could be applied to the git repository, as
needed. Hopefully this will be resolved when we start deploying with a
role account instead.

## Disaster recovery

Ideally, the main Puppet server would be deployable from Puppet
bootstrap code and the [main installer](new-machine). But in practice, much of
its configuration was done manually over the years and it MUST be
restored from [backups](backup) in case of failure.

This probably includes a restore of the [PostgreSQL](postgresql) database
backing the PuppetDB server as well. It's *possible* this step *could*
be skipped in an emergency, because most of the information in
PuppetDB is a cache of exported resources, reports and facts. But it
could also break hosts and make converging the infrastructure
impossible, as there might be dependency loops in exported resources.

In particular, the Puppet server needs access to the LDAP server, and
that is configured in Puppet. So if the Puppet server needs to be
rebuilt from scratch, it will need to be manually allowed access to
the LDAP server to compile its manifest.

So it is strongly encouraged to restore the PuppetDB server database
as well in case of disaster.

This also applies in case of an IP address change of the Puppet
server, in which case access to the LDAP server needs to be manually
granted before the configuration can run and converge. This is a known
bootstrapping issue with the Puppet server and is further discussed in
the [design section](#LDAP-integration).

# Reference

This documents generally how things are setup.

## Installation

TODO. It is not yet clear how the Puppetmaster was setup or how to
build a new one. The interactions with other tools like Nagios and
LDAP especially need to be documented.

## SLA

No formal SLA is defined. Puppet runs on a fairly slow `cron` job so
doesn't have to be highly available right now. This could change in
the future if we rely more on it for deployments.

## Design

TODO. review
<https://bluesock.org/~willkg/blog/dev/auditing_projects.html> and
expand this design accordingly.

TODO: add a lead here.

<!-- how this is built -->
<!-- should reuse and expand on the "proposed solution", it's a -->
<!-- "as-built" documented, whereas the "Proposed solution" is an -->
<!-- "architectural" document, which the final result might differ -->
<!-- from, sometimes significantly -->

### Glossary

This is a subset of the [Puppet glossary](https://puppet.com/docs/puppet/latest/glossary.html) to quickly get you
started with the vocabulary used in this document.

 * **Puppet node**: a machine (virtual or physical) running Puppet
 * **Manifest**: Puppet source code
 * **Catalog**: a set of compiled of Puppet source which gets applied
   on a **node** by a **Puppet agent**
 * **Puppet agents**: the Puppet program that runs on all nodes to
   apply manifests
 * **Puppet server**: the server which all **agents** connect to to
   fetch their **catalog**, also known as a **Puppet master** in older
   Puppet versions (pre-6)
 * **Facts**: information collected by Puppet agents on nodes, and
   exported to the Puppet server
 * **Reports**: log of changes done on nodes recorded by the Puppet
   server
 * **[PuppetDB](https://puppet.com/docs/puppetdb/) server**: an application server on top of a PostgreSQL
   database providing an [API](https://puppet.com/docs/puppetdb/5.2/api/index.html) to query various resources like node
   names, facts, reports and so on

### File layout

The Puppet server and PuppetDB server run on
`pauli.torproject.org`. That is where the main git repository
(`tor-puppet`) lives, in
`/srv/puppet.torproject.org/git/tor-puppet`. That repository has hooks
to populate `/etc/puppet` which is the live checkout from which the
Puppet server compiles its catalogs.
  
All paths below are relative to the root of that git repository.

- `3rdparty/modules` include modules that are shared publicly and do
  not contain any TPO-specific configuration. There is a `Puppetfile`
  there that documents where each module comes from and that can be
  maintained with [r10k][] or [librarian][].

  [librarian]: https://librarian-puppet.com/
  [r10k]: https://github.com/puppetlabs/r10k/

- `modules` includes roles, profiles, and classes that make the bulk
  of our configuration.

- each node is assigned a "role" through Hiera, in
  `hiera/nodes/$FQDN.yaml`

  To be more accurate, Hiera assigns a Puppet class to each node,
  although each node should have only one special purpose class, a
  "role", see [issue 40030][] for progress on that transition.

[issue 40030]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40030

- The `torproject_org` module
  (`modules/torproject_org/manifests/init.pp`) performs basic host
  initialisation, like configuring Debian mirrors and APT sources,
  installing a base set of packages, configuring puppet and timezone,
  setting up a bunch of configuration files and running `ud-replicate`.

- There is also the `hoster.yaml` file
  (`modules/torproject_org/misc/hoster.yaml`) which defines hosting