### Finding exported resources with SQL queries

Connecting to the PuppetDB database directly can sometimes be easier
than operating the API. There you can inspect the entire thing as a
normal SQL database; use this to connect:

    sudo -u postgres psql puppetdb

Exported resources sometimes do surprising things. It is useful to
look at the actual PuppetDB to figure out which tags exported
resources carry. For example, this query lists all exported resources
with `troodi` in the title:

    SELECT certname_id,type,title,file,line,tags FROM catalog_resources WHERE exported = 't' AND title LIKE '%troodi%';

Keep in mind that there are [automatic tags](https://puppet.com/docs/puppet/6.4/lang_tags.html) in exported resources
which can complicate things.
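
Tags are stored in the `tags` column. Assuming that column is a
PostgreSQL array, as in recent PuppetDB schemas, a query along these
lines should list exported resources carrying a given tag (the tag
name here is just an example):

    SELECT certname_id,type,title,tags FROM catalog_resources WHERE exported = 't' AND 'bacula::client' = ANY(tags);
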
### Finding all instances of a deployed resource

Say you want to [deprecate cron](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41303). You want to see where the `Cron`
resource is used to understand how hard a problem this is.

This will show you the resource titles and how many instances of each
there are:

    SELECT count(*),title FROM catalog_resources WHERE type = 'Cron' GROUP BY title ORDER by count(*) DESC;

Example output:

    puppetdb=# SELECT count(*),title FROM catalog_resources WHERE type = 'Cron' GROUP BY title ORDER by count(*) DESC;
     count |              title              
    -------+---------------------------------
        87 | puppet-cleanup-clientbucket
        81 | prometheus-lvm-prom-collector-
         9 | prometheus-postfix-queues
         6 | docker-clear-old-images
         5 | docker-clear-nightly-images
         5 | docker-clear-cache
         5 | docker-clear-dangling-images
         2 | collector-service
         2 | onionoo-bin
         2 | onionoo-network
         2 | onionoo-service
         2 | onionoo-web
         2 | podman-clear-cache
         2 | podman-clear-dangling-images
         2 | podman-clear-nightly-images
         2 | podman-clear-old-images
         1 | update rt-spam-blocklist hourly
         1 | update torexits for apache
         1 | metrics-web-service
         1 | metrics-web-data
         1 | metrics-web-start
         1 | metrics-web-start-rserve
         1 | metrics-network-data
         1 | rt-externalize-attachments
         1 | tordnsel-data
         1 | tpo-gitlab-backup
         1 | tpo-gitlab-registry-gc
         1 | update KAM ruleset
    (28 rows)

A more exhaustive list of each resource and where it's declared:

    SELECT certname_id,type,title,file,line,tags FROM catalog_resources WHERE type = 'Cron';

Which host uses which resource:

    SELECT certname,title FROM catalog_resources JOIN certnames ON certname_id=certnames.id WHERE type = 'Cron' ORDER BY certname;

Top 10 hosts using the resource:

    puppetdb=# SELECT certname,count(title) FROM catalog_resources JOIN certnames ON certname_id=certnames.id WHERE type = 'Cron' GROUP BY certname ORDER BY count(title) DESC LIMIT 10;
                 certname              | count 
    -----------------------------------+-------
     meronense.torproject.org          |     7
     forum-01.torproject.org           |     7
     ci-runner-x86-02.torproject.org   |     7
     onionoo-backend-01.torproject.org |     6
     onionoo-backend-02.torproject.org |     6
     dangerzone-01.torproject.org      |     6
     btcpayserver-02.torproject.org    |     6
     chi-node-14.torproject.org        |     6
     rude.torproject.org               |     6
     minio-01.torproject.org           |     6
    (10 rows)

### Finding exported resources with PuppetDB

This query will look for exported resources with the `type`
`Backupninja::Server::Account` (which can be a class, define, or
builtin resource) and a `title` (the "name" of the resource as defined
in the manifests) of `backup-blah@backup.koumbit.net`:

    curl -s -X POST http://localhost:8080/pdb/query/v4 \
        -H 'Content-Type:application/json' \
        -d '{"query": "resources { type = \"Backupninja::Server::Account\" and title = \"backup-blah@backup.koumbit.net\" }"}' \
        | jq . | less -SR

TODO: update the above query to match resources actually in use at
TPO. That example is from koumbit.org folks.
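
In the meantime, here is a sketch of a more generic query that lists
exported resources of a given type, relying on the `exported` field of
the PQL `resources` entity (adjust the type and fields as needed):

    curl -s -X POST http://localhost:8080/pdb/query/v4 \
        -H 'Content-Type:application/json' \
        -d '{"query": "resources[certname, title] { exported = true and type = \"File\" }"}' \
        | jq .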

### Examining a Puppet catalog

It can sometimes be useful to examine a node's catalog in order to
determine if certain resources are present, or to view a resource's
full set of parameters.

#### List resources by type

To list all `service` resources managed by Puppet on a node, the
command below may be executed on the node itself:

    puppet catalog select --terminus rest "$(hostname -f)" service

At the end of the command line, `service` may be replaced by any
built-in resource type such as `file` or `cron`. Defined resource
names may also be used here, like `ssl::service`.
#### View/filter full catalog

To extract a node's full catalog in JSON format and save it to a file
for further processing:

    puppet catalog find --terminus rest "$(hostname -f)" > catalog.json

The output can then be manipulated using `jq` to extract more precise
information. For example, to list all resources of a specific type:

    jq '.resources[] | select(.type == "File") | .title' < catalog.json

To list all classes in the catalog:

    jq '.resources[] | select(.type=="Class") | .title' < catalog.json

To display a specific resource selected by title:

    jq '.resources[] | select((.type == "File") and (.title=="sources.list.d"))' < catalog.json

More examples can be found in this [blog post](http://web.archive.org/web/20210122003128/https://alexharv074.github.io/puppet/2017/11/30/jq-commands-for-puppet-catalogs.html).
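
One more that is often handy: counting resources by type, to survey a
large catalog at a glance:

    jq '.resources | group_by(.type) | map({type: .[0].type, count: length}) | sort_by(-.count)' < catalog.json
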
### catalog run: PuppetDB warning: did not update since \[...\]
If you see an error like:

    Check last node runs from PuppetDB WARNING - cupani.torproject.org did not update since 2020-05-11T04:38:54.512Z

It may eventually be accompanied by the Puppet server reporting
the same problem:

    Subject: ** PROBLEM Service Alert: pauli/puppet - all catalog runs is WARNING **
    [...]
    Check last node runs from PuppetDB WARNING - cupani.torproject.org did not update since 2020-05-11T04:38:54.512Z

One of the following is happening, in decreasing order of likelihood:

 1. the node's Puppet manifest has an error of some sort that makes it
    impossible to run the catalog
 2. the node is down and has failed to report since the last time
    specified
 3. the Puppet **server** is down and **all** nodes will fail to
    report in the same way (in which case a lot more warnings will
    show up, and other warnings about the server will come in)

The first situation will usually happen after someone pushed a commit
introducing the error. We try to keep all manifests compiling all the
time and such errors should be immediately fixed. Look at the history
of the Puppet source tree and try to identify the faulty
commit. Reverting such a commit is acceptable to restore the service.

The second situation can happen if a node is in maintenance for an
extended duration. Normally, the node will recover when it goes back
online. If a node is to be permanently retired, it should be removed
from Puppet, using the [host retirement procedures](howto/retire-a-host).

Finally, if the main Puppet **server** is down, it should definitely
be brought back up. See disaster recovery, below.

In any case, running the Puppet agent on the affected node should give
more information:

    ssh NODE puppet agent -t
### Problems pushing to the Puppet server

If you get this error when pushing commits to the Puppet server:

    error: remote unpack failed: unable to create temporary object directory

... or, longer version:

    anarcat@curie:tor-puppet$ LANG=C git push 
    Enumerating objects: 7, done.
    Counting objects: 100% (7/7), done.
    Delta compression using up to 4 threads
    Compressing objects: 100% (3/3), done.
    Writing objects: 100% (4/4), 772 bytes | 772.00 KiB/s, done.
    Total 4 (delta 2), reused 0 (delta 0), pack-reused 0
    error: remote unpack failed: unable to create temporary object directory
    To puppet.torproject.org:/srv/puppet.torproject.org/git/tor-puppet
     ! [remote rejected]   master -> master (unpacker error)
    error: failed to push some refs to 'puppet.torproject.org:/srv/puppet.torproject.org/git/tor-puppet'
    anarcat@curie:tor-puppet[1]$

It's because you're not using the `git` role account. Update your
remote URL configuration to use `git@puppet.torproject.org` instead,
with:

    git remote set-url origin git@puppet.torproject.org:/srv/puppet.torproject.org/git/tor-puppet.git

This is because we have switched to a role user for pushing changes to
the Git repository, see [issue 29663][] for details.

[issue 29663]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29663
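
To verify that the remote is now correct:

    git remote -v
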
### Error: The CRL issued by 'CN=Puppet CA: pauli.torproject.org' has expired

This error causes the Puppet agent to abort its runs.

Check the expiry date of the Puppet CRL file at `/var/lib/puppet/ssl/crl.pem`:

    cumin '*' 'openssl crl -in /var/lib/puppet/ssl/crl.pem -text | grep "Next Update"'

If the date is in the past, the node won't be able to get a catalog from the
Puppet server.

An up-to-date CRL may be retrieved from the Puppet server and installed as such:

    curl --silent --cert /var/lib/puppet/ssl/certs/$(hostname -f).pem \
      --key /var/lib/puppet/ssl/private_keys/$(hostname -f).pem \
      --cacert /var/lib/puppet/ssl/certs/ca.pem \
      --output /var/lib/puppet/ssl/crl.pem \
      "https://puppet:8140/puppet-ca/v1/certificate_revocation_list/ca?environment=production"

TODO: shouldn't the Puppet agent be updating the CRL on its own?

### Puppet server CA renewal

TODO: no procedure established yet, some thoughts:

https://dev.to/betadots/extending-puppet-ca-38l8

The `installer/puppet-bootstrap-client` in `fabric-tasks.git` must also be
updated.

This is not expected to happen before year 2039.

## Disaster recovery

Ideally, the main Puppet server would be deployable from Puppet
bootstrap code and the [main installer](new-machine). But in practice, much of
its configuration was done manually over the years and it MUST be
restored from [backups](backup) in case of failure.

This probably includes a restore of the [PostgreSQL](postgresql) database
backing the PuppetDB server as well. It's *possible* this step *could*
be skipped in an emergency, because most of the information in
PuppetDB is a cache of exported resources, reports and facts. But it
could also break hosts and make converging the infrastructure
impossible, as there might be dependency loops in exported resources.

In particular, the Puppet server needs access to the LDAP server, and
that is configured in Puppet. So if the Puppet server needs to be
rebuilt from scratch, it will need to be manually allowed access to
the LDAP server to compile its manifest.

So it is strongly encouraged to restore the PuppetDB server database
as well in case of disaster.

This also applies in case of an IP address change of the Puppet
server, in which case access to the LDAP server needs to be manually
granted before the configuration can run and converge. This is a known
bootstrapping issue with the Puppet server and is further discussed in
the [design section](#ldap-integration).

# Reference

This section documents generally how things are set up.

Setting up a new Puppet server from scratch is not supported, or, to
be more accurate, would be somewhat difficult. The server expects
various external services to populate it with data, in particular:

 * it [fetches data from LDAP](#ldap-integration)
 * [Nagios generates the NRPE configuration](#nagios-integration)
 * the [letsencrypt repository manages the TLS certificates](#lets-encrypt-tls-certificates)

The auto-ca component is also deployed manually, and so are the git
hooks, repositories and permissions.

This needs to be documented, automated and improved. Ideally, it
should be possible to install a new Puppet server from scratch using
nothing but a Puppet bootstrap manifest, see [issue 30770][] and
[issue 29387][], along with [discussion about those improvements in
this page](#proposed-solution), for details.

[issue 30770]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30770

No formal SLA is defined. Puppet runs from a fairly slow `cron` job so
it doesn't have to be highly available right now. This could change in
the future if we rely more on it for deployments.

## Design

The Puppet master currently lives on `pauli`. That server
was set up in 2011 by weasel. It follows the configuration of the
Debian Sysadmin (DSA) Puppet server, which has its source code
available in the [dsa-puppet repository](https://salsa.debian.org/dsa-team/mirror/dsa-puppet/).

PuppetDB, which was previously hosted on `pauli`, now runs on its own dedicated
machine `puppetdb-01`. Its configuration and PostgreSQL database are managed by
the `profile::puppetdb` and `role::puppetdb` class pair.

The service is maintained by TPA and manages *all* TPA-operated
machines. Ideally, all services are managed by Puppet, but
historically, only basic services were configured through Puppet,
leaving service admins responsible for deploying their services on top
of it. That tendency has shifted recently (~2020) with the deployment
of the [GitLab](gitlab) service through Puppet, for example.

The source code to the Puppet manifests (see below for a Glossary) is
managed through git on a repository hosted directly on the Puppet
server. Agents are deployed as part of the [install process](new-machine), and
talk to the central server using a Puppet-specific certificate
authority (CA).

As mentioned in the [installation section](#installation), the Puppet server
assumes a few components (namely [LDAP](ldap), [Nagios](nagios), [Let's
Encrypt](tls) and auto-ca) feed information into it. This is also
detailed in the sections below. In particular, Puppet acts as a
duplicate "source of truth" for some information about servers. For
example, LDAP has a "purpose" field describing what a server is for,
but Puppet also has the concept of a role, attributed through Hiera
(see [issue 30273][]). A similar problem exists with IP addresses and
user access control, in general.

[issue 30273]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30273

Puppet is generally considered stable, but the code base is somewhat
showing its age and has accumulated some technical debt.

For example, much of the Puppet code deployed is specific to Tor (and
DSA, to a certain extent) and therefore is only maintained by a
handful of people. It would be preferable to migrate to third-party,
externally maintained modules (e.g. [systemd](https://gitlab.torproject.org/tpo/tpa/team/-/issues/33449), but also many
others, see [issue 29387][] for details). A similar problem exists
with custom Ruby code implemented for various functions, which is
being replaced with Hiera ([issue 30020][]).

The Puppet infrastructure is kept up to date with the latest versions
in Debian, but will require some work to port to Puppet 6, as the
current deployment system ("puppetmaster") has been removed in that
new release (see [issue 33588][]).

[issue 33588]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/33588

### Glossary

This is a subset of the [Puppet glossary](https://puppet.com/docs/puppet/latest/glossary.html) to quickly get you
started with the vocabulary used in this document.

 * **Puppet node**: a machine (virtual or physical) running Puppet
 * **Manifest**: Puppet source code
 * **Catalog**: a set of compiled Puppet source which gets applied
   on a **node** by a **Puppet agent**
 * **Puppet agents**: the Puppet program that runs on all nodes to
   apply manifests
 * **Puppet server**: the server which all **agents** connect to to
   fetch their **catalog**, also known as a **Puppet master** in older
   Puppet versions (pre-6)
 * **Facts**: information collected by Puppet agents on nodes, and
   exported to the Puppet server
 * **Reports**: log of changes done on nodes recorded by the Puppet
   server
 * **[PuppetDB](https://puppet.com/docs/puppetdb/) server**: an application server on top of a PostgreSQL
   database providing an [API](https://www.puppet.com/docs/puppetdb/7/api/overview) to query various resources like node
   names, facts, reports and so on
### File layout

The Puppet master runs on `pauli.torproject.org`. That is where the main git
repository (`tor-puppet`) lives, in
`/srv/puppet.torproject.org/git/tor-puppet`. That repository has hooks to
populate `/etc/puppet`, which is the live checkout from which the Puppet server
compiles its catalogs.

All paths below are relative to the root of that git repository.

- `3rdparty/modules` contains modules that are shared publicly and do
  not contain any TPO-specific configuration. There is a `Puppetfile`
  there that documents where each module comes from and that can be
  maintained with [r10k][] or [librarian][].

  [librarian]: https://librarian-puppet.com/
  [r10k]: https://github.com/puppetlabs/r10k/

- `modules` includes the roles, profiles, and classes that make up
  the bulk of our configuration.

- each node is assigned a "role" through the ENC, in
  `hiera-enc/nodes/$FQDN.yaml`. To be more accurate, the ENC assigns a
  top-scope `$role` variable to each node, which is in turn used to
  include a `role::$rolename` class on each node. This occurs in the
  default node definition in `manifests/site.pp`.

  Some nodes include a list of classes, inherited from the previous
  Hiera-based setup, but we're in the process of transitioning all
  nodes to single role classes, see [issue 40030][] for progress on
  this work.

[issue 40030]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40030

- The `torproject_org` module
  (`modules/torproject_org/manifests/init.pp`) performs basic host
  initialisation, like configuring Debian mirrors and APT sources,
  installing a base set of packages, configuring puppet and timezone,
  setting up a bunch of configuration files and running `ud-replicate`.

- There is also the `hoster.yaml` file
  (`modules/torproject_org/misc/hoster.yaml`) which defines hosting
  providers and specifies things like which network blocks they use,
  if they have a DNS resolver or a Debian mirror. `hoster.yaml` is read
  by
  - the `nodeinfo()` function
    (`modules/puppetmaster/lib/puppet/parser/functions/nodeinfo.rb`),
    used for setting up the `$nodeinfo` variable
  - `ferm`'s `def.conf` template (`modules/ferm/templates/defs.conf.erb`)
- The root of definition and execution in Puppet is found in
  the `manifests/site.pp` file. Its purpose is to include a role class
  for the node as well as a number of other classes which are common
  to all nodes.

Note that the above is the current state of the file hierarchy. As
part of the Hiera transition ([issue 30020][]), a lot of the above
architecture will change in favor of the more standard
[role/profile/module][] pattern.

Note that this layout might also change in the future with the
introduction of a role account ([issue 29663][]) and when/if the
repository is made public (which requires changing the layout).

See [ticket #29387][] for an in-depth discussion.

[issue 29387]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387
[role/profile/module]: https://puppet.com/docs/pe/2017.2/r_n_p_intro.html
[ticket #29387]: https://bugs.torproject.org/29387
[issue 30020]: https://bugs.torproject.org/30020

### Installed packages facts

The `modules/torproject_org/lib/facter/software.rb` file defines our
custom facts, making it possible to get answers to questions like "Is
this host running `apache2`?" by simply looking at a Puppet variable.

Those facts are deprecated: packages should be installed through
Puppet instead of manually on hosts.

### Style guide

Puppet manifests should generally follow the [Puppet style
guide][]. This can be easily done with [Flycheck][] in Emacs,
[vim-puppet][], or a similar plugin in your favorite text editor.

Many files do not *currently* follow the style guide, as they
*predate* the creation of said guide. Files should *not* be completely
reformatted unless there's a good reason. For example, if a
conditional covering a large part of a file is removed and the file
needs to be re-indented, it's a good opportunity to fix style in the
file. The same applies if a file is split in two components or
completely rewritten for some other reason.

Otherwise the style already in use in the file should be followed.

[Puppet style guide]: https://puppet.com/docs/puppet/4.8/style_guide.html
[Flycheck]: http://flycheck.org/
[vim-puppet]: https://github.com/rodjek/vim-puppet

### External Node Classifier (ENC)

We use an External Node Classifier (or ENC for short) to classify
nodes in different roles but also assign them environments and other
variables. The way the ENC works is that the Puppet server requests
information from the ENC about a node before compiling its catalog.

The Puppet server pulls three elements about nodes from the ENC:

 * `environment` is the standard way to assign nodes to a Puppet
   environment. The default is `production` which is the only
   environment currently deployed.

 * `parameters` is a hash where each key is made available as a
   top-scope variable in a node's manifests. We use this to assign a
   unique "role" to each node. The way this works is that, for a given
   role `foo`, a class `role::foo` will be included. That class should
   only consist of a set of profile classes.

 * `classes` is an array of class names which Puppet includes on the
   target node. We are currently transitioning from this method of
   including classes on nodes (previously in Hiera) to the `role`
   parameter and unique role classes.

For a given node named `$fqdn`, these elements are defined in
`tor-puppet.git/hiera-enc/nodes/$fqdn.yaml`. Defaults can also be set
in `tor-puppet.git/hiera-enc/nodes/default.yaml`.
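
Putting the three elements together, a node entry might look like this
(hypothetical node and role):

    # hiera-enc/nodes/test-01.torproject.org.yaml
    environment: production
    parameters:
      role: monitoring
    classes: []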

#### Role classes

Each host defined in the ENC declares which unique role it should be
attributed through the `parameters` hash. For example, this is what
configures a GitLab runner:

    parameters:
      role: gitlab::runner

Roles should be *abstract* and *not* implementation specific. Each
role class includes a set of profiles which *are* implementation
specific. For example, the `monitoring` role includes
`profile::prometheus::server` and `profile::grafana`.
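
A role class is therefore mostly a list of profile includes. A minimal
sketch of what that example could look like (file path assumed):

    # modules/role/manifests/monitoring.pp (sketch)
    class role::monitoring {
      include profile::prometheus::server
      include profile::grafana
    }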

As a temporary exception to this rule, old modules can be included as
we transition from the Hiera mechanism, but eventually those should
be ported to shared modules from the Puppet forge, with our glue built
into a profile on top of the third-party module. The role
`role::gitlab` follows that pattern correctly. See [issue 40030][] for
progress on that work.


### Hiera

[Hiera][] is a "key/value lookup tool for configuration data" which
Puppet uses to look up values for class parameters and node
configuration in general.

We are in the process of transitioning to this mechanism from our
previous custom YAML lookup system. This section documents the way we
currently use Hiera.

[Hiera]: https://puppet.com/docs/hiera/

#### Common configuration

Class parameters which are common across several or all roles can be
defined in `hiera/common.yaml` to avoid duplication at the role level.
However, unless a parameter can be expected to change or evolve over
time, it's sometimes preferable to hardcode it directly in profile
classes, to keep this dataset from growing too much, which can impact
performance of the Puppet server and degrade readability. In other
words, it's OK to place site-specific data in profile manifests, as
long as it may never or very rarely change.

These parameters can be overridden by role and node configurations.

#### Role configuration

Class parameters specific to a certain node role are defined in
`hiera/roles/${::role}.yaml`. This is the principal method by which we
configure the various profiles, thus shaping each of the roles we
maintain.

These parameters can be overridden by node-specific configurations.

#### Node configuration

On top of the role configuration, some node-specific configuration can
be performed from Hiera. This should be avoided as much as possible,
but sometimes there is just no other way. A good example is the
`build-arm-*` nodes, which included the following configuration:

    bacula::client::ensure: "absent"

This disables backups on those machines, which are normally configured
everywhere. This is done because they are behind a firewall and
therefore not reachable, an unusual condition in the network. Another
example is `nutans` which sits behind a NAT so it doesn't know its own
IP address. To export proper firewall rules, the allow address has
been overridden as such:

    bind::secondary::allow_address: 89.45.235.22

Those types of parameters are normally guessed automatically inside
modules' classes, but they can be overridden from Hiera.

Note: eventually *all* host configuration will be done here, but there
are currently still some configurations hardcoded in individual
modules. For example, the Bacula director is hardcoded in the `bacula`
base class (in `modules/bacula/manifests/init.pp`). That should be
moved into a class parameter, probably in `common.yaml`.

### Cron and scheduling

The Puppet agent is *not* running as a daemon, it's running through
good old `cron`.

Puppet runs on each node every four hours, with a random jitter of up
to two hours, so the actual interval is somewhere between four and six
hours.

This configuration is in `/etc/cron.d/puppet-crontab` and deployed by
Puppet itself, currently as part of the `torproject_org` module.
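
As an illustration only, the deployed entry is roughly equivalent to
this sketch (the actual file is generated by Puppet and may differ):

    # /etc/cron.d/puppet-crontab (sketch)
    # base interval of four hours, plus up to two hours of random delay
    0 */4 * * * root sleep $((RANDOM \% 7200)) && puppet agent --onetime --no-daemonize
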
### LDAP integration

The Puppet server is configured to talk to the LDAP server through a
few custom functions defined in
`modules/puppetmaster/lib/puppet/parser/functions`. The main plumbing
function is called `ldapinfo()` and connects to the LDAP server
through `db.torproject.org` over TLS on port 636. It takes a hostname
as an argument and will load all hosts matching that pattern under the
`ou=hosts,dc=torproject,dc=org` subtree. If the specified hostname is
the `*` wildcard, the result will be a hash of `host => hash` entries,
otherwise only the `hash` describing the provided host will be
returned.

The `nodeinfo()` function uses that function to populate the global
`$nodeinfo` hash, or, more specifically, its
`$nodeinfo['ldap']` component. It also loads the `$nodeinfo['hoster']`
value from the `whohosts()` function. That function, in turn, tries to
match the IP address of the host against the "hosters" defined in the
`hoster.yaml` file.

The `allnodeinfo()` function does a similar task as `nodeinfo()`,
except that it loads *all* nodes from LDAP, into a single hash. It
does *not* include the "hoster" and is therefore equivalent to calling
`nodeinfo()` on each host and extracting only the `ldap` member hash
(although it is not implemented that way).
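
As an illustration, based on the description above, a manifest could
call the plumbing function roughly like this (a sketch only; the
authoritative signatures are in the functions' source under
`modules/puppetmaster`):

    # hash describing this host's LDAP entry
    $host_entry = ldapinfo($::hostname)
    # hash of host => entry covering every host in LDAP
    $all_hosts = ldapinfo('*')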

Puppet does not require any special credentials to access the LDAP
server. It accesses the LDAP database anonymously, although there is a
firewall rule (defined in Puppet) that grants it access to the LDAP
server.

There is a bootstrapping problem here: if one were to rebuild the
Puppet server, it would actually fail to compile its catalog, because
it would not be able to connect to the LDAP server to fetch host
information, unless the LDAP server had been manually configured to
let the Puppet server through.

NOTE: much (if not all?) of this is being moved into Hiera, in
particular the YAML files. See [issue 30020](https://trac.torproject.org/projects/tor/ticket/30020) for details. Moving
the host information into Hiera would resolve the bootstrapping
issues, but would require, in turn, some more work to resolve
questions like how users get granted access to individual hosts, which
is currently managed by `ud-ldap`. We cannot, therefore, simply move
host information from LDAP into Hiera without creating a duplicate
source of truth, unless we also rebuild or tweak the user distribution
system. See also the [LDAP design document](ldap#Design) for more
information about how LDAP works.

### Nagios integration

Nagios (which is really Icinga, but let's call it Nagios because
that's what it's called everywhere in the source) is hooked into Puppet
through an external sync system. Our [Nagios deployment](nagios) operates
through Git hooks which run a special `Makefile` that compiles and
deploys the Icinga configuration, but also compiles the client-side
NRPE configuration.

The NRPE configuration is generated on the Nagios server and then
pushed to the Puppet server with `rsync` over SSH, using a public key
distributed by Puppet from the `roles::puppetmaster` class. That key
has a restricted `command` field which limits access to the Puppet
manifest, in this single file:

    /etc/puppet/modules/nagios/files/tor-nagios/generated/nrpe_tor.cfg

This file then gets distributed to all nodes through the
`nagios::client` class using a simple `File` resource.
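
In other words, the client side boils down to something like this
sketch (destination path and service name assumed):

    # sketch of the File resource in nagios::client
    file { '/etc/nagios/nrpe.d/nrpe_tor.cfg':
      source => 'puppet:///modules/nagios/tor-nagios/generated/nrpe_tor.cfg',
      notify => Service['nagios-nrpe-server'],
    }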

So when a Nagios check is added or changed, Puppet needs to run on all
the affected hosts for the check to take effect, on top of, of course,
adding the check to the Nagios git repository.

### Let's Encrypt TLS certificates

Public TLS certificates, as issued by Let's Encrypt, are distributed
by Puppet. Those certificates are generated by the "letsencrypt" Git
repository (see the [TLS documentation](tls) for details on that
workflow). The relevant part, as far as Puppet is concerned, is that
certificates magically end up in the following directory when a
certificate is issued or (automatically) renewed:

    /srv/puppet.torproject.org/from-letsencrypt

See also the [TLS deployment docs](tls#lets-encrypt-workflow) for how that directory gets
populated.

Normally, those files would not be available from the Puppet
manifests, but the `ssl` Puppet module uses a special trick whereby
those files are read by Puppet `.erb` templates. For example, this is
how `.crt` files get generated on the Puppet master, in
`modules/ssl/templates/crt.erb`:

    <%=
      fn = "/srv/puppet.torproject.org/from-letsencrypt/#{@name}.crt"
      out = File.read(fn)
      out
    %>

Similar templates exist for the other files.

Those certificates should not be confused with the "auto-ca" TLS
certificates in use internally and which are deployed directly in
`/etc/puppet/modules/ssl/files/`, see below.

### Internal auto-ca TLS certificates

The Puppet server also manages an internal CA which we informally call
"auto-ca". Those certificates are internal in that they are used to
authenticate nodes to each other, not to the public. They are used, for
example, to encrypt connections between mail servers (in Postfix) and
[backup servers](backup) (in Bacula).

The auto-ca deploys those certificates directly inside the Puppet
server checkout, in `/etc/puppet/modules/ssl/files/certs/` and
`.../clientcerts/`. Details of that system are available in the [TLS documentation](tls#internal-auto-ca).

## Issues

There is no issue tracker specifically for this project. [File][] or
[search][] for issues in the [team issue tracker][search] with the
~Puppet label.

 [File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
 [search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues?label_name%5B%5D=Puppet

## Monitoring and testing

Puppet is hooked into Nagios in two ways:

 * one job runs on the Puppetmaster and checks PuppetDB for
   reports. This was done with a [patched](https://github.com/evgeni/check_puppetdb_nodes/pull/14) version of the
   [check_puppetdb_nodes](https://github.com/evgeni/check_puppetdb_nodes/) Nagios check, now packaged inside the
   `tor-nagios-checks` Debian package
 * the same job actually runs twice: once to check all manifests, and
   another time to check each host individually and assign the result
   to the right host

The twin checks are present so that we can find stray Puppet hosts,
for example if a host was retired from Nagios but not retired from
Puppet, or added to Nagios but not Puppet.

Note that we exclude some errors from the logs because we've been
having intermittent failures with PuppetDB since the Debian 10.12
"buster" point release on March 26 (see [issue
tpo/tpa/team#40699](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40699)). We speculate this issue will go away when the
PuppetDB package is fixed ([tpo/tpa/team#40707](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40707)).

The `check_puppetdb_nodes` check was originally [deployed in March
2019](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29676). An alternative is the [check_puppet_agent](https://github.com/aswen/nagios-plugins/blob/master/check_puppet_agent) Nagios
check, which was also recently (2022) added to the `tor-nagios.git`
repository but never actually used, as the puppetdb check seems
sufficient. It could, however, be used to replace the above (to a
certain extent) if we (for example) need to get rid of PuppetDB for
some reason.

An alternative implementation [using Prometheus](https://forge.puppet.com/puppet/prometheus_reporter) was considered but
[Prometheus still hasn't replaced Nagios](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864) at the time of writing.

There are no validation checks and *a priori* no peer review of code:
code is directly pushed to the Puppet server without validation. Work
is being done to [implement automated checks](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31226) but that is only
being deployed on the client side for now, and voluntarily. See the
[Validating Puppet code section](#validating-puppet-code) above.

PuppetDB exposes a performance dashboard which is accessible over the
web. To reach it, first establish an SSH port forward to `puppetdb-01`
on port 8080, as shown below, and point your browser at
http://localhost:8080/pdb/dashboard/index.html
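
For example, assuming your SSH configuration already grants access to
the machine:

    ssh -L 8080:localhost:8080 puppetdb-01.torproject.org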

PuppetDB itself also holds performance information about the Puppet agent runs,
which are called "reports". Those reports contain information about changes
operated on each server, how long the agent runs take and so on. Those metrics
could be made more visible by using a dashboard, but that has not been
implemented yet (see [issue 31969][]).

[issue 31969]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31969

The Puppet server, Puppet agents and PuppetDB keep logs of their
operations. The latter keeps its logs in `/var/log/puppetdb/` for a
maximum of 90 days or 1GB, whichever comes first (configured in
`/etc/puppetdb/request-logging.xml` and
`/etc/puppetdb/logback.xml`). The other logs are sent to `syslog`, and
usually end up in `daemon.log`.

Puppet should hold minimal personally identifiable information, like
user names, user public keys and project names.

## Other documentation

 * [Latest Puppet docs](https://puppet.com/docs/puppet/latest/puppet_index.html) - might be too new, see also the [Puppet
   5.5 docs](https://puppet.com/docs/puppet/5.5/puppet_index.html)
   * [Function reference](https://puppet.com/docs/puppet/latest/function.html)
   * [Type reference](https://puppet.com/docs/puppet/latest/type.html)
 * [Mapping between versions of Puppet Enterprise, Facter, Hiera, Agent, etc](https://puppet.com/docs/pe/2019.0/component_versions_in_recent_pe_releases.html)

# Discussion

This section goes more in depth into how Puppet is setup, why it was
setup the way it was, and how it could be improved.

## Overview

Our Puppet setup dates back to 2011, according to the git history,
and was probably based off the [Debian System Administrator's Puppet
codebase](https://salsa.debian.org/dsa-team/mirror/dsa-puppet) which dates back to 2009.

## Goals

The general goal of Puppet is to provide basic automation across the
architecture, so that software installation and configuration, file
distribution, user and some service management are done from a central
location, managed in a git repository. This approach is often called
[Infrastructure as code](https://en.wikipedia.org/wiki/Infrastructure_as_Code).

This section also documents possible improvements to our Puppet
configuration that we are considering.

 * **secure**: only sysadmins should have access to push configuration,
   whatever happens. This includes deploying only audited and verified
   Puppet code into production.
 * **code review**: changes on servers should be verifiable by our peers,
   through a git commit log
 * **fix permissions issues**: deployment system should allow all admins
   to push code to the puppet server without having to constantly fix
   permissions (e.g. through a [role account](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29663))
 * **secrets handling**: there are some secrets in Puppet. those
   should remain secret.

We mostly have this now, although there are concerns about permissions
being wrong sometimes, which a role account could fix.

Those are mostly issues with the current architecture we'd like to fix:

 * **Continuous Integration**: before deployment, code should be vetted by
   a peer and, ideally, automatically checked for errors and tested
 * **single source of truth**: when we add/remove nodes, we should not
   have to talk to multiple services (see also the [install automation
   ticket](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31239) and the [new-machine discussion](new-machine#discussion))
 * **collaboration** with other sysadmins outside of TPA, for which we
   would need to...
 * ... **publicize our code** (see [ticket 29387](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387))
 * **no manual changes**: every change on every server should be committed
   to version control somewhere
 * **bare-metal recovery**: it should be possible to recover a service's
   *configuration* from a bare Debian install with Puppet (and with
   data from the [backup](backup) service of course...)
 * **one commit only**: we shouldn't have to commit "twice" to get
   changes propagated (once in a submodule, once in the parent module,
   for example)
 * **ad hoc changes** to the infrastructure. One-off jobs should be
   handled by [fabric](fabric), Cumin, or straight SSH.

## Approvals required

TPA should approve policy changes as per [tpa-rfc-1](/policy/tpa-rfc-1-policy).

## Proposed Solution

To improve on the above "Goals", I would suggest the following
configuration.

TL;DR:

 0. publish our repository (tpo/tpa/team#29387)
 1. Use a control repository
 2. Get rid of `3rdparty`
 3. Deploy with `g10k`
 4. Authenticate with checksums
 5. Deploy to branch-specific environments (tpo/tpa/team#40861)
 6. Rename the default branch "production"
 7. Push directly on the Puppet server
 8. Use a role account (tpo/tpa/team#29663)
 9. Use local test environments
 10. Develop a test suite
 11. Hook into CI
 12. OpenPGP verification and web hook

Steps 1-8 could be implemented without too much difficulty and should
be a mid term objective. Steps 9 to 12 require significantly more work
and could be implemented once the new infrastructure stabilizes.

What follows is an explanation and justification of each step.

### Publish our repository

Right now our Puppet repository is private, because there's
sensitive information in there. The goal of this step is to make sure
we can safely publish our repository without risking disclosing
secrets.

Secret data is currently stored in Trocla, and we should keep using it
for that purpose. That would avoid having to mess around splitting the
repository in multiple components in the short term.

This is the data that needs to be moved into Trocla at the time of writing:

 * `modules/postfix/files/virtual` - email addresses
 * `modules/postfix/files/access-1-sender-reject` and related - email addresses
 * sudoers configurations?
 * secrets in /etc/puppet (hopefully not in git, but just in case)

A full audit should be redone before this is completed.
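
As an illustration, assuming the `trocla` command-line tool and its
Puppet integration are available (the key name is hypothetical and
exact invocations may vary):

    # create (or fetch) a random secret for a given key, on the Puppet server
    trocla create profile::postfix::some_secret plain
    # in a manifest, the same secret can then be looked up with:
    #   $secret = trocla('profile::postfix::some_secret', 'plain')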

### Use a control repository

The base of the infrastructure is a [control-repo](https://puppet.com/docs/pe/latest/control_repo.html) ([example](https://github.com/puppetlabs/control-repo),
[another more complex example](https://github.com/example42/psick))
which chain-loads all the other modules. This implies turning all our
"modules" into "profiles" and moving "real" modules (which are fit for
public consumption) "outside", into public repositories (see also
[issue 29387: publish our puppet repository](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387)).

Note that the control repository *could* also be public: we could
simply have the private data inside of Hiera or some other private
repository.

The control repository concept is specific to the proprietary version
of Puppet (Puppet Enterprise or PE) but its logic should be usable
with the open source Puppet release as well.

### Get rid of 3rdparty

The control repo's core configuration file is the `Puppetfile`. We
already use a Puppetfile, but only to manage modules inside of the
`3rdparty` directory. Now it would manage *all* modules, or, more
specifically, `3rdparty` would become the default `modules` directory
which would, incidentally, encourage us to upstream our modules and
publish them to the world.
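
Concretely, the `Puppetfile` would pin each module, as in this
hypothetical excerpt (module names and versions are examples only):

    # Puppetfile (sketch)
    mod 'puppetlabs/stdlib', '8.1.0'
    mod 'systemd',
      :git => 'https://github.com/voxpupuli/puppet-systemd',
      :tag => 'v3.5.0'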

Our current `modules` directory would move into `site-modules`, which
is the designated location for "roles, profiles, and custom
modules". This has been suggested before in [issue 29387: publish our
puppet repository](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387)) and is important for the `Puppetfile` to do its
job.

In other words, this is the checklist:

 * [x] convert everything to hiera (tpo/tpa/team#30020) - this
       requires creating `roles` for each machine (more or less) --
       effectively done as far as this issue is concerned
 * [ ] sanitize repository (tpo/tpa/team#29387)
 * [ ] move `3rdparty` modules into `modules/`

Once this is done, the final picture will look like this in `/etc/puppet`:

 * `hiera/` - private data. `machine -> role` assignments, secret
   stuff like the alias file, machine location, price and other
   similar metadata and details (see also legacy/trac#29816)
   
 * `modules/` - equivalent of the current `3rdparty/` directory: fully
   public, reusable code that's aimed at collaboration. mostly code
   from the Puppet forge or our own repository if no equivalent there

 * `site-modules/profiles/` - magic sauce on top of the 3rd party
   `modules/`; we have already created a few profiles for Grafana and
   Prometheus, which configure official 3rd party classes with our
   site-specific criteria

 * `site-modules/roles/` - abstract classes that regroup a few
   profiles. For example, `roles::monitoring` could currently include
   `profiles::nagiosmaster`, `profiles::prometheus::server` and
   `profiles::grafana` as an implementation

 * `site-modules/MODULE/` - remaining custom modules that still need
   to be published (by moving them into their own repository and
   `modules/`, or by replacing them with an existing module in
   `modules/`)

This could all be done in the current repository, without creating a
new repository with a clean history, but it would prepare us for that
final step. And that step would simply be to move `modules/`, `profiles/`,
and `roles/` into a public repository, while keeping `hiera/` private
in its own repository.

### Deploy with g10k

It seems clear that everyone is converging on the use of a
`Puppetfile` to deploy code. While there are still monorepos out
there, they do make our life harder, especially when we need to
operate on non-custom modules.