To limit the logs to the last day only:
journalctl -t puppet-agent --since=-1d
### Running Puppet by hand and logging
When a Puppet manifest is not behaving as it should, the first step is
to run it by hand on the host:
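    puppet agent -t

(`-t` is shorthand for `--test`, which runs the agent once in the foreground with verbose output and applies the catalog immediately.)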
If that doesn't yield enough information, you can see pretty much
everything Puppet does with the `--debug` flag. This will, for
example, include the `onlyif` commands of `Exec` resources and let you
see why they do not work correctly (a common problem):
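    puppet agent -t --debug

Beware that the `--debug` output is very verbose; piping it through a pager or redirecting it to a file helps.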
Finally, some errors show up only on the Puppet server: look in
`/var/log/daemon.log` there.
### Finding source of exported resources
Debugging exported resources can be hard: errors are reported by the Puppet
agent that collects the resources, but they don't tell us which host exported
the conflicting resource.
To get further information, we can poke around the underlying database or we can
ask PuppetDB.
#### with SQL queries
Connecting to the PuppetDB database itself can sometimes be easier
than operating the API. There you can inspect the entire thing
as a normal SQL database; use this to connect:
sudo -u postgres psql puppetdb
Exported resources sometimes do surprising things, and it is
useful to look at the actual PuppetDB to figure out which tags
exported resources carry. For example, this query lists all exported
resources with `troodi` in the name:
SELECT certname_id,type,title,file,line,tags FROM catalog_resources WHERE exported = 't' AND title LIKE '%troodi%';
Keep in mind that there are [automatic tags](https://puppet.com/docs/puppet/6.4/lang_tags.html) in exported resources
which can complicate things.
#### with PuppetDB
This query will look for exported resources with the `type`
`Bacula::Director::Client` (which can be a class, define, or builtin resource)
and match a `title` (the unique "name" of the resource as defined in the
manifests), like in the above SQL example, that contains `troodi`:
curl -s -X POST http://localhost:8080/pdb/query/v4 \
-H 'Content-Type:application/json' \
-d '{"query": "resources { exported = true and type = \"Bacula::Director::Client\" and title ~ \".*troodi.*\" }"}' \
| jq . | less -SR
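Each result returned by the `resources` endpoint carries a `certname` field identifying the node that exported the resource, which is usually the answer we're looking for. A slightly different jq filter (field names as documented for the v4 `resources` endpoint) narrows the output down to that:

    curl -s -X POST http://localhost:8080/pdb/query/v4 \
      -H 'Content-Type:application/json' \
      -d '{"query": "resources { exported = true and type = \"Bacula::Director::Client\" and title ~ \".*troodi.*\" }"}' \
      | jq '.[] | {certname, type, title, file, line}'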
### Finding all instances of a deployed resource
Say you want to [deprecate cron](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41303) and want to see where the `Cron`
resource is used, to understand how hard the problem is.
This will show you the resource titles and how many instances of each
there are:
SELECT count(*),title FROM catalog_resources WHERE type = 'Cron' GROUP BY title ORDER by count(*) DESC;
Example output:
puppetdb=# SELECT count(*),title FROM catalog_resources WHERE type = 'Cron' GROUP BY title ORDER by count(*) DESC;
count | title
-------+---------------------------------
87 | puppet-cleanup-clientbucket
81 | prometheus-lvm-prom-collector-
9 | prometheus-postfix-queues
6 | docker-clear-old-images
5 | docker-clear-nightly-images
5 | docker-clear-cache
5 | docker-clear-dangling-images
2 | collector-service
2 | onionoo-bin
2 | onionoo-network
2 | onionoo-service
2 | onionoo-web
2 | podman-clear-cache
2 | podman-clear-dangling-images
2 | podman-clear-nightly-images
2 | podman-clear-old-images
1 | update rt-spam-blocklist hourly
1 | update torexits for apache
1 | metrics-web-service
1 | metrics-web-data
1 | metrics-web-start
1 | metrics-web-start-rserve
1 | metrics-network-data
1 | rt-externalize-attachments
1 | tordnsel-data
1 | tpo-gitlab-backup
1 | tpo-gitlab-registry-gc
1 | update KAM ruleset
(28 rows)
A more exhaustive list of each resource and where it's declared:
SELECT certname_id,type,title,file,line,tags FROM catalog_resources WHERE type = 'Cron';
Which host uses which resource:
SELECT certname,title FROM catalog_resources JOIN certnames ON certname_id=certnames.id WHERE type = 'Cron' ORDER BY certname;
Top 10 hosts using the resource:
puppetdb=# SELECT certname,count(title) FROM catalog_resources JOIN certnames ON certname_id=certnames.id WHERE type = 'Cron' GROUP BY certname ORDER BY count(title) DESC LIMIT 10;
certname | count
-----------------------------------+-------
meronense.torproject.org | 7
forum-01.torproject.org | 7
ci-runner-x86-02.torproject.org | 7
onionoo-backend-01.torproject.org | 6
onionoo-backend-02.torproject.org | 6
dangerzone-01.torproject.org | 6
btcpayserver-02.torproject.org | 6
chi-node-14.torproject.org | 6
rude.torproject.org | 6
minio-01.torproject.org | 6
(10 rows)
### Examining a node's catalog

It can sometimes be useful to examine a node's catalog in order to
determine whether certain resources are present, or to view a resource's
full set of parameters.
To list all `service` resources managed by Puppet on a node, the
command below may be executed on the node itself:
puppet catalog select --terminus rest "$(hostname -f)" service
At the end of the command line, `service` may be replaced by any
built-in resource types such as `file` or `cron`. Defined resource
names may also be used here, like `ssl::service`.
To extract a node's full catalog in JSON format and save it for further processing:

puppet catalog find --terminus rest "$(hostname -f)" > catalog.json
The output can be manipulated using `jq` to extract more precise
information. For example, to list all resources of a specific type:
jq '.resources[] | select(.type == "File") | .title' < catalog.json
To list all classes in the catalog:
jq '.resources[] | select(.type=="Class") | .title' < catalog.json
To display a specific resource selected by title:
jq '.resources[] | select((.type == "File") and (.title=="sources.list.d"))' < catalog.json
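To get a quick overview of what the catalog contains, the resources can also be tallied by type (a small jq sketch, not taken from the blog post below):

    jq '[.resources[].type] | group_by(.) | map({type: .[0], count: length}) | sort_by(.count) | reverse' < catalog.json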
More examples can be found on this [blog post](http://web.archive.org/web/20210122003128/https://alexharv074.github.io/puppet/2017/11/30/jq-commands-for-puppet-catalogs.html).
### Examining agent reports
If you want to look into agent run errors that happened previously, for example
errors during the night that didn't recur on subsequent agent runs, you can use
PuppetDB's ability to store and query agent reports, then use jq to find the
information you're looking for in the report(s).

In this example, we'll first query for reports and save the output to a file.
We'll then filter the file's contents with jq. This approach lets you search
the report for details more efficiently, but don't forget to remove the
file once you're done.
Here we're grabbing the reports for the host `pauli.torproject.org` where
changes were made after a set date -- we expect to get only one report as
a result, but that might differ when you run the query:
curl -s -X POST http://localhost:8080/pdb/query/v4 \
-H 'Content-Type:application/json' \
-d '{"query": "reports { certname = \"pauli.torproject.org\" and start_time > \"2024-10-28T00:00:00.000Z\" and status = \"changed\" }" }' \
> pauli_catalog_what_changed.json
Note that the date must be formatted exactly as shown above, otherwise you
might get a very non-descriptive error like:
`parse error: Invalid numeric literal at line 1, column 12`
With the report in the file on disk, we can query for certain details.
To see what puppet did during the run:
jq .[].logs.data pauli_catalog_what_changed.json
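Each log entry in the report has fields like `time`, `level`, `source` and `message` (field names assumed from the reports endpoint format), so the output can be narrowed down, for example to hide `info`-level noise:

    jq '.[].logs.data[] | select(.level != "info") | {time, level, source, message}' pauli_catalog_what_changed.json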
For more information about what information is available in reports, check out
the [resource endpoint documentation](https://www.puppet.com/docs/puppetdb/8/api/query/v4/reports).
A Prometheus `PuppetCatalogStale` alert looks like this:
Stale Puppet catalog on test.torproject.org
One of the following is happening, in decreasing order of likelihood:
1. the node's Puppet manifest has an error of some sort that makes it
   impossible to run the catalog
2. the node is down and has failed to report since the last time
   specified
3. the node was retired but the monitoring or Puppet server doesn't
   know
4. the Puppet **server** is down and **all** nodes will fail to
   report in the same way (in which case a lot more warnings will
   show up, and other warnings about the server will come in)
The first situation will usually happen after someone pushed a commit
introducing the error. We try to keep all manifests compiling all the
time and such errors should be immediately fixed. Look at the history
of the Puppet source tree and try to identify the faulty
commit. Reverting such a commit is acceptable to restore the service.
The second situation can happen if a node is in maintenance for an
extended duration. Normally, the node will recover when it goes back
online. If a node is to be permanently retired, it should be removed
from Puppet, using the [host retirement procedures](howto/retire-a-host).
The third situation should not normally occur: when a host is retired
following the [retirement procedure](howto/retire-a-host), it's also retired from
Puppet. That should normally clean up everything, but reports
generated by the [Puppet reporter][] do actually stick around for 7
extra days. There's now a silence in the retirement procedure to hide
those alerts, but they will still be generated on host retirements.
Finally, if the main Puppet **server** is down, it should definitely
be brought back up. See disaster recovery, below.
In any case, running the Puppet agent on the affected node should give
more information:
ssh NODE puppet agent -t
The Puppet metrics are generated by the [Puppet reporter][], a plugin
deployed on the Puppet server (currently `pauli`) that accepts reports
from nodes and writes metrics in the node exporter's
"`textfile` collector" directory
(`/var/lib/prometheus/node-exporter/`). You can, for example, see the
metrics for the host `idle-fsn-01` like this:
```
root@pauli:~# cat /var/lib/prometheus/node-exporter/idle-fsn-01.torproject.org.prom
# HELP puppet_report Unix timestamp of the last puppet run
# TYPE puppet_report gauge
# HELP puppet_transaction_completed transaction completed status of the last puppet run
# TYPE puppet_transaction_completed gauge
# HELP puppet_cache_catalog_status whether a cached catalog was used in the run, and if so, the reason that it was used
# TYPE puppet_cache_catalog_status gauge
# HELP puppet_status the status of the client run
# TYPE puppet_status gauge
# Old metrics
# New metrics
puppet_report{environment="production",host="idle-fsn-01.torproject.org"} 1731076367.657
puppet_transaction_completed{environment="production",host="idle-fsn-01.torproject.org"} 1
puppet_cache_catalog_status{state="not_used",environment="production",host="idle-fsn-01.torproject.org"} 1
puppet_cache_catalog_status{state="explicitly_requested",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_cache_catalog_status{state="on_failure",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="failed",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="changed",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="unchanged",environment="production",host="idle-fsn-01.torproject.org"} 1
```
If something is off between reality and what the monitoring system
thinks, this file should be inspected for validity and its timestamp
checked. Those files should normally be updated every time the node
applies a catalog.
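A quick way to spot hosts whose metrics file has gone stale is to look for files older than a day in that directory on the Puppet server, for example:

    find /var/lib/prometheus/node-exporter -name '*.torproject.org.prom' -mtime +1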
Expired nodes should disappear from that directory after 7 days, as
defined in `/etc/puppet/prometheus.yaml`. The reporter is hooked into
the Puppet server through the `/etc/puppet/puppet.conf` file, with the
following line:
```
[master]
# ...
reports = puppetdb,prometheus
```
See also issue [#41639](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41639) for notes on the deployment of that
monitoring tool.
[Puppet reporter]: https://github.com/voxpupuli/puppet-prometheus_reporter
Note that this used to be monitored through Icinga and, until that is
fully retired, you might also see this error creep up instead:
Check last node runs from PuppetDB WARNING - cupani.torproject.org did not update since 2020-05-11T04:38:54.512Z
The playbook here is unchanged.
### Problems pushing to the Puppet server
If you get this error when pushing commits to the Puppet server:
error: remote unpack failed: unable to create temporary object directory
anarcat@curie:tor-puppet$ LANG=C git push
Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 4 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (4/4), 772 bytes | 772.00 KiB/s, done.
Total 4 (delta 2), reused 0 (delta 0), pack-reused 0
error: remote unpack failed: unable to create temporary object directory
To puppet.torproject.org:/srv/puppet.torproject.org/git/tor-puppet
! [remote rejected] master -> master (unpacker error)
error: failed to push some refs to 'puppet.torproject.org:/srv/puppet.torproject.org/git/tor-puppet'
anarcat@curie:tor-puppet[1]$
It's because you're not using the `git` role account. Update your
remote URL configuration to use `git@puppet.torproject.org` instead,
with:
git remote set-url origin git@puppet.torproject.org:/srv/puppet.torproject.org/git/tor-puppet.git
This is because we have switched to a role user for pushing changes to
the Git repository, see [issue 29663][] for details.
[issue 29663]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29663
### Error: The CRL issued by 'CN=Puppet CA: pauli.torproject.org' has expired
This error causes the Puppet agent to abort its runs.
Check the expiry date of the Puppet CRL file at `/var/lib/puppet/ssl/crl.pem`:
cumin '*' 'openssl crl -in /var/lib/puppet/ssl/crl.pem -text | grep "Next Update"'
If the date is in the past, the node won't be able to get a catalog from the
Puppet server.
An up-to-date CRL may be retrieved from the Puppet server and installed as follows:
curl --silent --cert /var/lib/puppet/ssl/certs/$(hostname -f).pem \
--key /var/lib/puppet/ssl/private_keys/$(hostname -f).pem \
--cacert /var/lib/puppet/ssl/certs/ca.pem \
--output /var/lib/puppet/ssl/crl.pem \
"https://puppet:8140/puppet-ca/v1/certificate_revocation_list/ca?environment=production"
TODO: shouldn't the Puppet agent be updating the CRL on its own?
### Puppet server CA renewal
TODO: no procedure established yet, some thoughts:
https://dev.to/betadots/extending-puppet-ca-38l8
The `installer/puppet-bootstrap-client` in `fabric-tasks.git` must also be
updated.
This is not expected to happen before year 2039.
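To confirm that, the CA certificate's expiry date can be checked on the server (or on any agent, assuming the standard `/var/lib/puppet/ssl` layout used elsewhere in this document):

    openssl x509 -enddate -noout -in /var/lib/puppet/ssl/certs/ca.pem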
### Failed systemd units on hosts
To check out what's happening with failed systemd units on a host:
systemctl --failed
You can, of course, run this check on all servers with [Cumin](howto/cumin):
cumin '*' 'systemctl --failed'
If you need further information you can dive into the logs of the units reported
by the command above:
journalctl -xeu failed-unit.service
## Disaster recovery

Ideally, the main Puppet server would be deployable from Puppet
bootstrap code and the [main installer](new-machine). But in practice, much of
its configuration was done manually over the years and it MUST be
restored from [backups](backup) in case of failure.
This probably includes a restore of the [PostgreSQL](postgresql) database
backing the PuppetDB server as well. It's *possible* this step *could*
be skipped in an emergency, because most of the information in
PuppetDB is a cache of exported resources, reports and facts. But it
could also break hosts and make converging the infrastructure
impossible, as there might be dependency loops in exported resources.
In particular, the Puppet server needs access to the LDAP server, and
that is configured in Puppet. So if the Puppet server needs to be
rebuilt from scratch, it will need to be manually allowed access to
the LDAP server to compile its manifest.
So it is strongly encouraged to restore the PuppetDB server database
as well in case of disaster.
This also applies in case of an IP address change of the Puppet
server, in which case access to the LDAP server needs to be manually
granted before the configuration can run and converge. This is a known
bootstrapping issue with the Puppet server and is further discussed in
the [design section](#ldap-integration).
# Reference
This documents how things are generally set up.
## Installation

Setting up a new Puppet server from scratch is not supported or, to
be more accurate, would be somewhat difficult. The server expects
various external services to populate it with data, in particular:
* it [fetches data from LDAP](#ldap-integration)
* the [letsencrypt repository manages the TLS certificates](#lets-encrypt-tls-certificates)
The auto-ca component is also deployed manually, and so are the git
hooks, repositories and permissions.
This needs to be documented, automated and improved. Ideally, it
should be possible to install a new Puppet server from scratch using
nothing but a Puppet bootstrap manifest, see [issue 30770][] and
[issue 29387][], along with [discussion about those improvements in
this page](#proposed-solution), for details.
[issue 30770]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30770
### Puppetserver gems
Our Puppet Server deployment depends on two important Ruby gems: `trocla`, for
secrets management, and `net-ldap` for LDAP data retrieval, for example via our
`nodeinfo()` custom Puppet function.
Puppet Server 7 and later rely on JRuby and an isolated Rubygems environment,
so we can't simply install them using Debian packages. Instead, we need to
use the `puppetserver gem` command to manually install the gems:
puppetserver gem install net-ldap trocla --no-doc
Then restart `puppetserver.service`.
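To confirm the gems are visible to the Puppet Server's JRuby environment, something like this should list them:

    puppetserver gem list | grep -E 'net-ldap|trocla'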
Starting from `trixie`, the `trocla-puppetserver` package will be available to
replace this manual deployment of the `trocla` gem.
## Upgrades
Puppet upgrades can be involved, as backwards compatibility between
releases is not always maintained. Worse, newer releases are not
always packaged in Debian. TPA, and @lavamind in particular, worked
really hard to package the Puppet 7 suite to Debian, which finally
shipped in Debian 12 ("bookworm"). Lavamind also packaged Puppet 8 for
trixie.
See [issue 33588][] for the background on this.
[issue 33588]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/33588
## SLA

No formal SLA is defined. Puppet runs on a fairly slow schedule (every
few hours) so it doesn't have to be highly available right now. This
could change in the future if we rely more on it for deployments.
## Design
The Puppet master currently lives on `pauli`. That server
was set up in 2011 by weasel. It follows the configuration of the
Debian Sysadmin (DSA) Puppet server, which has its source code
available in the [dsa-puppet repository](https://salsa.debian.org/dsa-team/mirror/dsa-puppet/).
PuppetDB, which was previously hosted on `pauli`, now runs on its own dedicated
machine `puppetdb-01`. Its configuration and PostgreSQL database are managed by
the `profile::puppetdb` and `role::puppetdb` class pair.
The service is maintained by TPA and manages *all* TPA-operated
machines. Ideally, all services are managed by Puppet, but
historically, only basic services were configured through Puppet,
leaving service admins responsible for deploying their services on top
of it. That tendency has shifted recently (~2020) with the deployment
of the [GitLab](gitlab) service through Puppet, for example.
The source code to the Puppet manifests (see below for a Glossary) is
managed through git on a repository hosted directly on the Puppet
server. Agents are deployed as part of the [install process](new-machine), and
talk to the central server using a Puppet-specific certificate
authority (CA).
As mentioned in the [installation section](#installation), the Puppet server
assumes a few components (namely [LDAP](ldap), [Let's Encrypt](tls) and
auto-ca) feed information into it. This is also detailed in the sections below.
In particular, Puppet acts as a duplicate "source of truth" for some
information about servers. For example, LDAP has a "purpose" field describing
what a server is for, but Puppet also has the concept of a role, attributed
through Hiera (see [issue 30273][]). A similar problem exists with IP addresses
and user access control, in general.
[issue 30273]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30273
Puppet is generally considered stable, but the code base is somewhat
showing its age and has accumulated some technical debt.
For example, much of the Puppet code deployed is specific to Tor (and
DSA, to a certain extent) and therefore is only maintained by a
handful of people. It would be preferable to migrate to third-party,
externally maintained modules (e.g. [systemd](https://gitlab.torproject.org/tpo/tpa/team/-/issues/33449), but also many
others, see [issue 29387][] for details). A similar problem exists
with custom Ruby code implemented for various functions, which is
being replaced with Hiera ([issue 30020][]).
### Glossary
This is a subset of the [Puppet glossary](https://puppet.com/docs/puppet/latest/glossary.html) to quickly get you
started with the vocabulary used in this document.
* **Puppet node**: a machine (virtual or physical) running Puppet
* **Manifest**: Puppet source code
* **Catalog**: a compiled set of Puppet source which gets applied
  on a **node** by a **Puppet agent**
* **Puppet agents**: the Puppet program that runs on all nodes to
apply manifests
* **Puppet server**: the server which all **agents** connect to to
fetch their **catalog**, also known as a **Puppet master** in older
Puppet versions (pre-6)
* **Facts**: information collected by Puppet agents on nodes, and
exported to the Puppet server
* **Reports**: log of changes done on nodes recorded by the Puppet
server
* **[PuppetDB](https://puppet.com/docs/puppetdb/) server**: an application server on top of a PostgreSQL
  database providing an [API](https://www.puppet.com/docs/puppetdb/7/api/overview) to query various resources like node
  facts, catalogs, reports, and exported resources
The Puppet server runs on `pauli.torproject.org`.
Two bare-mode git repositories live on this server, below
`/srv/puppet.torproject.org/git`:
- `tor-puppet-hiera-enc.git`, the external node classifier (ENC) code and data.
This repository has a hook that deploys to `/etc/puppet/hiera-enc`. See the
"External node classifier" section below.
- `tor-puppet.git`, the puppet environments, also referred to as the "control
repository". Contains the puppet modules and data. That repository has a
hook that deploys to `/etc/puppet/code/environments`. See the "Environments"
section below.
#### External node classifier
Before catalog compilation occurs, each node is assigned an environment
(`production`, by default) and a "role" through the ENC, which is configured
using the `tor-puppet-hiera-enc.git` repository. The node definitions at
`nodes/$FQDN.yaml` are merged with the defaults defined in
`nodes/default.yaml`.
To be more accurate, the ENC assigns top-scope `$role` variable to each node,
which is in turn used to include a `role::$rolename` class on each node. This
occurs in the default node definition in `manifests/site.pp` in
`tor-puppet.git`.
Some nodes include a list of classes, inherited from the previous Hiera-based
setup, but we're in the process of transitioning all nodes to single role
classes, see [issue 40030][] for progress on this work.
[issue 40030]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40030
#### Environments
Environments on the Puppet Server are managed using `tor-puppet.git` which is
our "control repository". Each branch on this repo is mapped to an environment
on the server which takes the name of the branch, with the exception of `main`,
which is mapped to the default environment `production`.
This deployment is orchestrated using a git `pre-receive` hook that's managed
via the `profile::puppet::server` class and the `puppet` module.
In order to test a new branch/environment on a Puppet node after being pushed
to the control repository, additional configuration needs to be done in
`tor-puppet-hiera-enc.git` to specify which node(s) should use the test
environment instead of `production`. This is done by editing the
`nodes/<name>.yaml` file and adding an `environment:` key at the document root.
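For example, to pin a node to a test environment named after its branch (both the node and branch names below are hypothetical), the node's ENC file would keep its usual contents and simply gain the extra key:

    # nodes/test-01.torproject.org.yaml
    environment: my-test-branch
    parameters:
      role: foo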
Once the environment is not needed anymore, the changes to the ENC should be
reverted before the branch is deleted on the control repo using `git push
--delete <branch>`. The git hook will take care of cleaning up the environment
files under `/etc/puppet/code/environments`.
It should be noted that contrary to Hiera data and modules, [exported
resources](#exported-resources) are not confined by environments. Rather, they
are all shared among all nodes regardless of their assigned environment.
The environments themselves are structured as follows. All paths are relative
to the root of that git repository.
- `3rdparty/modules` include modules that are shared publicly and do
not contain any TPO-specific configuration. There is a `Puppetfile`
there that documents where each module comes from and that can be
maintained with [r10k][] or [librarian][].
[librarian]: https://librarian-puppet.com/
[r10k]: https://github.com/puppetlabs/r10k/
- `modules` includes roles, profiles, and classes that make the bulk
of our configuration.
- The `torproject_org` module
(`modules/torproject_org/manifests/init.pp`) performs basic host
initialisation, like configuring Debian mirrors and APT sources,
installing a base set of packages, configuring puppet and timezone,
setting up a bunch of configuration files and running `ud-replicate`.
- There is also the `hoster.yaml` file
(`modules/torproject_org/misc/hoster.yaml`) which defines hosting
providers and specifies things like which network blocks they use,
if they have a DNS resolver or a Debian mirror. `hoster.yaml` is read
by
- the `nodeinfo()` function
(`modules/puppetmaster/lib/puppet/parser/functions/nodeinfo.rb`),
used for setting up the `$nodeinfo` variable
- `ferm`'s `def.conf` template (`modules/ferm/templates/defs.conf.erb`)
- The `manifests/site.pp` file is the environment's main manifest. Its
  purpose is to include a role class for the node as well as a number
  of other classes which are common to all nodes.
Note that the above is the current state of the file hierarchy. As
part of the Hiera transition ([issue 30020][]), a lot of the above
architecture will change in favor of the more standard
[role/profile/module][] layout.
Note that this layout might also change in the future with the
introduction of a role account ([issue 29663][]) and when/if the
repository is made public (which requires changing the layout).
See [ticket #29387][] for an in-depth discussion.
[issue 29387]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387
[role/profile/module]: https://puppet.com/docs/pe/2017.2/r_n_p_intro.html
[ticket #29387]: https://bugs.torproject.org/29387
[issue 30020]: https://bugs.torproject.org/30020
### Installed packages facts
The `modules/torproject_org/lib/facter/software.rb` file defines our
custom facts, making it possible to get answers to questions like "Is
this host running `apache2`?" by simply looking at a Puppet fact.

Those facts are deprecated: we should install packages through Puppet
instead of manually installing them on hosts.
Puppet manifests should generally follow the [Puppet style
guide][]. This can be easily done with [Flycheck][] in Emacs,
[vim-puppet][], or a similar plugin in your favorite text editor.
Many files do not *currently* follow the style guide, as they
*predate* the creation of said guide. Files should *not* be completely
reformatted unless there's a good reason. For example, if a
conditional covering a large part of a file is removed and the file
needs to be re-indented, it's a good opportunity to fix style in the
file. Same if a file is split in two components or for some other
reason completely rewritten.
Otherwise the style already in use in the file should be followed.
[Puppet style guide]: https://puppet.com/docs/puppet/4.8/style_guide.html
[Flycheck]: http://flycheck.org/
[vim-puppet]: https://github.com/rodjek/vim-puppet
### External Node Classifier (ENC)
We use an External Node Classifier (or ENC for short) to classify
nodes in different roles but also assign them environments and other
variables. The way the ENC works is that the Puppet server requests
information from the ENC about a node before compiling its catalog.
The Puppet server pulls three elements about nodes from the ENC:
* `environment` is the standard way to assign nodes to a Puppet
environment. The default is `production` which is the only
environment currently deployed.
* `parameters` is a hash where each key is made available as a
top-scope variable in a node's manifests. We use this to assign a
unique "role" to each node. The way this works is, for a given role
`foo`, a class `role::foo` will be included. That class should only
consist of a set of profile classes.
* `classes` is an array of class names which Puppet includes on the
target node. We are currently transitioning from this method of
including classes on nodes (previously in Hiera) to the `role`
parameter and unique role classes.
For a given node named `$fqdn`, these elements are defined in
`tor-puppet-hiera-enc.git/nodes/$fqdn.yaml`. Defaults can also be set
in `tor-puppet-hiera-enc.git/nodes/default.yaml`.
#### Role classes
Each host defined in the ENC declares which unique role it should be
attributed through the `parameters` hash. For example, this is what
configures a GitLab runner:

    parameters:
      role: gitlab::runner
Roles should be *abstract* and *not* implementation specific. Each
role class includes a set of profiles which *are* implementation
specific. For example, the `monitoring` role includes
`profile::prometheus::server` and `profile::grafana`.
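As a sketch, such a role class is little more than a list of profile includes (the actual class in our tree may differ slightly):

    # modules/role/manifests/monitoring.pp -- illustrative only
    class role::monitoring {
      include profile::prometheus::server
      include profile::grafana
    }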
As a temporary exception to this rule, old modules can be included as
we transition from the Hiera mechanism, but eventually those should
be ported to shared modules from the Puppet forge, with our glue built
into a profile on top of the third-party module. The role
`role::gitlab` follows that pattern correctly. See [issue 40030][] for
progress on that work.
### Hiera

[Hiera][] is a "key/value lookup tool for configuration data" which
Puppet uses to look up values for class parameters and node
configuration in general.

We are in the process of transitioning over to this mechanism from our
previous custom YAML lookup system. This documents the way we
currently use Hiera.
#### Common configuration
Class parameters which are common across several or all roles can be
defined in `hiera/common.yaml` to avoid duplication at the role level.
However, unless this parameter can be expected to change or evolve over
time, it's sometimes preferable to hardcode some parameters directly in
profile classes in order to keep this dataset from growing too much,
which can impact performance of the Puppet server and degrade its
readability. In other words, it's OK to place site-specific data in
profile manifests, as long as it may never or very rarely change.
These parameters can be overridden by role and node configurations.
#### Role configuration
Class parameters specific to a certain node role are defined in
`hiera/roles/${::role}.yaml`. This is the principal method by which we
configure the various profiles, thus shaping each of the roles we
maintain.
These parameters can be overridden by node-specific configurations.
#### Node configuration

On top of the role configuration, some node-specific configuration can
be performed from Hiera. This should be avoided as much as possible,
but sometimes there is just no other way. A good example was the
`build-arm-*` nodes, which included the following configuration:
bacula::client::ensure: "absent"
This disables backups on those machines, which are normally configured
everywhere. This is done because they are behind a firewall and
therefore not reachable, an unusual condition in the network. Another
example is `nutans`, which sits behind a NAT so it doesn't know its own
IP address. To export proper firewall rules, the allow address has been
overridden in its node configuration:
bind::secondary::allow_address: 89.45.235.22
Those types of parameters are normally guessed automatically inside
modules' classes, but they can be overridden from Hiera.
Note: eventually *all* host configuration will be done here, but there
are currently still some configurations hardcoded in individual
modules. For example, the Bacula director is hardcoded in the `bacula`
base class (in `modules/bacula/manifests/init.pp`). That should be
moved into a class parameter, probably in `common.yaml`.
Although Puppet supports running the agent as a daemon, our agent runs are
handled by a systemd timer/service unit pair: `puppet-run.timer` and
`puppet-run.service`. These are managed via the `profile::puppet` class and the
`puppet` module.
The runs are executed every 4 hours, with a random (but fixed per
host, using `FixedRandomDelay`) 4 hour delay to spread the runs across
the fleet.
Because the additional delay is fixed, changes should propagate to the
entire Puppet fleet within 4 hours. A Prometheus alert
(`PuppetCatalogStale`) will raise an alarm for hosts that have not run
for more than 24 hours.
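To see when the timer last fired on a node and when the next run is scheduled:

    systemctl list-timers puppet-run.timer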
### LDAP integration

The Puppet server is configured to talk with the LDAP server through a
few custom functions defined in
`modules/puppetmaster/lib/puppet/parser/functions`. The main plumbing
function is called `ldapinfo()` and connects to the LDAP server
through `db.torproject.org` over TLS on port 636. It takes a hostname
as an argument and will load all hosts matching that pattern under the
`ou=hosts,dc=torproject,dc=org` subtree. If the specified hostname is
the `*` wildcard, the result will be a hash of `host => hash` entries,
otherwise only the `hash` describing the provided host will be
returned.
The `nodeinfo()` function uses that function to populate the global
`$nodeinfo` hash available globally, or, more specifically, the
`$nodeinfo['ldap']` component. It also loads the `$nodeinfo['hoster']`
value from the `whohosts()` function. That function, in turn, tries to
match the IP address of the host against the "hosters" defined in the
`hoster.yaml` file.
The `allnodeinfo()` function does a similar task as `nodeinfo()`,
except that it loads *all* nodes from LDAP, into a single hash. It
does *not* include the "hoster" and is therefore equivalent to calling
`nodeinfo()` on each host and extracting only the `ldap` member hash
(although it is not implemented that way).
Puppet does not require any special credentials to access the LDAP
server. It accesses the LDAP database anonymously, although there is a
firewall rule (defined in Puppet) that grants it access to the LDAP
server.
There is a bootstrapping problem here: if one were to rebuild the
Puppet server, it would actually fail to compile its catalog because
it would not be able to connect to the LDAP server to fetch the data
it needs, unless the LDAP server has been manually configured to let
the Puppet server through.
NOTE: much (if not all?) of this is being moved into Hiera, in
particular the YAML files. See [issue 30020](https://trac.torproject.org/projects/tor/ticket/30020) for details. Moving
the host information into Hiera would resolve the bootstrapping
issues, but would require, in turn, some more work to resolve questions
like how users are granted access to individual hosts, which is
currently managed by `ud-ldap`. We cannot, therefore, simply move host
information from LDAP into Hiera without creating a duplicate source
of truth, unless we also rebuild or tweak the user distribution
system. See also the [LDAP design document](ldap#Design) for more information
about how LDAP works.
### Let's Encrypt TLS certificates
Public TLS certificates, as issued by Let's Encrypt, are distributed
by Puppet. Those certificates are generated by the "letsencrypt" Git
repository (see the [TLS documentation](tls) for details on that
workflow). The relevant part, as far as Puppet is concerned, is that
certificates magically end up in the following directory when a
certificate is issued or (automatically) renewed:
/srv/puppet.torproject.org/from-letsencrypt
See also the [TLS deployment docs](tls#lets-encrypt-workflow) for how that directory gets
populated.
Normally, those files would not be available from the Puppet
manifests, but the `ssl` Puppet module uses a special trick whereby
those files are read by Puppet `.erb` templates. For example, this is
how `.crt` files get generated on the Puppet master, in
`modules/ssl/templates/crt.erb`:
<%=
fn = "/srv/puppet.torproject.org/from-letsencrypt/#{@name}.crt"
out = File.read(fn)
out
%>
Similar templates exist for the other files.
Those certificates should not be confused with the "auto-ca" TLS certificates
in use internally and which are deployed directly using a symlink from the
environment's `modules/ssl/files/` to `/var/lib/puppetserver/auto-ca`, see
below.
### Internal auto-ca TLS certificates
The Puppet server also manages an internal CA which we informally call
"auto-ca". Those certificates are internal in that they are used to
authenticate nodes to each other, not to the public. They are used, for
example, to encrypt connections between mail servers (in Postfix) and
[backup servers](backup) (in Bacula).
The auto-ca deploys those certificates into an "auto-ca" directory under the
Puppet "$vardir", `/var/lib/puppetserver/auto-ca`, which is symlinked from the
environment's `modules/ssl/files/`. Details of that system are available in the
[TLS documentation](tls#internal-auto-ca).
## Issues
There is no issue tracker specifically for this project. [File][] or
[search][] for issues in the [team issue tracker][search] with the ~Puppet
label.
[File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues?label_name%5B%5D=Puppet
## Monitoring and testing
Puppet is monitored using Prometheus through the [Prometheus
reporter](https://forge.puppet.com/puppet/prometheus_reporter). This is a small Ruby module that ingests reports posted
by Puppet agent to the Puppet server and writes metrics to the
Prometheus node exporter textfile collector, in
`/var/lib/prometheus/node-exporter`.
We were previously checking Puppet *twice* when we were running
Icinga:
* One job ran on the Puppetmaster and checked PuppetDB for
reports. This was done with a [patched](https://github.com/evgeni/check_puppetdb_nodes/pull/14) version of the
[check_puppetdb_nodes](https://github.com/evgeni/check_puppetdb_nodes/) Nagios check, shipped inside the
`tor-nagios-checks` Debian package
* That job actually ran twice: once to check all manifests, and
  once more to check each host individually and assign the result to
  the right host.
The twin checks were present so that we could find stray Puppet hosts.
For example, if a host was retired from Icinga but not retired from
Puppet, or added to Icinga but not Puppet, we would notice. This was
necessary because the Icinga setup was not Puppetized: the twin check
now seems superfluous and we only check reports on the server.
Note that we *could* check agents individually with the [puppet agent
exporter](https://github.com/retailnext/puppet-agent-exporter).
There are no validation checks and *a priori* no peer review of code:
code is directly pushed to the Puppet server without validation. Work
is being done to [implement automated checks](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31226) but that is only
being deployed on the client side for now, and voluntarily. See the
[Validating Puppet code section](#validating-puppet-code) above.
PuppetDB exposes a performance dashboard which is accessible via the web. To
reach it, first establish an SSH port forward to `puppetdb-01` on port 8080 (a
typical invocation is shown below), and point your browser at:
http://localhost:8080/pdb/dashboard/index.html
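A typical forward, assuming direct SSH access to the PuppetDB host, looks like:

    ssh -L 8080:localhost:8080 puppetdb-01.torproject.org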
PuppetDB itself also holds performance information about the Puppet agent runs,
which are called "reports". Those reports contain information about changes
operated on each server, how long the agent runs take and so on. Those metrics
could be made more visible by using a dashboard, but that has not been
implemented yet (see [issue 31969][]).
[issue 31969]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31969
The Puppet server, Puppet agents and PuppetDB keep logs of their
operations. PuppetDB keeps its logs in `/var/log/puppetdb/` for a
maximum of 90 days or 1GB, whichever comes first (configured in
`/etc/puppetdb/request-logging.xml` and
`/etc/puppetdb/logback.xml`). The other logs are sent to `syslog`, and
usually end up in `daemon.log`.
Puppet should hold minimal personally identifiable information, like
user names, user public keys and project names.
## Other documentation
* [Latest Puppet docs](https://puppet.com/docs/puppet/latest/puppet_index.html) - might be too new, see also the [Puppet
5.5 docs](https://puppet.com/docs/puppet/5.5/puppet_index.html)
* [Function reference](https://puppet.com/docs/puppet/latest/function.html)
* [Type reference](https://puppet.com/docs/puppet/latest/type.html)
* [Mapping between versions of Puppet Enterprise, Facter, Hiera, Agent, etc](https://puppet.com/docs/pe/2019.0/component_versions_in_recent_pe_releases.html)
# Discussion
This section goes more in depth into how Puppet is setup, why it was
setup the way it was, and how it could be improved.
## Overview
Our Puppet setup dates back to 2011, according to the git history,
and was probably based off the [Debian System Administrator's Puppet
codebase](https://salsa.debian.org/dsa-team/mirror/dsa-puppet) which dates back to 2009.
## Goals
The general goal of Puppet is to provide basic automation across the
architecture, so that software installation and configuration, file
distribution, user and some service management is done from a central
location, managed in a git repository. This approach is often called
[Infrastructure as code](https://en.wikipedia.org/wiki/Infrastructure_as_Code).