As mentioned in the [installation section](#installation), the Puppet server
assumes a few components (namely [LDAP](ldap), [Nagios](nagios), [Let's
Encrypt](tls) and auto-ca) feed information into it. This is also
detailed in the sections below. In particular, Puppet constitutes a
duplicate "source of truth" for some information about servers. For
example, LDAP has a "purpose" field describing what a server is for,
but Puppet also has the concept of a role, attributed through Hiera
(see [issue 30273][]). A similar duplication exists with IP addresses
and user access control in general.

[issue 30273]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30273
Puppet is generally considered stable, but the code base is showing
its age and has accumulated some technical debt.
For example, much of the Puppet code deployed is specific to Tor (and
DSA, to a certain extent) and therefore is only maintained by a
handful of people. It would be preferable to migrate to third-party,
externally maintained modules (e.g. [systemd](https://gitlab.torproject.org/tpo/tpa/team/-/issues/33449), but also many
others, see [issue 29387][] for details). A similar problem exists
with custom Ruby code implemented for various functions, which is
being replaced with Hiera ([issue 30020][]).
The Puppet infrastructure is kept up to date with the latest
versions in Debian, but will require some work to port to Puppet 6, as
the current deployment system ("puppetmaster") has been removed in
that new release (see [issue 33588][]).

[issue 33588]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/33588
### Glossary
This is a subset of the [Puppet glossary](https://puppet.com/docs/puppet/latest/glossary.html) to quickly get you
started with the vocabulary used in this document.
* **Puppet node**: a machine (virtual or physical) running Puppet
* **Manifest**: Puppet source code
* **Catalog**: a set of compiled Puppet resources which gets applied
  on a **node** by a **Puppet agent**
* **Puppet agent**: the Puppet program that runs on all nodes to
  apply their **catalog**
* **Puppet server**: the server which all **agents** connect to to
fetch their **catalog**, also known as a **Puppet master** in older
Puppet versions (pre-6)
* **Facts**: information collected by Puppet agents on nodes, and
exported to the Puppet server
* **Reports**: log of changes done on nodes recorded by the Puppet
server
* **[PuppetDB](https://puppet.com/docs/puppetdb/) server**: an application server on top of a PostgreSQL
database providing an [API](https://puppet.com/docs/puppetdb/5.2/api/index.html) to query various resources like node
names, facts, reports and so on
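For example, a quick way to list the nodes PuppetDB knows about is to
query that API directly. This is a hypothetical sketch, assuming the
standard (v4) query endpoint is reachable locally on the PuppetDB
server:

    # list the certnames of all known nodes
    curl -s http://localhost:8080/pdb/query/v4/nodes | jq '.[].certname'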
### File layout

The Puppet server and PuppetDB server run on
`pauli.torproject.org`. That is where the main Git repository
(`tor-puppet`) lives, in
`/srv/puppet.torproject.org/git/tor-puppet`. That repository has hooks
to populate `/etc/puppet`, which is the live checkout from which the
Puppet server compiles its catalogs.
All paths below are relative to the root of that git repository.
- `3rdparty/modules` includes modules that are shared publicly and do
  not contain any TPO-specific configuration. There is a `Puppetfile`
  there that documents where each module comes from and that can be
  maintained with [r10k][] or [librarian][].
[librarian]: https://librarian-puppet.com/
[r10k]: https://github.com/puppetlabs/r10k/
- `modules` includes roles, profiles, and classes that make the bulk
of our configuration.
- each node is assigned a "role" through Hiera, in
`hiera/nodes/$FQDN.yaml`
To be more accurate, Hiera assigns a Puppet class to each node,
although each node should have only one special purpose class, a
"role", see [issue 40030][] for progress on that transition.
[issue 40030]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40030
- The `torproject_org` module
(`modules/torproject_org/manifests/init.pp`) performs basic host
initialisation, like configuring Debian mirrors and APT sources,
installing a base set of packages, configuring puppet and timezone,
setting up a bunch of configuration files and running `ud-replicate`.
- There is also the `hoster.yaml` file
  (`modules/torproject_org/misc/hoster.yaml`) which defines hosting
  providers and specifies things like which network blocks they use,
  and whether they have a DNS resolver or a Debian mirror (a
  hypothetical example is sketched below). `hoster.yaml` is read by:
  - the `nodeinfo()` function
    (`modules/puppetmaster/lib/puppet/parser/functions/nodeinfo.rb`),
    used for setting up the `$nodeinfo` variable
  - `ferm`'s `defs.conf` template (`modules/ferm/templates/defs.conf.erb`)
- the `manifests/site.pp` file is the traditional entry point, but it
  is now mostly empty, in favor of Hiera.
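A `hoster.yaml` entry could look something like this hypothetical
sketch (the key names are illustrative, derived from the description
above, not copied from the actual file):

    example-hoster:
      netrange:
        - 192.0.2.0/24
      nameservers:
        - 192.0.2.53
      mirror-debian: http://mirror.example.net/debian/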
Note that the above is the current state of the file hierarchy. As
part of the Hiera transition ([issue 30020][]), a lot of the above
architecture will change in favor of the more standard
[role/profile/module][] pattern.

Note that this layout might also change in the future with the
introduction of a role account ([issue 29663](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29663)) and when/if the
repository is made public (which requires changing the layout).
See [ticket #29387][] for an in-depth discussion.
[issue 29387]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387
[role/profile/module]: https://puppet.com/docs/pe/2017.2/r_n_p_intro.html
[ticket #29387]: https://bugs.torproject.org/29387
[issue 30020]: https://bugs.torproject.org/30020
### Installed packages facts
The `modules/torproject_org/lib/facter/software.rb` file defines our
custom facts, making it possible to get answers to questions like "Is
this host running `apache2`?" by simply looking at a Puppet
variable.

Those facts are deprecated: packages should be installed through
Puppet rather than manually on hosts.
### Style guide

Puppet manifests should generally follow the [Puppet style
guide][]. This can easily be done with [Flycheck][] in Emacs,
[vim-puppet][], or a similar plugin in your favorite text editor.

Many files do not *currently* follow the style guide, as they
*predate* the creation of said guide. Files should *not* be completely
reformatted unless there's a good reason. For example, if a
conditional covering a large part of a file is removed and the file
needs to be re-indented, that is a good opportunity to fix the style
of the file. The same applies if a file is split in two components or
completely rewritten for some other reason. Otherwise the style
already in use in the file should be followed.
[Puppet style guide]: https://puppet.com/docs/puppet/4.8/style_guide.html
[Flycheck]: http://flycheck.org/
[vim-puppet]: https://github.com/rodjek/vim-puppet
### Hiera

[Hiera][] is a "key/value lookup tool for configuration data" which
Puppet uses to look up values for class parameters and node
configuration in general.

We are in the process of transitioning to this mechanism from our
previous custom YAML lookup system. This section documents the way we
currently use Hiera.
[Hiera]: https://puppet.com/docs/hiera/3.2/
Each host declares which class it should include through a `classes`
parameter. For example, this is what configures a Prometheus server:

    classes:
      - roles::monitoring
Roles should be *abstract* and *not* implementation specific. Each
role includes a set of profiles which *are* implementation
specific. For example, the `monitoring` role includes
`profile::prometheus::server` and `profile::grafana`. Do *not* include
profiles directly from Hiera.
As a temporary exception to this rule, old modules can be included as
we transition from the `has_role` mechanism to Hiera, but eventually
those should be ported to shared modules from the Puppet forge, with
our glue built into a profile on top of the third-party module. The
role `roles::monitoring` follows that pattern correctly. See [issue
40030][] for progress on that work.
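For illustration, a role class following that pattern could look like
this minimal sketch (the actual `roles::monitoring` class may differ
in its details):

    # sketch of a role: abstract, and composed only of profiles
    class roles::monitoring {
      include profile::prometheus::server
      include profile::grafana
    }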
On top of the host configuration, some node-specific configuration can
be performed from Hiera. This should be avoided as much as possible,
but sometimes there is just no other way. A good example was the
`build-arm-*` nodes which included the following configuration:

    bacula::client::ensure: "absent"

This disables backups on those machines, which are normally configured
everywhere. This is done because they are behind a firewall and
therefore not reachable, an unusual condition in the network. Another
example is `nutans`, which sits behind a NAT and therefore doesn't
know its own IP address. To export proper firewall rules, the allow
address has been hardcoded in its Hiera configuration:

    bind::secondary::allow_address: 89.45.235.22

Those types of parameters are normally guessed automatically inside
the Puppet code.
Note: eventually *all* host configuration will be done here, but there
are currently still some configurations hardcoded in individual
modules. For example, the Bacula director is hardcoded in the `bacula`
base class (in `modules/bacula/manifests/init.pp`). That should be
moved into a class parameter, probably in `common.yaml`.
### Scheduling

The Puppet agent is *not* running as a daemon: it runs through good
old `cron`. Puppet runs on each node every four hours, although with a
random two-hour jitter, so the actual interval between runs is
somewhere between four and six hours.
This configuration is in `/etc/cron.d/puppet-crontab` and deployed by
Puppet itself, currently as part of the `torproject_org` module.
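For illustration, such a schedule could be expressed as a Puppet
`cron` resource along the following lines. This is a minimal sketch,
not the actual `torproject_org` implementation; in particular, the
real crontab achieves the jitter with a random delay of up to two
hours before each run:

    cron { 'puppet-agent':
      user    => 'root',
      # fqdn_rand() derives a stable, per-host pseudo-random offset,
      # spreading agent runs (and thus server load) across the fleet
      minute  => fqdn_rand(60),
      hour    => '*/4',
      command => '/usr/bin/puppet agent --onetime --no-daemonize --quiet',
    }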
### LDAP integration

The Puppet server is configured to talk to LDAP through a few custom
functions defined in
`modules/puppetmaster/lib/puppet/parser/functions`. The main plumbing
function is called `ldapinfo()` and connects to the LDAP server
through `db.torproject.org` over TLS on port 636. It takes a hostname
as an argument and will load all hosts matching that pattern under the
`ou=hosts,dc=torproject,dc=org` subtree. If the specified hostname is
the `*` wildcard, the result will be a hash of `host => hash` entries,
otherwise only the `hash` describing the provided host will be
returned.
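For illustration, usage inside a manifest might look like this sketch
(hypothetical calls; the authoritative signatures are in the function
definitions themselves):

    # look up the LDAP entry for a single host
    $host_entry = ldapinfo('pauli.torproject.org')
    # look up all hosts at once, as a hash of host => hash entries
    $all_hosts = ldapinfo('*')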
The `nodeinfo()` function uses that function to populate the
`$nodeinfo` hash, available globally, and more specifically its
`$nodeinfo['ldap']` component. It also loads the `$nodeinfo['hoster']`
value from the `whohosts()` function. That function, in turn, tries to
match the IP address of the host against the "hosters" defined in the
`hoster.yaml` file.
The `allnodeinfo()` function does a similar task as `nodeinfo()`,
except that it loads *all* nodes from LDAP, into a single hash. It
does *not* include the "hoster" and is therefore equivalent to calling
`nodeinfo()` on each host and extracting only the `ldap` member hash
(although it is not implemented that way).
Puppet does not require any special credentials to access the LDAP
server. It accesses the LDAP database anonymously, although there is a
firewall rule (defined in Puppet) that grants it access to the LDAP
server.
There is a bootstrapping problem here: if one were to rebuild the
Puppet server, it would fail to compile its catalog, because it would
not be able to connect to the LDAP server to fetch host information,
unless the LDAP server had been manually configured to let the new
Puppet server through.
NOTE: much (if not all?) of this is being moved into Hiera, in
particular the YAML files. See [issue 30020](https://trac.torproject.org/projects/tor/ticket/30020) for details. Moving
the host information into Hiera would resolve the bootstrapping
issues, but would require, in turn, some more work to resolve
questions like how users get granted access to individual hosts, which
is currently managed by `ud-ldap`. We cannot, therefore, simply move
host information from LDAP into Hiera without creating a duplicate
source of truth, unless we also rebuild or tweak the user distribution
system. See also the [LDAP design document](ldap#Design) for more information
about how LDAP works.
### Nagios integration

Nagios (which is really Icinga, but let's call it Nagios because
that's what it's called everywhere in the source) is hooked into
Puppet through an external sync system. Our [Nagios deployment](nagios) operates
through Git hooks which run a special `Makefile` that compiles and
deploys the Icinga configuration, but also compiles the client-side
NRPE configuration.
The NRPE configuration is generated on the Nagios server and then
pushed to the Puppet server with `rsync` over SSH, using a public key
distributed by Puppet from the `roles::puppetmaster` class. That key
has a restricted `command` field which limits what it can write on the
Puppet server to this single file:

    /etc/puppet/modules/nagios/files/tor-nagios/generated/nrpe_tor.cfg
This file then gets distributed to all nodes through the
`nagios::client` class using a simple `File` resource.
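A minimal sketch of what that resource could look like follows; the
destination path and service name are illustrative assumptions, only
the `source` path follows from the module layout described above:

    file { '/etc/nagios/nrpe.d/nrpe_tor.cfg':  # destination is an assumption
      source => 'puppet:///modules/nagios/tor-nagios/generated/nrpe_tor.cfg',
      owner  => 'root',
      group  => 'root',
      mode   => '0644',
      notify => Service['nagios-nrpe-server'], # hypothetical service name
    }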
So when a Nagios check is added or changed, Puppet needs to run on all
the affected hosts for the check to take effect, on top of, of course,
adding the check to the Nagios Git repository.
### Let's Encrypt TLS certificates
Public TLS certificates, as issued by Let's Encrypt, are distributed
by Puppet. Those certificates are generated by the "letsencrypt" Git
repository (see the [TLS documentation](tls) for details on that
workflow). The relevant part, as far as Puppet is concerned, is that
certificates magically end up in the following directory when a
certificate is issued or (automatically) renewed:

    /srv/puppet.torproject.org/from-letsencrypt
See also the [TLS deployment docs](tls#lets-encrypt-workflow) for how that directory gets
populated.
Normally, those files would not be available from the Puppet
manifests, but the `ssl` Puppet module uses a special trick whereby
those files are read by Puppet `.erb` templates. For example, this is
how `.crt` files get generated on the Puppet master, in
`modules/ssl/templates/crt.erb`:
    <%=
    fn = "/srv/puppet.torproject.org/from-letsencrypt/#{@name}.crt"
    out = File.read(fn)
    out
    %>
Similar templates exist for the other files.
Those certificates should not be confused with the "auto-ca" TLS
certificates in use internally and which are deployed directly in
`/etc/puppet/modules/ssl/files/`, see below.
### Internal auto-ca TLS certificates
The Puppet server also manages an internal CA which we informally call
"auto-ca". Those certificates are internal in that they are used to
authenticate nodes to each other, not to the public. They are used,
for example, to encrypt connections between mail servers (in Postfix)
and [backup servers](backup) (in Bacula).
The auto-ca deploys those certificates directly inside the Puppet
server checkout, in `/etc/puppet/modules/ssl/files/certs/` and
`.../clientcerts/`. Details of that system are available in the [TLS documentation](tls#internal-auto-ca).
## Issues
There is no issue tracker specifically for this project. [File][] or
[search][] for issues in the [team issue tracker][search].
[File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues
## Monitoring and testing
Puppet is hooked into Nagios in two ways:
* one job runs on the Puppetmaster and checks PuppetDB for
  reports. This was done with a [patched](https://github.com/evgeni/check_puppetdb_nodes/pull/14) version of the
  [check_puppetdb_nodes](https://github.com/evgeni/check_puppetdb_nodes) Nagios check, now packaged inside the
  `tor-nagios-checks` Debian package
* another job runs on each Puppet node and will therefore work even
  if the Puppetmaster dies for some reason. This is done with the
  [check_puppet_agent](https://github.com/aswen/nagios-plugins/blob/master/check_puppet_agent) Nagios check, now also packaged inside the
  `tor-nagios-checks` Debian package
This was [implemented in March 2019](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29676). An alternative implementation
[using Prometheus](https://forge.puppet.com/puppet/prometheus_reporter) was considered but [Prometheus still hasn't
replaced Nagios](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864) at the time of writing.
There are no validation checks and *a priori* no peer review of code:
code is directly pushed to the Puppet server without validation. Work
is being done to [implement automated checks](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31226) but that is only
being deployed on some clients for now.
Note that PuppetDB itself holds performance information about the
Puppet agent runs, which are called "reports". Those reports contain
information about changes operated on each server, how long the agent
runs take and so on. Those metrics could be made more visible by using
a dashboard, but that has not been implemented yet (see [issue
31969][]).
[issue 31969]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31969
The Puppet server, Puppet agents and PuppetDB keep logs of their
operations. The latter keeps its logs in `/var/log/puppetdb/` for a
maximum of 90 days or 1GB, whichever comes first (configured in
`/etc/puppetdb/request-logging.xml` and
`/etc/puppetdb/logback.xml`). The other logs are sent to `syslog`, and
usually end up in `daemon.log`.
Puppet should hold minimal personally identifiable information, like
user names, user public keys and project names.
# Discussion
This section goes more in depth into how Puppet is setup, why it was
setup the way it was, and how it could be improved.
## Overview
Our Puppet setup dates back to 2011, according to the Git history,
and was probably based on the [Debian System Administrator's Puppet
codebase](https://salsa.debian.org/dsa-team/mirror/dsa-puppet), which dates back to 2009.
## Goals
The general goal of Puppet is to provide basic automation across the
architecture, so that software installation and configuration, file
distribution, user and some service management is done from a central
location, managed in a git repository. This approach is often called
[Infrastructure as code](https://en.wikipedia.org/wiki/Infrastructure_as_Code).
This section also documents possible improvements to our Puppet
configuration that we are considering.
* **secure**: only sysadmins should have access to push configuration,
  whatever happens. This includes deploying only audited and verified
  Puppet code into production.
* **code review**: changes on servers should be verifiable by our peers,
  through a Git commit log
* **fix permissions issues**: the deployment system should allow all
  admins to push code to the Puppet server without having to constantly
  fix permissions (e.g. through a [role account](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29663))
* **secrets handling**: there are some secrets in Puppet. Those
  should remain secret.
We mostly have this now, although there are concerns about permissions
being wrong sometimes, which a role account could fix.
Those are mostly issues with the current architecture we'd like to fix:
* **Continuous Integration**: before deployment, code should be vetted by
a peer and, ideally, automatically checked for errors and tested
* **single source of truth**: when we add/remove nodes, we should not
  have to talk to multiple services (see also the [install automation
  ticket](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31239) and the [new-machine discussion](new-machine#discussion))
* **collaboration** with other sysadmins outside of TPA, for which we
would need to...
* ... **publicize our code** (see [ticket 29387](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387))
* **no manual changes**: every change on every server should be committed
to version control somewhere
* **bare-metal recovery**: it should be possible to recover a service's
*configuration* from a bare Debian install with Puppet (and with
data from the [backup](backup) service of course...)
* **one commit only**: we shouldn't have to commit "twice" to get
changes propagated (once in a submodule, once in the parent module,
for example)
* **ad hoc changes** to the infrastructure should remain possible:
  one-off jobs should be handled by [fabric](fabric), Cumin, or
  straight SSH.
## Approvals required
TPA should approve policy changes as per [tpa-rfc-1](/policy/tpa-rfc-1-policy).
## Proposed Solution
To improve on the above "Goals", I would suggest the following
configuration.
TL;DR:
1. Use a control repository
2. Get rid of 3rdparty
3. Deploy with g10k
4. Authenticate with checksums
5. Deploy to branch-specific environments
6. Rename the default branch "production"
7. Push directly on the Puppet server
8. Use a role account
9. Use local test environments
10. Develop a test suite
11. Hook into CI
12. OpenPGP verification and web hook
Steps 1-8 could be implemented without too much difficulty and should
be a mid term objective. Steps 9 to 12 require significantly more work
and could be implemented once the new infrastructure stabilizes.
What follows is an explanation and justification of each step.
### Use a control repository
The base of the infrastructure is a [control-repo](https://puppet.com/docs/pe/latest/control_repo.html) ([example](https://github.com/puppetlabs/control-repo))
which chain-loads all the other modules. This implies turning all our
"modules" into "profiles" and moving "real" modules (which are fit for
public consumption) "outside", into public repositories (see also
[issue 29387: publish our puppet repository](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387)).
Note that the control repository *could* also be public: we could
simply have the private data inside of Hiera or some other private
repository.
The control repository concept is specific to the proprietary version
of Puppet (Puppet Enterprise or PE) but its logic should be usable
with the open source Puppet release as well.
### Get rid of 3rdparty
The control repo's core configuration file is the `Puppetfile`. We
already use a Puppetfile, but only to manage modules inside of the
`3rdparty` directory. Now it would manage *all* modules, or, more
specifically, `3rdparty` would become the default `modules` directory
which would, incidentally, encourage us to upstream our modules and
publish them to the world.
Our current `modules` directory would move into `site-modules`, which
is the designated location for "roles, profiles, and custom
modules". This has been suggested before in [issue 29387: publish our
puppet repository](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29387) and is important for the `Puppetfile` to do its
job.
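For illustration, a control-repo `Puppetfile` could pin both Forge and
Git modules along these lines (the module names and versions are only
examples):

    # Forge module, pinned to a released version
    mod 'puppetlabs/stdlib', '6.3.0'

    # Git-hosted module, pinned to a tag
    mod 'systemd',
      :git => 'https://github.com/voxpupuli/puppet-systemd',
      :ref => 'v2.10.0'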
### Deploy with g10k
It seems clear that everyone is converging on the use of a
`Puppetfile` to deploy code. While there are still monorepos out
there, they do make our life harder, especially when we need to
operate on non-custom modules.

Instead, we should converge towards *not* tracking upstream modules
in our Git repository. Modules managed by the `Puppetfile` would *not*
be managed in our Git monorepo and would instead be deployed by
`r10k` or `g10k` (most likely the latter because of its support for
checksums).
Note that neither `r10k` nor `g10k` resolves dependencies in a
`Puppetfile`. We therefore also need a tool to verify the file
correctly lists all required modules. The following solutions need to
be validated but could address that issue:
* [generate-puppetfile](https://github.com/rnelson0/puppet-generate-puppetfile): take a `Puppetfile` and walk the
dependency tree, generating a new `Puppetfile` (see also [this
introduction to the project](https://rnelson0.com/2015/11/06/introducing-generate-puppetfile-or-creating-a-ruby-program-to-update-your-puppetfile-and-fixtures-yml/))
* [Puppetfile-updater](https://github.com/camptocamp/puppetfile-updater): read the `Puppetfile` and fetch new releases
* [ra10ke](https://github.com/voxpupuli/ra10ke): a bunch of Rake tasks to validate a `Puppetfile`
* `r10k:syntax`: syntax check, see also `r10k puppetfile check`
* `r10k:dependencies`: check for out of date dependencies
* `r10k:solve_dependencies`: check for **missing** dependencies
* `r10k:install`: wrapper around `r10k` to install with some
caveats
* `r10k:validate`: make sure modules are accessible
* `r10k:duplicates`: look for duplicate declarations
* [lp2r10k](https://github.com/dharmabruce/lp2r10k/): convert "librarian" `Puppetfile` (missing
dependencies) into a "r10k" `Puppetfile` (with dependencies)
Note that this list comes from the [updating your Puppetfile](https://github.com/puppetlabs/r10k/blob/master/doc/updating-your-puppetfile.mkd#automatic-updates)
documentation in the r10k project, which is also relevant here.
### Authenticate code with checksums
This part is the main problem with moving away from a monorepo. By
using a monorepo, we can audit the code we push into production. But
if we offload this to `r10k`, it can download code from wherever the
`Puppetfile` says, effectively shifting our trust path from OpenSSH
to HTTPS, the Puppet Forge, git and whatever remote gets added to the
`Puppetfile`.
There is no obvious solution for this right now, surprisingly. Here
are two possible alternatives:
1. [g10k](https://github.com/xorpaul/g10k/) supports using a `:sha256sum` parameter to checksum
   modules, but that only works for Forge modules. Maybe we could
   pair this with using an explicit `sha1` reference for git
   repositories, ensuring those are checksummed as well. The downside
   of that approach is that it leaves checked out git repositories in
   a "detached head" state.
2. `r10k` has a [pending pull request](https://github.com/puppetlabs/r10k/pull/823) to add a `filter_command`
   directive which could run after a git checkout has been
   performed. It could presumably be used to verify OpenPGP
   signatures on git commits, although this would work only on
   modules we sign commits on (and therefore not third party)
It seems the best approach would be to use g10k for now, with
checksums on both git commits and Forge modules. A validation hook
running *before* g10k could validate that all `mod` lines have a
checksum of some sort.
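In a `Puppetfile` consumed by g10k, those pins might look like this
sketch (the hashes are made up, and the exact syntax should be checked
against the g10k documentation):

    # Forge module, authenticated with g10k's :sha256sum parameter
    mod 'puppetlabs/stdlib', '6.3.0', :sha256sum => '0c8fe9...'

    # Git module, pinned to an explicit commit instead of a branch
    mod 'systemd',
      :git => 'https://github.com/voxpupuli/puppet-systemd',
      :ref => '4f8de1...'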
Note that this approach does *NOT* solve the "double-commit" problem
identified in the Goals. It is believed that only a "monorepo" would
fix that problem and that approach comes in direct conflict with the
"collaboration" requirement. We chose the latter.
This could be implemented as a patch to `ra10ke`.
### Deploy to branch-specific environments
A key feature of r10k (and, of course, g10k) is that they are capable
of deploying code to new environments depending on the branch we're
working on. We would enable that feature to allow testing some large
changes to critical code paths without affecting all servers.
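For illustration, the r10k side of this could be an `r10k.yaml` along
these lines (a sketch only: the `basedir` is r10k's default, and the
remote is derived from the repository path mentioned earlier):

    sources:
      main:
        remote: 'file:///srv/puppet.torproject.org/git/tor-puppet'
        basedir: '/etc/puppetlabs/code/environments'
        # each branch in the remote becomes an environment of the same name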
### Rename the default branch "production"
In accordance with Puppet's best practices, the control repository's
default branch would be called "production" and not "master".
Also: Black Lives Matter.
### Push directly on the Puppet server
Because we are worried about the GitLab attack surface, we could still
keep on pushing to the Puppet server for now. The control repository
could be mirrored to GitLab using a deploy key. All other repositories
would be published on GitLab anyways, and there the attack surface
would not matter because of the checksums in the control repository.
### Use a role account
To avoid permission issues, use a role account (say `git`) to accept
pushes and enforce git hooks.
### Use local test environments
It should eventually be possible to test changes locally before
pushing to production. This would involve radically simplifying the
Puppet server configuration and probably either getting rid of the
LDAP integration or at least making it optional so that changes can be
tested without it.
This would involve "puppetizing" the Puppet server configuration so
that a Puppet server and test agent(s) could be bootstrapped
automatically. Operators would run "smoke tests" (running Puppet by
hand and looking at the result) to make sure their code works before
pushing to production.
### Develop a test suite
The next step is to start working on a test suite for services, at
least for new deployments, so that code can be tested without running
things by hand. Plenty of Puppet modules have such test suite,
generally using [rspec-puppet](https://rspec-puppet.com/) and [rspec-puppet-facts](https://github.com/mcanevet/rspec-puppet-facts), and we
already have a few modules in `3rdparty` that have such tests. The
idea would be to have those tests on a per-role or per-profile basis.
The Foreman people have published [their test infrastructure](https://github.com/theforeman/foreman-infra/tree/master/puppet) which
could be useful as inspiration for our purposes here.
### Hook into continuous integration
Once tests are functional, the last step is to move the control
repository into GitLab directly and start running CI against the
Puppet code base. This would probably not happen until GitLab CI is
deployed, and would require lots of work to get there, but would
eventually be worth it.
The GitLab CI would be indicative: an operator would need to push to a
topic branch there first to confirm tests pass but would still push
directly to the Puppet server for production.
Note that we are working on (client-side) validation hooks for now,
see [issue 31226][].
[issue 31226]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31226
### OpenPGP verification and web hook
To stop pushing directly to the Puppet server, we could implement
OpenPGP verification on the control repository. If a hook checks that
commits are signed by a trusted party, it does not matter where the
code is hosted.
A good reference for OpenPGP verification is [this guix article](https://guix.gnu.org/blog/2020/securing-updates/) which covers a few scenarios.
We could use the [webhook](https://github.com/voxpupuli/puppet_webhook) system to have GitLab notify the Puppet
server to pull code.
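For illustration, a server-side hook could then be as simple as this
sketch, assuming the trusted OpenPGP keys are present in the verifying
user's keyring:

    #!/bin/sh
    # pre-receive hook sketch: reject any push whose new tip commit does
    # not carry a valid OpenPGP signature from a key in the local keyring
    while read oldrev newrev refname; do
        git verify-commit "$newrev" || exit 1
    done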
## Cost
N/A.
## Alternatives considered
Ansible was considered for managing [GitLab](gitlab) for a while, but
this was eventually abandoned in favor of using Puppet and the
"Omnibus" package.
For code management, I have done a more extensive review of possible
alternatives. [This talk](https://www.youtube.com/watch?v=RdIyStATgFE) is a good introduction to git submodules,
librarian and r10k. Based on that talk and [these slides](https://arlimus.github.io/slides/librarian.and.r10k/), I've made
the following observations:
### monorepo
This is our current approach: all code is committed in
one monolithic repository. This effectively makes it impossible to
share code with anyone outside of the repository, because there is
private data inside, but also because it doesn't follow the standard
role/profile/modules separation that makes collaboration possible at
all. To work around that, I designed a workflow where we locally clone
subrepos as needed, but this is clunky as it requires committing every
change twice: once for the subrepo, once for the parent.

Our giant monorepo also mixes all changes together, which can be a pro
*and* a con: on the one hand it's easy to see and audit all changes at
once, but on the other hand, it can be overwhelming and confusing.
But it does allow us to integrate with librarian right now and is a
good stopgap solution. A better solution would need to solve the
"double-commit" problem and still allow us to have smaller
repositories that we can collaborate on outside of our main tree.
### git submodules

The talk partially covers how `git submodules` work and how
hard they are to deal with. I say partially because submodules are
even harder to deal with than the examples she gives. She shows how
submodules are hard to add and remove, because the metadata is stored
in multiple locations (`.gitmodules`, `.git/config`,
`.git/modules/` and the submodule repository itself).
She also mentions submodules don't know about dependencies and it's
likely you will break your setup if you forget one step. (See [this
post](https://web.archive.org/web/20171101202911/http://somethingsinistral.net/blog/git-submodules-are-probably-not-the-answer/) for more examples.)
In my experience, the biggest annoyance with submodules is the
"double-commit" problem: you need to make commits in the submodule,
then *redo* the commits in the parent repository to chase the head of
that submodule. This does not improve on our current situation, which
is that we need to do those two commits anyways in our giant monorepo.
One advantage with submodules is that they're mostly standard:
everyone knows about them, even if they're not familiar with the
details, and that knowledge is reusable outside of Puppet.
### librarian

Librarian is written in Ruby. It's built on top of [another library
called librarian](https://github.com/applicationsonline/librarian) that is used by Ruby's [bundler](https://gembundler.com/). At the time
of the talk, librarian was "pretty active", but it now
seems to be [abandoned](https://github.com/voxpupuli/librarian-puppet/issues/48), so we might be forced to use r10k in the
future, which has a quite different workflow.
One problem with librarian right now is that `librarian update` clears
any existing git subrepo and re-clones it from scratch. If you have
temporary branches that were not pushed remotely, all of those are
lost forever. That's really bad and annoying! It's by design: librarian
"takes over your modules directory", as the speaker explains in the
talk, and everything comes from the Puppetfile.
Librarian does resolve dependencies recursively and stores the decided
versions in a lockfile, which allows us to "see" what happens when you
update from a Puppetfile.
But there's no cryptographic chain of trust between the repository
where the Puppetfile is and the modules that are checked out. Unless
the module is checked out from git (which isn't the default), only
version range specifiers constrain which code is checked out, which
gives a huge surface area for arbitrary code injection in the entire
Puppet infrastructure (e.g. MITM, Forge compromise, or hostile
upstream attacks).
### r10k

r10k was written because librarian was too slow for large
deployments. But it covers more than just managing code: it also
manages environments and is designed to run on the Puppet master. It
doesn't have dependency resolution or a `Puppetfile.lock`,
however. See [this ticket](https://github.com/puppetlabs/r10k/issues/38), closed in favor of [that one](https://tickets.puppetlabs.com/browse/RK-3).

r10k is more complex and very opinionated: it requires lots of
configuration, including its own YAML file and hooks into the
Puppetmaster, and can [take a while to deploy](http://garylarizza.com/blog/2014/02/18/puppet-workflow-part-3/). r10k is still in [active
development](https://github.com/puppetlabs/r10k/releases) and is supported by Puppet Labs, so there's [official
documentation](https://puppet.com/docs/pe/2019.1/r10k.html) in the Puppet documentation.

It is often used in conjunction with librarian for dependency
resolution.
One cool feature is that r10k allows you to create dynamic
environments based on branch names. All you need is a single repo with
a Puppetfile and r10k handles the rest. The problem, of course, is
that you need to trust it's going to do the right thing. There's the
security issue, but there's also the problem of resolving dependencies
and you *do* end up double-committing in the end if you use branches
in sub-repositories. But maybe that is unavoidable.
(Note that there are ways of resolving dependencies with external
tools, like [generate-puppetfile](https://github.com/rnelson0/puppet-generate-puppetfile) ([introduction](https://rnelson0.com/2015/11/06/introducing-generate-puppetfile-or-creating-a-ruby-program-to-update-your-puppetfile-and-fixtures-yml/)), [this hack
that reformats librarian output](https://github.com/dharmabruce/lp2r10k/blob/master/lp2r10k), or [those rake tasks](https://github.com/voxpupuli/ra10ke). There's
also a [Go rewrite called g10k](https://github.com/xorpaul/g10k) that is much faster, but with
similar limitations.)
### git subtree

[This article](https://web.archive.org/web/20171107082413/http://somethingsinistral.net/blog/scaling-puppet-environment-deployment/) briefly mentions git subtrees from the point of view
of Puppet management. It outlines how it's nice that the history
of the subtree gets merged as-is in the parent repo, which gives us
the best of both worlds (individual, per-module history along with
a global view in the parent repo). It makes, however, rebasing in
subtrees impossible, as it breaks the parent merge. You also end up
with some of the disadvantages of the monorepo, in that all the code
is actually committed in the parent repo, and you *do* have to commit
twice as well.
### git subrepo

TODO. <https://github.com/ingydotnet/git-subrepo>
### myrepos

[myrepos](https://myrepos.branchable.com/) is one of many solutions to manage multiple git
repositories. It has been used in the past at my old workplace
(Koumbit.org) to manage and check out multiple git repositories.
Like a `Puppetfile` without a lock file, it doesn't enforce
cryptographic integrity between the master repositories and the
subrepositories: all it does is define remotes and their locations.

Like r10k, it doesn't handle dependencies and will require extra
setup, although it's much lighter than r10k.
Its main disadvantage is that it isn't well known and might seem
esoteric to people. It also has weird failure modes, but could be used
in parallel with a monorepo. For example, it might allow us to set up
specific remotes in subdirectories of the monorepo automatically.
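For example, a hypothetical `.mrconfig` could register the upstream of
one subdirectory of the monorepo like this:

    # sketch: tie a monorepo subdirectory to its upstream repository
    [3rdparty/modules/systemd]
    checkout = git clone https://github.com/voxpupuli/puppet-systemd systemd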
| Approach | Pros | Cons | Summary |
|------------|----------------------------|------------------------------------------|---------------------------|
| Monorepo | Simple | Double-commit | Status quo |
| Submodules | Well-known | Hard to use, double-commit | Not great |
| Librarian | Dep resolution client-side | Unmaintained, bad integration with git | Not sufficient on its own |
| r10k       | Standard                   | Hard to deploy, opinionated              | To evaluate further       |
| Subtree    | "best of both worlds"      | Still get double-commit, rebase problems | Not sure it's worth it    |
| Subrepo    | ?                          | ?                                        | ?                         |
| myrepos    | Flexible                   | Esoteric                                 | Might be useful with our monorepo |
### Best practices survey
I made a survey of the community (mostly the [shared puppet
modules](https://gitlab.com/shared-puppet-modules-group/) and [Voxpupuli](https://voxpupuli.org/) groups) to find out what the best
current practices are.
Koumbit uses foreman/puppet, but pinned at version 10.1 because it is
the last one supporting "passenger" (the puppetmaster deployment
method currently available in Debian, deprecated and dropped from
Puppet 6). They [patched it](https://redmine.koumbit.net/projects/theforeman-puppet/repository/revisions/5b1b0b42f2d7d7b01eacde6584d3) to support `puppetlabs/apache < 6`.
They push to a bare repo on the puppet master, then they have
validation hooks (the inspiration for our own hook implementation, see
[issue 31226][]), and a hook deploys the code to the right branch.
They were using r10k but stopped because they had issues when r10k
would fail to deploy code atomically, leaving the puppetmaster (and
all nodes!) in an unusable state. This would happen when their git
servers were down without a locally cached copy. They also implemented
branch cleanup on deletion (although that could have been done some
other way). That issue was apparently reported against r10k but never
got a response. They now use puppet-librarian in their custom
hook. Note that it's possible r10k does not actually have that issue
because they found the issue they filed and it was... [against
librarian](https://github.com/voxpupuli/librarian-puppet/issues/73)!
Some people in #voxpupuli seem to use the Puppet Labs Debian packages
and therefore puppetserver, r10k and puppetboard. Their [Monolithic
master](https://voxpupuli.org/docs/monolithic/) architecture uses an external git repository, which pings
the puppetmaster through a [webhook](https://github.com/voxpupuli/puppet_webhook) that deploys a
[control-repo](https://puppet.com/docs/pe/latest/control_repo.html) ([example](https://github.com/puppetlabs/control-repo)) and calls r10k to deploy the
code. They also use [Foreman](https://www.theforeman.org/) as a node classifier. That procedure
uses the following modules:
* [puppet/puppetserver](https://forge.puppet.com/puppet/puppetserver)
* [puppetlabs/puppet_agent](https://forge.puppet.com/puppetlabs/puppet_agent)
* [puppetlabs/puppetdb](https://forge.puppet.com/puppetlabs/puppetdb)
* [puppetlabs/puppet_metrics_dashboard](https://forge.puppet.com/puppetlabs/puppet_metrics_dashboard)
* [voxpupuli/puppet_webhook](https://github.com/voxpupuli/puppet_webhook)
* [r10k](https://github.com/puppetlabs/r10k) or [g10k](https://github.com/xorpaul/g10k)
* [Foreman](https://www.theforeman.org/)
They also have a [master of masters](https://voxpupuli.org/docs/master_agent/) architecture for scaling to
larger setups. For scaling, I have found [this article](https://puppet.com/blog/scaling-open-source-puppet/) to be more
interesting, that said.
So, in short, it seems people are converging towards r10k with a
web hook. To validate git repositories, they mirror the repositories
to a private git host.