This should now be a largely functional deployment of mailman 3. There
are still some bits that need testing but we'll use followup changes to
force failure and hold nodes.
This deployment of mailman3 uses upstream docker container images. We
currently hack up uids and gids to accommodate that. We also hack up the
settings file and bind mount it over the upstream file in order to use
host networking. We override the hyperkitty index type to xapian. All
list domains are hosted in a single installation and we use native
vhosting to handle that.
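For reference, the kind of compose override involved looks roughly like
the following (an untested sketch; the image tag, paths and service
names are illustrative, not the exact production values):

  services:
    mailman-web:
      # upstream docker-mailman image
      image: maxking/mailman-web:rolling
      # host networking, as described above
      network_mode: host
      volumes:
        # bind mount our settings (including the xapian hyperkitty index
        # backend) over the upstream settings file
        - /etc/mailman3/settings.py:/opt/mailman-web/settings.py:ro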
We'll deploy this to a new server and migrate one mailing list domain at
a time. This will allow us to start with lists.opendev.org and test
things like dmarc settings before expanding to the remaining lists.
A migration script is also included, which has seen extensive
testing on held nodes for importing copies of the production data
sets.
Change-Id: Ic9bf5cfaf0b87c100a6ce003a6645010a7b50358
In thinking harder about the bootstrap process, it struck me that the
"bastion" group we have is two separate ideas that become a bit
confusing because they share a name.
We have the testing and production paths that need to find a single
bridge node so they can run their nested Ansible. We've recently
merged changes to the setup playbooks to not hard-code the bridge node
and they now use groups["bastion"][0] to find the bastion host -- but
this group is actually orthogonal to the group of the same name
defined in inventory/service/groups.yaml.
The testing and production paths are running on the executor, and, as
mentioned, need to know the bridge node to log into. For the testing
path this is happening via the group created in the job definition
from zuul.d/system-config-run.yaml. For the production jobs, this
group is populated via the add-bastion-host role which dynamically
adds the bridge host and group.
Only the *nested* Ansible running on the bastion host reads
s-c:inventory/service/groups.yaml. None of the nested-ansible
playbooks need to target only the currently active bastion host. For
example, we can define as many bridge nodes as we like in the
inventory and run service-bridge.yaml against them. It won't matter
because the production jobs know the host that is the currently active
bridge as described above.
So, instead of using the same group name in two contexts, rename the
testing/production group "prod_bastion". groups["prod_bastion"][0]
will be the host that the testing/production jobs use as the bastion
host -- references are updated in this change (i.e. the two places
this group is defined -- the group name in the system-config-run jobs,
and add-bastion-host for production).
We then can return the "bastion" group match to bridge*.opendev.org in
inventory/service/groups.yaml.
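Roughly, the two definitions now look something like this (sketch only;
the real files differ in detail):

  # inventory/service/groups.yaml -- read only by the nested Ansible
  groups:
    bastion:
      - bridge*.opendev.org

  # add-bastion-host (production) / system-config-run jobs (testing)
  # populate the group the executor-side playbooks use
  - name: Add the currently active bastion host
    add_host:
      name: bridge01.opendev.org
      groups: prod_bastion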
This fixes a bootstrapping problem -- if you launch, say,
bridge03.opendev.org the launch node script will now apply the
base.yaml playbook against it, and correctly apply all variables from
the "bastion" group which now matches this new host. This ensures,
e.g., that the zuul user and keys are correctly populated.
The other thing we can do here is change the testing path
"prod_bastion" hostname to "bridge99.opendev.org". By doing this we
ensure we're not hard-coding for the production bridge host in any way
(since if both testing and production are called bridge01.opendev.org
we can hide problems). This is a big advantage when we want to rotate
the production bridge host, as we can be certain there's no hidden
dependencies.
Change-Id: I137ab824b9a09ccb067b8d5f0bb2896192291883
This replaces hard-coding of the host "bridge.openstack.org" with
hard-coding of the first (and only) host in the group "bastion".
The idea here is that we can, as much as possible, simply switch one
place to an alternative hostname for the bastion such as
"bridge.opendev.org" when we upgrade. This is just the testing path,
for now; a follow-on will modify the production path (which doesn't
really get speculatively tested).
This needs to be defined in two places:
1) We need to define this in the run jobs for Zuul to use in the
playbooks/zuul/run-*.yaml playbooks, as it sets up and collects
logs from the testing bastion host.
2) The nested Ansible run will then use the inventory group defined in
inventory/service/groups.yaml
Various other places are updated to use this abstracted group as the
bastion host.
Variables are moved into the bastion group (which only has one host --
the actual bastion host) which means we only have to update the group
mapping to the new host.
This is intended to be a no-op change; all the jobs should work the
same, but just using the new abstractions.
Change-Id: Iffb462371939989b03e5d6ac6c5df63aa7708513
Keeping the testing nodes at the other end of the namespace separates
them from production hosts. This one isn't really referencing itself
in testing like many others, but move it anyway.
Change-Id: I2130829a5f913f8c7ecd8b8dfd0a11da3ce245a9
Move the paste testing server to paste99 to distinguish it in testing
from the actual production paste service. Since we have certificates
setup now, we can directly test against "paste99.opendev.org",
removing the insecure flags to various calls.
Change-Id: Ifd5e270604102806736dffa86dff2bf8b23799c5
Some of our testing makes use of secure communication between testing
nodes; e.g. testing a load-balancer pass-through. Other parts
"loop-back" but require flags like "curl --insecure" because the
self-signed certificates aren't trusted.
To make testing more realistic, create a CA that is distributed and
trusted by all testing nodes early in the Zuul playbook. This then
allows us to sign local certificates created by the letsencrypt
playbooks with this trusted CA and have realistic peer-to-peer secure
communications.
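The mechanics are simple; a minimal sketch, assuming the usual
Debian/Ubuntu ca-certificates layout (the actual role and file names
differ):

  - name: Install the testing CA on every node
    copy:
      src: opendev-ca.crt            # CA created early in the Zuul run
      dest: /usr/local/share/ca-certificates/opendev-ca.crt
      mode: '0644'

  - name: Trust the testing CA
    command: update-ca-certificates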
The other thing this does is rework the letsencrypt self-signed cert
path to correctly set up SAN records for the host. This also improves
the "realism" of our testing environment. This is so realistic that
it requires fixing the gitea playbook :). The Apache service proxying
gitea currently has to be overridden in testing to point at "localhost",
because that is all the old certificate covered; we can now just proxy
to the hostname directly in both testing and production.
Change-Id: I3d49a7b683462a076263127018ec6a0f16735c94
We previously auto updated nodepool builders but not launchers when new
container images were present. This created confusion over what versions
of nodepool opendev is running. Use the same behavior for both services
now and auto restart them both.
There is a small chance that we can pull in an update that breaks things
so we run serially to avoid the most egregious instances of this
scenario.
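Conceptually the deployment play becomes something like (sketch only;
the group name and paths are illustrative):

  - hosts: nodepool-launcher
    serial: 1                        # restart one launcher at a time
    tasks:
      - name: Pull new images and restart the launcher
        shell: docker-compose pull && docker-compose up -d
        args:
          chdir: /etc/nodepool-launcher    # compose directory, illustrative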
Change-Id: Ifc3ca375553527f9a72e4bb1bdb617523a3f269e
This section runs as root, but the system-config repo is cloned as the
zuul user. This causes a problem for tox: when it installs, it calls
out to git, which no longer operates in the directory [1].
[1] 8959555cee
Change-Id: I5c67208c025d29435dcc40c5eeb3b3aa8e5c4d5d
Change I5b9f9dd53eb896bb542652e8175c570877842584 introduced this tee
to capture and encrypt the logs. However, we should make sure to fail
if the ansible runs fail. Switch on pipefail, which will exit with an
error if the earlier parts of the pipeline fail. Also make sure we
run under bash.
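i.e. the wrapper now effectively does something like (illustrative
sketch; the playbook and log paths are made up):

  - name: Run the production playbook and capture the log
    shell: |
      set -eo pipefail
      ansible-playbook -v playbooks/service-example.yaml 2>&1 \
        | tee /var/log/ansible/service-example.log
    args:
      executable: /bin/bash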
Change-Id: I2c4cb9aec3d4f8bb5bb93e2d2c20168dc64e78cb
Our production jobs currently only put their logging locally on the
bastion host. This means that to help maintain a production system,
you effectively need full access to the bastion host to debug any
misbehaviour.
We've long discussed publishing these Ansible runs as public logs, or
via a reporting system (ARA, etc.) but, despite our best efforts at
no_log and similar, we can not be 100% sure that secret values will not
leak.
This is the infrastructure for an in-between solution, where we
publish the production run logs encrypted to specific GPG public keys.
Here we are capturing and encrypting the logs of the
system-config-run-* jobs, and providing a small download script to
automatically grab and unencrypt the log files. Obviously this is
just to exercise the encryption/log-download path for these jobs, as
the logs are public.
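The encryption step itself is conceptually just a gpg call per build,
something like (sketch; the variable names here are illustrative, not
the real role variables):

  - name: Encrypt the captured log for the configured keys
    command: >
      gpg --homedir {{ gpg_tmp_dir }} --trust-model always --encrypt
      {% for key in log_reader_keys %} --recipient {{ key }}{% endfor %}
      --output {{ log_file }}.gpg {{ log_file }}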
Once this has landed, I will propose similar for the production jobs
(because these are post-pipeline jobs this takes a bit more fiddling and
doesn't run in CI). The variables will be set up in such a way that if
someone wishes to help maintain a production system, they can add
their public-key and then add themselves to the particular
infra-prod-* job they wish to view the logs for.
It is planned that the extant operators will be in the default list;
however this is still useful over the status quo -- instead of having
to search through the log history on the bastion host when debugging a
failed run, they can simply view the logs from the failing build in
Zuul directly.
Depends-On: https://review.opendev.org/c/zuul/zuul-jobs/+/828818/
Change-Id: I5b9f9dd53eb896bb542652e8175c570877842584
This used to be called "bridge", but was then renamed with
Ia7c8dd0e32b2c4aaa674061037be5ab66d9a3581 to install-ansible to be
clearer.
It is true that this is installing Ansible, but as part of our
reworking for parallel jobs this is also the synchronisation point
where we should be deploying the system-config code to run for the
buildset.
Thus naming this "bootstrap-bridge" should hopefully be clearer again
about what's going on.
I've added a note to the job calling out its difference from the
infra-prod-service-bridge job to hopefully also avoid some of the
initial confusion.
Change-Id: I4db1c883f237de5986edb4dc4c64860390cc8e22
This adds a keycloak server so we can start experimenting with it.
It's based on the docker-compose file Matthieu made for Zuul
(see https://review.opendev.org/819745 )
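For reference, the shape of that compose file is roughly as below (an
illustrative sketch; the image, tag, environment variable names and
paths depend on the keycloak version actually deployed):

  services:
    keycloak:
      image: quay.io/keycloak/keycloak    # image/tag illustrative
      network_mode: host
      environment:
        # admin bootstrap credentials; exact variable names differ
        # between keycloak image versions
        KEYCLOAK_ADMIN: admin
        KEYCLOAK_ADMIN_PASSWORD: changeme
      volumes:
        # keep the embedded H2 database on the host (path illustrative)
        - /var/keycloak/data:/opt/keycloak/data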
We should be able to configure a realm and federate with openstackid
and other providers as described in the opendev auth spec. However,
I am unable to test federation with openstackid due to its inability to
configure an oauth app at "localhost". Therefore, we will need an
actual deployed system to test it. This should allow us to do so.
It will also allow us to connect realms to the newly available
Zuul admin api on opendev.
It should be possible to configure the realm the way we want, then
export its configuration into a JSON file and then have our playbooks
or the docker-compose file import it. That would allow us to drive
change to the configuration of the system through code review. Because
of the above limitation with openstackid, I think we should regard the
current implementation as experimental. Once we have a realm
configuration that we like (which we will create using the GUI), we
can choose to either continue to maintain the config with the GUI and
appropriate file backups, or switch to a gitops model based on an
export.
My understanding is that all the data (realm configuration and sessions)
are kept in an H2 database. This is probably sufficient for now and even
production use with Zuul, but we should probably switch to mariadb before
any heavy (e.g. gerrit, etc.) production use.
This is a partial implementation of https://docs.opendev.org/opendev/infra-specs/latest/specs/central-auth.html
We can re-deploy with a new domain when it exists.
Change-Id: I2e069b1b220dbd3e0a5754ac094c2b296c141753
Co-Authored-By: Matthieu Huin <mhuin@redhat.com>
Previously we had set up the test gerrit instance to use the same
hostname as production: review02.opendev.org. This causes some confusion
as we have to override settings specifically for testing like a reduced
heap size, but then also copy settings from the prod host vars as we
override the host vars entirely. Using a new hostname allows us to use a
different set of host vars with unique values reducing confusion.
Change-Id: I4b95bbe1bde29228164a66f2d3b648062423e294
Previously we had a test specific group vars file for the review Ansible
group. This provided junk secrets to our test installations of Gerrit;
we then relied on the review02.opendev.org production host vars file to
set values that are public.
Unfortunately, this meant we were using the production heapLimit value,
which is far too large for our test instances, leading to the occasional
failure:
  There is insufficient memory for the Java Runtime Environment to continue.
  Native memory allocation (mmap) failed to map 9596567552 bytes for committing reserved memory.
We cannot set the heapLimit in the group var file because the hostvar
file overrides those values. To fix this we need to replace the test
specific group var contents with a test specific host var file instead.
To avoid repeating ourselves we also create a new review.yaml group_vars
file to capture common settings between testing and prod. Note we should
look at combining this new file with the gerrit.yaml group_vars.
On the testing side of things we set the heapLimit to 6GB, we change the
serverid value to prevent any unexpected notedb confusion, and we remove
replication config.
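The test host vars end up looking something like (a sketch -- the key
names here are illustrative, not necessarily the exact variables):

  # host_vars for the test-only review host
  gerrit_heap_limit: 6g                 # small enough for a CI node
  gerrit_serverid: 11111111-2222-3333-4444-555555555555   # test-only id
  gerrit_replication: []                # no replication in testing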
Change-Id: Id8ec5cae967cc38acf79ecf18d3a0faac3a9c4b3
The paste service needs an upgrade; since others have created a
lodgeit container it seems worth us keeping the service going if only
to maintain the historical corpus of pastes.
This adds the ansible to deploy lodgeit and a sibling mariadb
container. I have imported a dump of the old data as a test. The
dump is ~4gb and, once imported, it takes up about double that; certainly
nothing we need to be too concerned over. The server will be more
than capable of running the db container alongside the lodgeit
instance.
This should have no effect on production until we decide to switch
DNS.
Change-Id: I284864217aa49d664ddc3ebdc800383b2d7e00e3
This converts our existing puppeted mailman configuration into a set of
ansible roles and a new playbook. We don't try to do anything new and
instead do our best to map from puppet to ansible as closely as
possible. This helps reduce churn and will help us find problems more
quickly if they happen.
Followups will further cleanup the puppetry.
Change-Id: If8cdb1164c9000438d1977d8965a92ca8eebe4df
When we cleaned up the puppet in
I6b6dfd0f8ef89a5362f64cfbc8016ba5b1a346b3 we renamed the group
s/refstack-docker/refstack/ but didn't move the variables and some
other references along with it.
Change-Id: Ib07d1e9ede628c43b4d5d94b64ec35c101e11be8
This adds a role and related testing to manage our Kerberos KDC
servers, intended to replace the puppet modules currently performing
this task.
This role automates realm creation, initial setup, key material
distribution and replica host configuration. None of this is intended
to run on the production servers, which are already set up with an
active database, and the role should be effectively idempotent in
production.
Note that this does not yet switch the production servers into the new
groups; this can be done in a separate step under controlled
conditions and with related upgrades of the host OS to Focal.
Change-Id: I60b40897486b29beafc76025790c501b5055313d
We have seen some poor performance from gitea which may be related to
manage project updates. Start a dstat service which logs to a csv file
on our system-config-run job hosts in order to collect performance info
from our services in pre merge testing. This will include gitea and
should help us evaluate service upgrades and other changes from a
performance perspective before they hit production.
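The service itself is little more than dstat writing csv output,
roughly (sketch; the real role wraps this in a proper service unit and
the output path is illustrative):

  - name: Start dstat logging to csv
    shell: >
      nohup dstat -tcmndrylpg
      --output /var/log/dstat-csv.log
      > /dev/null 2>&1 &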
Change-Id: I7bdaab0a0aeb9e1c00fcfcca3d114ae13a76ccc9
All hosts are now running their backups via borg to servers in
vexxhost and rax.ord.
For reference, the servers being backed up at this time are:
borg-ask01
borg-ethercalc02
borg-etherpad01
borg-gitea01
borg-lists
borg-review-dev01
borg-review01
borg-storyboard01
borg-translate01
borg-wiki-update-test
borg-zuul01
This removes the old bup backup hosts, the no-longer used ansible
roles for the bup backup server and client roles, and any remaining
bup related configuration.
For simplicity, we will remove any remaining bup cron jobs on the
above servers manually after this merges.
Change-Id: I32554ca857a81ae8a250ce082421a7ede460ea3c
It is buggy (throwing exceptions for undefined variables which are
actually defined via set_fact), and we frequently run into problems
using it in this repo. It was designed to lint roles for Galaxy,
not the way we write ansible. As of the 5.0.0 release it's
generating >4.5K lines of complaints about files in this repository.
Change-Id: If9d8c19b5e663bdd6b6f35ffed88db3cff3d79f8
This adds a dockerfile to build an opendevorg/refstack image as well as
the jobs to build and publish it.
Change-Id: Icade6c713fa9bf6ab508fd4d8d65debada2ddb30
This runs selenium from a container on a node, and exposes port 4444
so you can issue commands to it. This is used in the follow-on
I56cda99790d3c172e10b664e57abeca10efc5566 to take some screenshots of
gerrit.
Change-Id: Idcbcd9a8f33bd86b5f3e546dd563792212e0751b
We don't need to test two servers in this test; remove review-dev.
Consensus seems to be this was for testing plans that have now been
superseded.
Change-Id: Ia4db5e0748e1c82838000c9b655808c3d8b74461
The hound project has undergone a small re-birth and moved to
https://github.com/hound-search/hound
which has broken our deployment. We've talked about leaving
codesearch up to gitea, but it's not quite there yet. There seems to
be no point working on the puppet now.
This builds a container that runs houndd. It's an opendev specific
container; the config is pulled from project-config directly.
There's some custom scripts that drive things. Some points for
reviewers:
- update-hound-config.sh uses "create-hound-config" (which is in
jeepyb for historical reasons) to generate the config file. It
grabs the latest projects.yaml from project-config and exits with a
return code to indicate if things changed.
- when the container starts, it runs update-hound-config.sh to
populate the initial config. There is a testing environment flag
and small config so it doesn't have to clone the entire opendev for
functional testing.
- it runs under supervisord so we can restart the daemon when
projects are updated. Unlike earlier versions that didn't start
listening till indexing was done, this version now puts up a "Hound
is not ready yet" message while it is working; so we can drop
all the magic we were doing to probe if hound is listening via
netstat and making Apache redirect to a status page.
- resync-hound.sh is run from an external cron job daily, and does
this update and restart check. Since it only reloads if changes
are made, this should be relatively rare anyway.
- There is a PR to monitor the config file
(https://github.com/hound-search/hound/pull/357) which would mean
the restart is unnecessary. This would be good in the near future and
we could then remove the cron job.
- playbooks/roles/codesearch is unexciting and deploys the container,
certificates and an apache proxy back to localhost:6080 where hound
is listening.
I've combined removal of the old puppet bits here as the "-codesearch"
namespace was already being used.
Change-Id: I8c773b5ea6b87e8f7dfd8db2556626f7b2500473
Specifying the family stops a deprecation warning being output.
Add an HTML report and report it as an artifact as well; this is easier
to read.
Change-Id: I2bd6505c19cee2d51e9af27e9344cfe2e1110572
Builds running on the new container-based executors started failing to
connect to remote hosts with
Load key "/root/.ssh/id_rsa": invalid format
It turns out the new executor is writing keys in OpenSSH format,
rather than the older PEM format. And it seems that the OpenSSH
format is more picky about having a trailing space after the
-----END OPENSSH PRIVATE KEY-----
bit of the id_rsa file. By default, the file lookup runs an rstrip on
the incoming file to remove the trailing space. Turn that off so we
generate a valid key.
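In the playbook this means passing rstrip=False to the file lookup,
e.g. (paths and variable names illustrative):

  - name: Write the ssh private key without stripping the trailing newline
    copy:
      content: "{{ lookup('file', executor_private_key_file, rstrip=False) }}"
      dest: /root/.ssh/id_rsa
      mode: '0600'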
Change-Id: I49bb255f359bd595e1b88eda890d04cb18205b6e
This uses the Grafana container created with
Iddfafe852166fe95b3e433420e2e2a4a6380fc64 to run the
grafana.opendev.org service.
We retain the old model of an Apache reverse-proxy; it's well tested
and understood, it's much easier than trying to map all the SSL
termination/renewal/etc. into the Grafana container and we don't have
to convince ourselves the container is safe to be directly web-facing.
Otherwise this is a fairly straight forward deployment of the
container. As before, it uses the graph configuration kept in
project-config which is loaded in with grafyaml, which is included in
the container.
One nice advantage is that it makes it quite easy to develop graphs
locally, using the container which can talk to the public graphite
instance. The documentation has been updated with a reference on how
to do this.
Change-Id: I0cc76d29b6911aecfebc71e5fdfe7cf4fcd071a4
We use ansible's to_nice_yaml output filter when writing ansible
datastructures to yaml. This has a default indent of 4, but we humans
usually write yaml with an indent of 2. Make the generated yaml more
similar to what we humans write and set the indent to 2.
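i.e. wherever we write generated YAML we now pass the indent
explicitly, along the lines of (names illustrative):

  - name: Write out generated YAML with human-style indentation
    copy:
      content: "{{ some_data | to_nice_yaml(indent=2) }}"
      dest: /etc/example/generated.yaml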
Change-Id: I3dc41b54e1b6480d7085261bc37c419009ef5ba7
Tests that call host.backend.get_hostname() to switch on test
assertions are likely to fail open. Stop using this in zuul tests
and instead add new files for each of the types of zuul hosts
where we want to do additional verification.
Share the iptables related code between all the tests that perform
iptables checks.
Also, some extra merger test and some negative assertions are added.
Move multi-node-hosts-file to after set-hostname. multi-node-hosts-file
is designed to append, and set-hostname is designed to write.
When we write the gate version of the inventory, map the nodepool
private_ipv4 address as the public_v4 address of the inventory host
since that's what is written to /etc/hosts, and is therefore, in the
context of a gate job, the "public" address.
Change-Id: Id2dad08176865169272a8c135d232c2b58a7a2c1
Make inventory/service for service-specific things, including the
groups.yaml group definitions, and inventory/base for hostvars
related to the base system, including the list of hosts.
Move the existing host_vars into inventory/service, since most of
them are likely service-specific. Move group_vars/all.yaml into
base/group_vars as almost all of it is related to base things,
with the exception of the gerrit public key.
A followup patch will move host-specific values into equivalent
files in inventory/base.
This should let us override hostvars in gate jobs. It should also
allow us to do better file matchers - and to be able to reorganize
our playbooks if we want to.
Depends-On: https://review.opendev.org/731583
Change-Id: Iddf57b5be47c2e9de16b83a1bc83bee25db995cf
The jitsi video bridge (jvb) appears to be the main component we'll need
to scale up to handle more users on meetpad. Start preliminary
ansiblification of scale out jvb hosts.
Note this requires each new jvb to run on a separate host as the jvb
docker images seem to rely on $HOSTNAME to uniquely identify each jvb.
Change-Id: If6d055b6ec163d4a9d912bee9a9912f5a7b58125
This adds a new variable for the iptables role that allows us to
indicate all members of an ansible inventory group should have
iptables rules added.
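Usage looks roughly like the following (the variable and group names
here are illustrative; see the iptables role defaults for the real
ones):

  # allow all members of these inventory groups through iptables
  iptables_allowed_groups:
    - zuul-executor
    - zuul-merger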
It also removes the unused zuul-executor-opendev group, and some
unused variables related to the snmp rule.
Also, collect the generated iptables rules for debugging.
Change-Id: I48746a6527848a45a4debf62fd833527cc392398
Depends-On: https://review.opendev.org/728952
Create a zuul_data fixture for testinfra.
The fixture directly loads the inventory from the inventory YAML file
written out. This lets you get easy access to the IP addresses of the
hosts.
We pass in the "zuul" variable by writing it out to a YAML file on
disk, and then passing an environment variable to this. This is
useful for things like determining which job is running. Additional
arbitrary data could be added to this if required.
Change-Id: I8adb7601f7eec6d48509f8f1a42840beca70120c
Rather than running a local zookeeper, just run a real zookeeper.
Also, get rid of nb01-test and just use nb04 - what could possibly
go wrong?
Dynamically write zookeeper host information to nodepool.yaml
So that we can run an actual zk using the new zk role on hosts in
ansible inventory, we need to write out the ip addresses of the
hosts that we build in zuul. This means having the info baked in
to the file in project-config isn't going to work.
We can do this in prod too, it shouldn't hurt anything.
Increase timeout for run-service-nodepool
We need to fix the playbook, but we'll do that after we get the
puppet gone.
Change-Id: Ib01d461ae2c5cec3c31ec5105a41b1a99ff9d84a
This sets up a robots.txt on our lists servers. To start this file
prevents SEMrush bot from indexing our lists as that has been causing
lists.openstack.org to OOM with many listinfo processes started by
Apache.
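Conceptually the served file is trivial (a sketch only -- in our case
the actual change is carried through the existing puppetry/hieradata,
and the path here is illustrative):

  - name: Install robots.txt on the lists vhost
    copy:
      dest: /srv/mailman/robots.txt
      content: |
        User-agent: SemrushBot
        Disallow: /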
We've avoided this OOM by manually configuring this robots.txt. Other
things we have ruled out are bup and incoming email causing qrunners to
grow unexpectedly large. We are fairly confident this bot is the trigger.
Note this fixes testing by adding 'hieradata' to set listpassword var.
Depends-On: https://review.opendev.org/724389
Change-Id: Id4f6739a8cf6a01f9796fa54c86ba1af3e31fecf
As we add jobs that have more nodes in them, we need to make
sure we're running ansible with enough forks that the jobs
don't take forever.
Change-Id: I2b5bf55bd65eaf0fc2671f5379bd0cb5c3696f87