This moves review02 out of the review-staging group and into the main
review group. At this point, review01.openstack.org is inactive so we
can remove all references to openstack.org from the groups. We update
the system-config job to run against a focal production server, and
remove the unneeded rsync setup used to move data.
This additionally enables replication; it should be a no-op when
applied, as part of the transition process is to manually apply this
first so that the DNS setup can pull zone changes from opendev.org.
It also switches to the mysql connector; as noted inline, we found
some issues with mariadb.
Note backups follow in a separate step to avoid doing too much at
once, hence dropping the backup group from the testing list.
Change-Id: I7ee3e3051ea8f3237fd5f6bf1dcc3e5996c16d10
The paste service needs an upgrade; since others have created a
lodgeit container it seems worth us keeping the service going if only
to maintain the historical corpus of pastes.
This adds the ansible to deploy lodgeit and a sibling mariadb
container. I have imported a dump of the old data as a test. The
dump is ~4GB and, once imported, it takes up about double that;
certainly nothing we need to be too concerned about. The server will
be more
than capable of running the db container alongside the lodgeit
instance.
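For illustration, the deployment has roughly the following
docker-compose shape (image names, paths and ports here are
illustrative, not the exact production values):

  services:
    mariadb:
      image: mariadb:10.4
      volumes:
        - /var/lib/lodgeit-db:/var/lib/mysql
      environment:
        MYSQL_DATABASE: lodgeit
    lodgeit:
      image: opendevorg/lodgeit
      depends_on:
        - mariadb
      ports:
        - "127.0.0.1:5000:5000"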
This should have no effect on production until we decide to switch
DNS.
Change-Id: I284864217aa49d664ddc3ebdc800383b2d7e00e3
This adds a local mariadb container to the gerrit host to hold the
accountPatchReviewDb database. This is inspired by a few things
- since migration to NoteDB, there is only one table left where
Gerrit records what files have been reviewed for a change. This
logically scales with the number of reviews users are doing.
Pulling the stats on this, we can see since the NoteDB upgrade this
went from a very busy database (~300 queries/70 commits per second)
to barely registering one hit per second:
https://imgur.com/a/QGJV7Fw
Thus separating the db to an external host for performance reasons
is not a large concern any more.
- empirically we've done a bad job in keeping the existing hosted db
up-to-date; it's still running mysql 5.1 and we have been hit by
bugs such as the one referenced in-line which silently drops
backups.
- The other gerrit option is to use an on-disk H2 database. This is
certainly an option, however you need special tools to interact
with it for migration, etc. and it's not safe to back up from files
on disk (as opposed to mysqldump). Upstream advice is unclear, and
varies between H2 being a performance bottleneck to this being
ephemeral data that users don't care about. We know how to admin
mariadb/mysql and this allows us to migrate and backup data, so
seems like the best choice.
- we have a pressing need to update the server to a new operating
system. Running the db alongside the gerrit instance minimises
fiddling we have to do managing connections to and migrating the
hosted db systems.
- related to that, we are tending towards more provider independence
for control-plane servers. A hosted database product is not always
provided, so this gives us more flexibility in moving things
around.
- the main concern here is memory usage. "docker stats" reports a
quiescent container, freshly started on an 8GB host:
gerrit-compose_mariadb_1 67.32MiB
After loading a copy of the production table, and then dumping it
back to a file, the same container reports:
gerrit-compose_mariadb_1 462.6MiB
The existing remote mysql configuration path remains mostly the same.
We move the gerrit startup into a script rather than a CMD so we can
call it after a "wait for db" script in the mariadb_container case
(this is the recommended way to enforce ordering [1]).
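A rough sketch of that ordering approach (script and service names are
illustrative): the gerrit entrypoint only starts gerrit once the
database is reachable, e.g.

  services:
    mariadb:
      image: mariadb:10.4
    gerrit:
      depends_on:
        - mariadb
      # a wait-for-it style script blocks until the port is open,
      # then execs the real gerrit start script
      entrypoint: /wait-for-it.sh mariadb:3306 -- /run-gerrit.sh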
Backups of the local container need different dump commands; backups
are relocated to a new file and updated.
Testing is converted to use this rather than a local H2 database.
[1] https://docs.docker.com/compose/startup-order/
Change-Id: Iec981ef3c2e38889f91e9759e66295dbfb499c2e
Currently when we run tests, this connects to OFTC and tries to use
the opendevstatus nick as it is the default. Replace this with a
random username. Also override the channels list, so it only joins
the sandbox channel.
Limnoria was already using a non-conflicting name, but we switch it to a
random one for consistency and possible parallel running. This also
already only joins #opendev-sandbox.
Change-Id: I860b0f1ed4f99140dda0f4d41025f0b5fb844115
This installs statusbot on eavesdrop01.opendev.org.
Otherwise it's just config translation and bringing up the daemon.
Change-Id: I246b2723372594e65bcd1ba90215d6831d4c0c72
This enables the new eavesdrop01.opendev.org server in all current
channels. Puppet has been disabled on the old server and we will
manually stop supybot/meetbot and migrate logs before this applies.
Change-Id: I4a422bb9589c8a8761191313a656f8377e93422f
The ara-report role used to add this but it hasn't been updated for
the latest ARA (I008b35562994f1205a4f66e53f93b9885a6b8754). Add it
back here.
Change-Id: I2d56e7cde32cd7adabb359a35ecdaa9f0880f7d5
ARA's master branch now has static site generation, so we can move
away from the stable branch and get the new reports.
In the meantime ARA upstream has moved to github, so this updates the
references for the -devel job.
Depends-On: https://review.opendev.org/c/openstack/project-config/+/793530
Change-Id: I008b35562994f1205a4f66e53f93b9885a6b8754
We're trying to phase out the ELK systems. While we have agreed to not
immediately turn anything off, we probably don't need to keep running the
system-config-legacy-logstash-filters job as ELK should remain fairly
fixed unless someone rewrites config management for it and modernizes
it. And if that happens they will want new modern testing too.
Depends-On: https://review.opendev.org/c/openstack/project-config/+/792710
Change-Id: I9ac6f12ec3245e3c1be0471d5ed17caec976334f
Gerrit's bazel rules are looking for python which doesn't exist on our
images. Add a python symlink to python3 until
https://gerrit-review.googlesource.com/c/gerrit/+/298903 is in a release,
which seems likely to be 3.5.
Change-Id: I1c15cceac1c9bbf435ed23bed7c1e3fe868f05ff
This converts our existing puppeted mailman configuration into a set of
ansible roles and a new playbook. We don't try to do anything new and
instead do our best to map from puppet to ansible as closely as
possible. This helps reduce churn and will help us find problems more
quickly if they happen.
Followups will further cleanup the puppetry.
Change-Id: If8cdb1164c9000438d1977d8965a92ca8eebe4df
This adds the new inmotion cloud to clouds.yaml files and the cloud
launcher config. This cloud is running on an OpenStack-as-a-service
platform, so we have quite a bit of freedom to make changes here within
the resource limitations if necessary.
Change-Id: I2aed6dffde4a1d6e3044c4bd8df4ca60065ae1ea
Otherwise you get
BadRequest: Expecting to find domain in project - the server could
not comply with the request since it is either malformed or otherwise
incorrect. The client is assumed to be in error.
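The fix is to spell out the domain scoping in clouds.yaml; a minimal
sketch with illustrative values:

  clouds:
    examplecloud:
      auth:
        auth_url: https://keystone.example.com:5000/v3
        username: openstackci
        password: secret   # placeholder
        project_name: ci
        user_domain_name: Default
        project_domain_name: Default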
Change-Id: If8869fe888c9f1e9c0a487405574d59dd3001b65
This matches the proposal in https://review.opendev.org/785972
It's safe to merge now (secret storage on bridge is updated) and get
ahead of the curve. It's harmless to add unused items.
Change-Id: I942ef5f95f9f1afe39b7d9a044276bfb338d6760
The Oregon State University Open Source Lab (OSUOSL;
https://osuosl.org/) has kindly donated some ARM64 resources. Add
initial cloud config.
Change-Id: I43ed7f0cb0b193db52d9908e39c04e351b3887e3
The OpenEdge cloud has been offline for five months, initially
disabled in I4e46c782a63279d9c18ff4ba2944c15b3027114b, so go ahead
and clean up lingering references. If it is restored later, this can
be reverted fairly easily.
Depends-On: https://review.opendev.org/783989
Depends-On: https://review.opendev.org/783990
Change-Id: I544895003344bc8202363993b52f978e1c07d061
With our increased ability to test in the gate, there's not much use
for review-dev any more. Remove references.
Change-Id: I97e9865e0b655cd157acf9ffa7d067b150e6fc72
When we cleaned up the puppet in
I6b6dfd0f8ef89a5362f64cfbc8016ba5b1a346b3 we renamed the group
s/refstack-docker/refstack/ but didn't move the variables and some
other references too.
Change-Id: Ib07d1e9ede628c43b4d5d94b64ec35c101e11be8
This adds a role and related testing to manage our Kerberos KDC
servers, intended to replace the puppet modules currently performing
this task.
This role automates realm creation, initial setup, key material
distribution and replica host configuration. None of this is intended
to run on the production servers, which are already set up with an
active database, and the role should be effectively idempotent in
production.
Note that this does not yet switch the production servers into the new
groups; this can be done in a separate step under controlled
conditions and with related upgrades of the host OS to Focal.
Change-Id: I60b40897486b29beafc76025790c501b5055313d
The production server is trying to send itself to
refstack01.openstack.org, causing cross-site scripting issues. In
production, use the CNAME, but use the FQDN for testing.
Fix up job file matchers while here.
Change-Id: I18a5067ee25c59c5eaa17b7c2d9bd5a942a9173d
We have seen some poor performance from gitea which may be related to
manage project updates. Start a dstat service which logs to a csv file
on our system-config-run job hosts in order to collect performance info
from our services in pre merge testing. This will include gitea and
should help us evaluate service upgrades and other changes from a
performance perspective before they hit production.
Change-Id: I7bdaab0a0aeb9e1c00fcfcca3d114ae13a76ccc9
All hosts are now running their backups via borg to servers in
vexxhost and rax.ord.
For reference, the servers being backed up at this time are:
borg-ask01
borg-ethercalc02
borg-etherpad01
borg-gitea01
borg-lists
borg-review-dev01
borg-review01
borg-storyboard01
borg-translate01
borg-wiki-update-test
borg-zuul01
This removes the old bup backup hosts, the no-longer used ansible
roles for the bup backup server and client roles, and any remaining
bup related configuration.
For simplicity, we will remove any remaining bup cron jobs on the
above servers manually after this merges.
Change-Id: I32554ca857a81ae8a250ce082421a7ede460ea3c
This change splits our existing system-config-run-review job into two
jobs, one for gerrit 3.2 and another for 3.3. The biggest change is that
we use a var called zuul_test_gerrit_version to select which version we
want and that ends up in the fake group file written out by Zuul for the
nested ansible run. The nested ansible run will then populate the
docker-compose file with the appropriate version for us.
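The mechanism looks roughly like this (the docker-compose template
details are illustrative):

  # group vars written into the fake inventory by Zuul:
  zuul_test_gerrit_version: "3.3"

  # docker-compose.yaml.j2 consumed by the nested ansible run:
  services:
    gerrit:
      image: "docker.io/opendevorg/gerrit:{{ zuul_test_gerrit_version | default('3.2') }}"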
Change-Id: I00b52c0f4aa8df3ecface964007fcf5724887e5e
It is buggy (throwing exceptions for undefined variables which are
actually defined via set_fact), and we frequently run into problems
using it in this repo. It was designed to lint roles for Galaxy,
not the way we write ansible. As of the 5.0.0 release it's
generating >4.5K lines of complaints about files in this repository.
Change-Id: If9d8c19b5e663bdd6b6f35ffed88db3cff3d79f8
This adds a dockerfile to build an opendevorg/refstack image as well as
the jobs to build and publish it.
Change-Id: Icade6c713fa9bf6ab508fd4d8d65debada2ddb30
We modify the x/ route to ensure we can serve git repos from x/.
Previously we had been using sed which is likely to be much more fragile
than patch. Patch will detect conflicts and other errors which would be
good for us to find out about early.
Change-Id: Ic324c7777e7851a6150e4415338c4628ac710970
This installs the zuul-summary-results plugin into our gerrit
container. testinfra is updated to take a screenshot of the plugin in
action.
Change-Id: Ie0a165cc6ffc765c03457691901a1dd41ce99d5a
bazel likes to build everything in ~/.cache and then symlink bazel-*
"convience symlinks" in the workspace/build directory. This causes a
problem for building docker images where we run in the context of the
build directory; docker will not follow the symlinks out of the build
directory.
Currently the bazelisk-build copies parts of the build to the
top-level; this means the bazelisk-build role is gerrit specific,
rather than generic as the name implies.
We modify the gerrit build step to break the build output symlink and move
it into the top level of the build tree, which is the context the
docker build runs in later. Since this is now just a normal
directory, we can copy from it at will.
This is useful in follow-on builds where we want to start copying more
than just the release.war file from the build tree, e.g. polygerrit
plugin output.
While we're here, remove the javamelody things that were only for 2.X
series gerrit, which we don't build any more.
[1] https://docs.bazel.build/versions/master/output_directories.html
Change-Id: I00abe437925d805bd88824d653eec38fa95e4fcd
Specify bazelisk_targets as a list, and join the targets as
space-separated in the build command. This is used in the follow-on
Ie0a165cc6ffc765c03457691901a1dd41ce99d5a.
While we are here, remove the build-gerrit.sh script that isn't used
any more, along with the step that installs it.
Also, refactor the tasks to use include_role (this is also used in the
follow on).
Change-Id: I4f3908e75cbbb7673135a2717f9e51f099a4860e
The "additional_plugins" variable is so different builds gerrit can
specify additional plugins specific to their version to install into
the base image.
Since we've moved to only building 3.2 and master images, a bunch of
plugins that used to be additional (because they weren't 2.XX era) are
now common. Move them into the common plugin code in the playbook,
and leave the only one different for master, the "checks" plugin, as
separate.
Change-Id: I8966ed7b5436fbe012486dccc1028bc8cb1cf9e4
By setting the auth type to DEVELOPMENT_BECOME_ANY_ACCOUNT and passing
--dev to the init process, gerrit will create an initial admin user
for us. We leverage this user to create a sample project, change,
Zuul user and sample CI result comment.
We also update testinfra to take some screenshots of gerrit and report
them back.
Change-Id: I56cda99790d3c172e10b664e57abeca10efc5566
This runs selenium from a container on a node, and exposes port 4444
so you can issue commands to it. This is used in the follow-on
I56cda99790d3c172e10b664e57abeca10efc5566 to take some screenshots of
gerrit.
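A minimal sketch of the compose service (image choice and shm size are
illustrative):

  services:
    selenium:
      image: selenium/standalone-firefox
      shm_size: 2g
      ports:
        - "4444:4444"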
Change-Id: Idcbcd9a8f33bd86b5f3e546dd563792212e0751b
We don't need to test two servers in this test; remove review-dev.
Consensus seems to be this was for testing plans that have now been
superseded.
Change-Id: Ia4db5e0748e1c82838000c9b655808c3d8b74461
This provides an HTML-only PolyGerrit plugin consistent with our
Gitea theming, generously provided by Paladox (many thanks!).
Since we have to split some roles in the build playbook, also name
the temporary patching role to make the build console a little
easier to read.
Change-Id: I3baf17d04b2dca34fc23dcab91c00544cedf0ca6
Gerrit 3.2 supports java 11 now and Gerrit 3.3 will be the last to
support java 8. Let's get ahead of things and switch to java 11.
Change-Id: I1b2f6b1bdadad10917ef5c56ce77f7d7cfc8625d
This should only land once we are on Gerrit 3.x and happy with it. But
at this point the mysql reviewdb will not be used anymore and config for
it can be removed. We keep general mysql things like tools and backups
in place as the accountPatchReviewDb continues to live in MySQL.
This also comments out calls to jeepyb's welcome-message,
update-blueprint and update-bug entrypoints from the patchset-created
event hook, since they rely on database connections for the moment.
Calls to update-bug in change-abandoned and change-merged event
hooks are retained as those code paths don't rely on database
interaction nor attempt to load the removed configuration.
Change-Id: I6e24dbb223fd3f76954db3dd74a03887cf2e2a8b
Gerrit seems to handle x/ for plugin extensions in polygerrit.
Unfortunately we've got projects called x/* and that breaks cloning of
these projects. Let's just avoid that for now until we can do a rename.
Change-Id: Id01739725c22af9d02ac30b1653743b49a35a332
The hound project has undergone a small re-birth and moved to
https://github.com/hound-search/hound
which has broken our deployment. We've talked about leaving
codesearch up to gitea, but it's not quite there yet. There seems to
be no point working on the puppet now.
This builds a container that runs houndd. It's an opendev specific
container; the config is pulled from project-config directly.
There are some custom scripts that drive things. Some points for
reviewers:
- update-hound-config.sh uses "create-hound-config" (which is in
jeepyb for historical reasons) to generate the config file. It
grabs the latest projects.yaml from project-config and exits with a
return code to indicate if things changed.
- when the container starts, it runs update-hound-config.sh to
populate the initial config. There is a testing environment flag
and small config so it doesn't have to clone the entire opendev for
functional testing.
- it runs under supervisord so we can restart the daemon when
projects are updated. Unlike earlier versions that didn't start
listening till indexing was done, this version now puts up a "Hound
is not ready yet" message when while it is working; so we can drop
all the magic we were doing to probe if hound is listening via
netstat and making Apache redirect to a status page.
- resync-hound.sh is run from an external cron job daily, and does
this update and restart check. Since it only reloads if changes
are made, this should be relatively rare anyway.
- There is a PR to monitor the config file
(https://github.com/hound-search/hound/pull/357) which would mean
the restart is unnecessary. This would be good in the near term, and
we could then remove the cron job.
- playbooks/roles/codesearch is unexciting and deploys the container,
certificates and an apache proxy back to localhost:6080 where hound
is listening.
I've combined removal of the old puppet bits here as the "-codesearch"
namespace was already being used.
Change-Id: I8c773b5ea6b87e8f7dfd8db2556626f7b2500473
In converting this to ansible I forgot to install the reprepro keytab.
The encoded secret has been added for production.
Change-Id: I39d586e375ad96136cc151a7aed6f4cd5365f3c7
This will allow us to test further gerrit upgrades while we sort out how
far into the gerrit releases we will go on our next upgrade.
Change-Id: Ic9d07b76e41ad4262cc0e2e1ff8a5d554f88239e
The Apache 3081 proxy allows us to do layer 7 filtering on incoming
requests. However, it was returning 502 errors because it proxies to
https://localhost and the certificate doesn't match (see
SSLProxyCheckPeerName directive). However, we can't use the full
hostname in the gate because our self-signed certificate doesn't cover
that.
Add a variable and proxy to localhost in the gate, and the full
hostname in production. This avoids us having to turn off
SSLProxyCheckPeerName.
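Roughly, as group/host vars (the variable name here is illustrative):

  # gate / system-config-run jobs:
  apache_proxy_backend: "https://localhost"

  # production:
  apache_proxy_backend: "https://{{ inventory_hostname }}"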
Change-Id: Ie12178a692f81781b848beb231f9035ececa3fd8
Collect the tox logs from the testinfra run on bridge.openstack.org.
The dependent change helps if we have errors installing things into
tox, and this change lets us see the results.
Depends-On: https://review.opendev.org/747325
Change-Id: Id3c39d4287d7dc9705890c73a230b1935d349b9f
In our beaker rspec testing we ssh into localhost pretending it is a
managed VM because that is how all the config management testing tools
want to work... This has run into problems with the new-format ssh keys
which zuul provides. If such a key is present we convert it to PEM;
otherwise we generate our own.
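The conversion itself is just ssh-keygen rewriting the key in place;
as an ansible-flavoured sketch (the key path is illustrative):

  - name: Convert OpenSSH-format private key to PEM for beaker/rspec
    command: ssh-keygen -p -P '' -N '' -m PEM -f /home/zuul/.ssh/id_rsa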
Also add ensure-virtualenv to the job as we appear to need it to run
these tests properly.
Change-Id: Ibb6080b5a321a6955866ef9b847c4d00da17f427
Change restart mode to always instead of 'no' as testing shows we won't
restart in a loop in CI and we want production to restart automatically.
Also add ssh pubkey contents for completeness and simplicity if we need
to find those in the future.
Change-Id: I81573a1ad1574419194eb3088070dda95fb81fff
This new ansible role deploys gerritbot with docker-compose on
eavesdrop.openstack.org. This way we can run it where the other bots
live.
Testing is rudimentary for now as we don't really want to connect to a
production gerrit and freenode. We check things the best we can.
We will want to coordinate deployment of this change with disabling the
running service on the gerrit server.
Depends-On: https://review.opendev.org/745240
Change-Id: I008992978791ff0a38f92fb4bc529ff643f01dd6
We need to add the host (and possibly the ssh host key, so it's here too) in
this playbook because the add_host from the base-jobs side is only
applicable to the playbook running in base-jobs. When we start our
playbook here that state is lost. Simple fix, just add_host it again.
Change-Id: Iee60d04f0232500be745a7a8ca0eac4a6202063d
We can't run ARA on the executor because that involves running
arbitrary commands; instead, generate reports on the executor and put
them where the normal fetch-output will find them later.
Change-Id: I20d88a7f03872d19f6bd014bc687a1bf16e4e80e
This uses a new base job which handles pushing the git repos on to
bridge since that must now happen in a trusted playbook.
Depends-On: https://review.opendev.org/742934
Change-Id: Ie6d0668f83af801c0c0e920b676f2f49e19c59f6
This adds roles to implement backup with borg [1].
Our current tool "bup" has no Python 3 support and is not packaged for
Ubuntu Focal. This means it is effectively end-of-life. borg fits
our model of servers backing themselves up to a central location, is
well documented and seems well supported. It also has the clarkb seal
of approval :)
As mentioned, borg works in the same manner as bup by doing an
efficient back up over ssh to a remote server. The core of these
roles are the same as the bup based ones; in terms of creating a
separate user for each host and deploying keys and ssh config.
This chooses to install borg in a virtualenv on /opt. This was chosen
for a number of reasons. Firstly, reading the history of borg there
have been incompatible updates (although they provide a tool to update
repository formats), so it seems important that we both pin the version
we are using and keep clients and server in sync. Since we have a
heterogeneous distribution collection we don't want to rely on the
packaged tools, which may differ. I don't feel like this is a great
application for a container; we actually don't want it that isolated
from the base system, because its goal is to read and copy it offsite
with as little chance of things going wrong as possible.
Borg has a lot of support for encrypting the data at rest in various
ways. However, that introduces the possibility we could lose both the
key and the backup data. Really the only thing stopping this is key
management, and if we want to go down this path we can do it as a
follow-on.
The remote end server is configured via ssh command rules to run in
append-only mode. This means a misbehaving client can't delete its
old backups. In theory we can prune backups on the server side --
something we could not do with bup. The documentation has been
updated but is vague on this part; I think we should get some hosts in
operation, see how the de-duplication is working out and then decide
how we want to manage things long term.
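The server-side restriction is done with a forced command on the
client's key; a sketch (user and path names are illustrative):

  - name: Accept client key restricted to append-only borg serve
    authorized_key:
      user: borg-exampleclient
      key: "{{ borg_client_public_key }}"
      key_options: 'command="borg serve --append-only --restrict-to-path /opt/backups/borg-exampleclient",restrict'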
Testing is added; a focal and bionic host both run a full backup of
themselves to the backup server. Pretty cool, the logs are in
/var/log/borg-backup-<host>.log.
No hosts are currently in the borg groups, so this can be applied
without affecting production. I'd suggest the next steps are to bring
up a borg-based backup server and put a few hosts into this. After
running for a while, we can add all hosts, and then deprecate the
current bup-based backup server in vexxhost and replace that with a
borg-based one; giving us dual offsite backups.
[1] https://borgbackup.readthedocs.io/en/stable/
Change-Id: I2a125f2fac11d8e3a3279eb7fa7adb33a3acaa4e
Specifying the family stops a deprecation warning being output.
Add an HTML report and report it as an artifact as well; this is easier
to read.
Change-Id: I2bd6505c19cee2d51e9af27e9344cfe2e1110572
Builds running on the new container-based executors started failing to
connect to remote hosts with
Load key "/root/.ssh/id_rsa": invalid format
It turns out the new executor is writing keys in OpenSSH format,
rather than the older PEM format. And it seems that the OpenSSH
format is more picky about having a trailing space after the
-----END OPENSSH PRIVATE KEY-----
bit of the id_rsa file. By default, the file lookup runs an rstrip on
the incoming file to remove the trailing space. Turn that off so we
generate a valid key.
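i.e. roughly (variable and path names illustrative):

  - name: Write private key with trailing newline intact
    copy:
      content: "{{ lookup('file', zuul_ssh_key_path, rstrip=False) }}"
      dest: /root/.ssh/id_rsa
      mode: "0600"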
Change-Id: I49bb255f359bd595e1b88eda890d04cb18205b6e
This uses the Grafana container created with
Iddfafe852166fe95b3e433420e2e2a4a6380fc64 to run the
grafana.opendev.org service.
We retain the old model of an Apache reverse-proxy; it's well tested
and understood, it's much easier than trying to map all the SSL
termination/renewal/etc. into the Grafana container and we don't have
to convince ourselves the container is safe to be directly web-facing.
Otherwise this is a fairly straight forward deployment of the
container. As before, it uses the graph configuration kept in
project-config which is loaded in with grafyaml, which is included in
the container.
One nice advantage is that it makes it quite easy to develop graphs
locally, using the container which can talk to the public graphite
instance. The documentation has been updated with a reference on how
to do this.
Change-Id: I0cc76d29b6911aecfebc71e5fdfe7cf4fcd071a4
This adds an option to have an Apache based reverse proxy on port 3081
forwarding to 3000. The idea is that we can use some of the Apache
filtering rules to reject certain traffic if/when required.
It is off by default, but tested in the gate.
Change-Id: Ie34772878d9fb239a5f69f2d7b993cc1f2142930
We use ansible's to_nice_yaml output filter when writing ansible
datastructures to yaml. This has a default indent of 4, but we humans
usually write yaml with an indent of 2. Make the generated yaml more
similar to what we humans write and set the indent to 2.
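i.e. templates now do roughly (variable name illustrative):

  # before
  {{ some_data | to_nice_yaml }}

  # after
  {{ some_data | to_nice_yaml(indent=2) }}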
Change-Id: I3dc41b54e1b6480d7085261bc37c419009ef5ba7
In prep-apply we're assuming virtualenv is available, which it is not. Now
that the nodes don't have it by default, this breaks. Add it.
Change-Id: I07a392f5bcbf4d5f04d8812d5c712d2fcc60747b
We can't establish Gerrit or Github connections in the gate, so
Zuul fails to start. Reducing the set of connections in the gate
to just smtp should allow it to start (albeit with tenant loading
errors). But that should let us test basic system setup and
internal connectivity.
Change-Id: I39d648ac5dd6ee3e9bfbc026cd6d7142461c418c
This exports Rackspace DNS domains to bind format for backup and
migration purposes.
This installs a small tool to query and export all the domains we can
see via the Rackspace DNS API.
Because we don't want to publish the backups (it's the equivalent of a
zone xfer) it is run on, and logs output to, bridge.openstack.org from
cron once a day.
Change-Id: I50fd33f5f3d6440a8f20d6fec63507cb883f2d56
Tests that call host.backend.get_hostname() to switch on test
assertions are likely to fail open. Stop using this in zuul tests
and instead add new files for each of the types of zuul hosts
where we want to do additional verification.
Share the iptables related code between all the tests that perform
iptables checks.
Also, some extra merger test and some negative assertions are added.
Move multi-node-hosts-file to after set-hostname. multi-node-hosts-file
is designed to append, and set-hostname is designed to write.
When we write the gate version of the inventory, map the nodepool
private_ipv4 address as the public_v4 address of the inventory host
since that's what is written to /etc/hosts, and is therefore, in the
context of a gate job, the "public" address.
Change-Id: Id2dad08176865169272a8c135d232c2b58a7a2c1
Make inventory/service for service-specific things, including the
groups.yaml group definitions, and inventory/base for hostvars
related to the base system, including the list of hosts.
Move the existing host_vars into inventory/service, since most of
them are likely service-specific. Move group_vars/all.yaml into
base/group_vars as almost all of it is related to base things,
with the exception of the gerrit public key.
A followup patch will move host-specific values into equivalent
files in inventory/base.
This should let us override hostvars in gate jobs. It should also
allow us to do better file matchers - and to be able to organize
our playbooks more if we want to.
Depends-On: https://review.opendev.org/731583
Change-Id: Iddf57b5be47c2e9de16b83a1bc83bee25db995cf
The existing test gearman cert+key combos were mismatched and therefore
invalid. This replaces them with newly generated test data, and moves
them into the test private hostvar files where the production private
data are now housed.
This removes the public production data as well; those certs are now
in the private hostvar files.
Change-Id: I6d7e12e2548f4c777854b8738c98f621bd10ad00
The jitsi video bridge (jvb) appears to be the main component we'll need
to scale up to handle more users on meetpad. Start preliminary
ansiblification of scale out jvb hosts.
Note this requires each new jvb to run on a separate host as the jvb
docker images seem to rely on $HOSTNAME to uniquely identify each jvb.
Change-Id: If6d055b6ec163d4a9d912bee9a9912f5a7b58125
This adds a new variable for the iptables role that allows us to
indicate all members of an ansible inventory group should have
iptables rules added.
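The idea, sketched with an illustrative variable name, is that a host
can list inventory groups whose members get an allow rule:

  iptables_allowed_groups:
    - zuul
    - zookeeper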
It also removes the unused zuul-executor-opendev group, and some
unused variables related to the snmp rule.
Also, collect the generated iptables rules for debugging.
Change-Id: I48746a6527848a45a4debf62fd833527cc392398
Depends-On: https://review.opendev.org/728952
This autogenerates the list of ssl domains for the ssl-cert-check tool
directly from the letsencrypt list.
The first step is the install-certcheck role that replaces the
puppet-ssl_cert_check module that does the same. The reason for this
is so that during gate testing we can test this on the test
bridge.openstack.org server, and avoid adding another node as a
requirement for this test.
letsencrypt-request-certs is updated to set a fact
letsencrypt_certcheck_domains for each host that is generating a
certificate. As described in the comments, this defaults to the first
host specified for the certificate and the listening port can be
indicated (if set, this new port value is stripped when generating
certs, as it is not necessary for certificate generation).
The new letsencrypt-config-certcheck role runs and iterates all
letsencrypt hosts to build the final list of domains that should be
checked. This is then extended with the
letsencrypt_certcheck_additional_domains value that covers any hosts
using certificates not provisioned by letsencrypt using this
mechanism.
These additional domains are pre-populated from the openstack.org
domains in the extant check file, minus those openstack.org domain
certificates we are generating via letsencrypt (see
letsencrypt-create-certs/handlers/main.yaml). Additionally, we
update some of the certificate variables in host_vars for hosts that
are listening on a port other than 443.
As mentioned, bridge.openstack.org is placed in the new certcheck
group for gate testing, so the tool and config file will be deployed
to it. For production, cacti is added to the group, which is where
the tool currently runs. The extant puppet installation is disabled,
pending removal in a follow-on change.
Change-Id: Idbe084f13f3684021e8efd9ac69b63fe31484606
Create a zuul_data fixture for testinfra.
The fixture directly loads the inventory from the inventory YAML file
written out. This lets you get easy access to the IP addresses of the
hosts.
We pass in the "zuul" variable by writing it out to a YAML file on
disk, and then passing an environment variable pointing to this. This is
useful for things like determining which job is running. Additional
arbitrary data could be added to this if required.
Change-Id: I8adb7601f7eec6d48509f8f1a42840beca70120c
The install-nodejs role in zuul-jobs has been replaced by
ensure-nodejs, so we should use the new thing if we want our tests
running again.
Change-Id: I196814b616d3b332b2c1d397097c01b5bb0d2aac
Rather than running a local zookeeper, just run a real zookeeper.
Also, get rid of nb01-test and just use nb04 - what could possibly
go wrong?
Dynamically write zookeeper host information to nodepool.yaml
So that we can run an actual zk using the new zk role on hosts in
ansible inventory, we need to write out the ip addresses of the
hosts that we build in zuul. This means having the info baked in
to the file in project-config isn't going to work.
We can do this in prod too, it shouldn't hurt anything.
Increase timeout for run-service-nodepool
We need to fix the playbook, but we'll do that after we get the
puppet gone.
Change-Id: Ib01d461ae2c5cec3c31ec5105a41b1a99ff9d84a
This sets up a robots.txt on our lists servers. To start, this file
prevents the SEMrush bot from indexing our lists, as that has been causing
lists.openstack.org to OOM with many listinfo processes started by
Apache.
We've avoided this OOM by manually configuring this robots.txt. Other
things we have ruled out are bup and incoming email causing qrunners to
grow unexpectedly large. We are fairly confident this bot is the trigger.
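The content of the file is essentially the standard crawler opt-out
(the exact directives deployed may differ):

  User-agent: SemrushBot
  Disallow: /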
Note this fixes testing by adding 'hieradata' to set listpassword var.
Depends-On: https://review.opendev.org/724389
Change-Id: Id4f6739a8cf6a01f9796fa54c86ba1af3e31fecf
As we add jobs that have more nodes in them, we need to make
sure we're running ansible with enough forks that the jobs
don't take forever.
Change-Id: I2b5bf55bd65eaf0fc2671f5379bd0cb5c3696f87
The intent of the periodic jobs is to run with latest master. If
they get enqueued, then other patches land, they'll still run with
the value of the zuul ref from when they were enqueued. That's not
what we want for prod, as it can lead to running old versions of
config.
We don't usually like doing this, but in this case, rather than
making us remember to add a flag every time a prod job gets added
to a periodic pipeline, how's about we just calculate it.
Change-Id: Ib999731fe132b1e9f197e51d74066fa75cb6c69b
We don't want to HUP all the processes in the container, we just
want zuul to reconfigure. Use the smart-reconfigure command.
Also - start the scheduler in the gate job.
Change-Id: I66754ed168165d2444930ab1110e95316f7307a7
Zuul is publishing lovely container images, so we should
go ahead and start using them.
We can't use containers for zuul-executor because of the
docker->bubblewrap->AFS issue, so install from pip there.
Don't start any of the containers by default, which should
let us safely roll this out and then do a rolling restart.
For things (like web or mergers) where it's safe to do so,
a followup change will swap the flag.
Change-Id: I37dcce3a67477ad3b2c36f2fd3657af18bc25c40
Extract eavesdrop into its own service playbook and
puppet manifest. While doing that, stop using jenkinsuser
on eavesdrop in favor of zuul-user.
Add the ability to override the keys for the zuul user.
Remove openstack_project::server, it doesn't do anything.
Containerize and ansiblize accessbot. The structure of
how we're doing it in puppet makes it hard to actually
run the puppet in the gate. Run the script in its own
playbook so that we can avoid running it in the gate.
Change-Id: I53cb63ffa4ae50575d4fa37b24323ad13ec1bac3
In launch-node, we run two playbooks that aren't part of base.
One sets the system's hostname and removes cloud-init, the other
runs unattended update.
We need to run the hostname setting in our functional tests so
that the hosts behave as expected, but running the cloud-init
removal is a little weird, since our test nodes already don't
have it.
Make it so that set-hostname actually just sets the hostname,
and then run it in run-base. For running puppet, we need the
host to have the correct hostname.
Move cloud-init removal to the base-server role. Also move
the autoremove into base-server, since it's probably a nice
way to get rid of excess things.
Change-Id: I53cb8c515444a7d73b839e799c5794b067429daa
These use legacy-base, which sucks, but what sucks even more is
that they are in openstack-zuul-jobs, which makes them extra
awkward to try to adjust.
Change-Id: I87b3d56de41f0ba5658c1240ddfc7ecf1c3c43af
Pass the ansible_host variable explicitly to mirror-workspace-git-repos
because for some reason it's confused and getting localhost.
Change-Id: I8a30b98a6eef168d11d4d580de359546ee1da252
Put this in in the last patch without a specific need to. But
then we're getting an error. Because of course we are.
Change-Id: I5c982af2e1ba09a78162b2786e31f541247fce21
The mirror-workspace-git role expects things like ansible_port to
be set, but we're not producing them in our add_host command.
Change-Id: Ib80062736e91f8d1471a42edecdebb449f073927
We use project-config for gerrit, gitea and nodepool config. That's
cool, because we can clone that from zuul too and make sure that each
prod run we're doing uses the contents of the patch in question.
Introduce a flag file that can be touched in /home/zuulcd that will
block zuul from running prod playbooks. By default, if the file is
there, zuul will wait for an hour before giving up.
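A sketch of that check (the flag file name and timeout are
illustrative of the described behaviour):

  - name: Wait for any manually-touched disable flag to be removed
    wait_for:
      path: /home/zuulcd/DISABLE-ANSIBLE
      state: absent
      timeout: 3600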
Rename zuulcd to zuul
To better align prod and test, name the zuul user zuul.
Change-Id: I83c38c9c430218059579f3763e02d6b9f40c7b89
We don't have cached repos, and our repos aren't so big
that we want to care about the git push difference.
Also - dont do delete: true like prepare-workspace does,
because deleting and then re-pushing project-config depending
on job would be costly.
Change-Id: I4c7bbc797f9f81878424b7bf2b7e83ec756de108
Instead of running from system-config, run from the zuul prepared
git repo state. We already have a mutex of one, so we'll never
be fighting. This lets us land stacks of changes and be sure they
will accurately always use the correct git state.
As a todo, we should update manage-projects to do the same with
project-config.
Change-Id: I358554e344e12561d1f3063e0724f6b61d1f15a7
So that we can start running things from the zuul source rather
than update-system-config and /opt/system-config, we need to
install a few things onto the host in install-ansible so that the
ansible env is standalone.
This introduces a split execution path. The ansible config is
now all installed globally onto the machine by install-ansible
and does not reference a git checkout.
For running ad-hoc commands, an ansible.cfg is introduced inside
the root of the system-config dir. So if ansible-playbook is
executed with PWD==/opt/system-config it will find that ansible.cfg;
that file will take precedence, and any content from system-config
will take precedence over the global install.
As a followup we'll make /opt/system-config/ansible.cfg written
out by install-ansible from the same template, and we'll update
the split to make ansible only work when executed from one of
the two configured locations, so that it's clear where we're
operating from.
Change-Id: I097694244e95751d96e67304aaae53ad19d8b873
We are writing to /var/log/ansible which needs root perms. This was not
done and the writes failed. Fix that.
Change-Id: Ibe93519f2f549e85f0e238a210999c6281f42ce6
This updates prod playbook jobs to curate a set of logs on bridge if we
aren't publishing them to zuul. This way we have history on the bastion
server.
Change-Id: I73889754155298a8554ddc17bb413ae7764b9eae
Upstream likes building the settings file into the image, but that's
less exciting, let's bind-mount ours in.
Depends-On: https://review.opendev.org/717491/
Change-Id: Ia1894d884ef2a84e1282345b77fe07bf8898f367
More importantly, put the log collection in an always
section of the block, otherwise we won't get logs if a
playbook fails, which is pretty much exactly when we
want to get logs.
Change-Id: Ia8e581e522f75a5f5945bc2143eec63b93381a94
We have a bridge.yaml and a service-bridge.yaml and it keeps
being confusing. Rename bridge.yaml to install-ansible.yaml to make
it clear what it is that it actually does.
Add a soft-depend on it for manage-projects, because if
something updates with the ansible config, we want it to
happen before running manage-projects.
Change-Id: Ia7c8dd0e32b2c4aaa674061037be5ab66d9a3581
We need to log to a file and then collect it to the log output on
zuul. Default to true so that steady-state reads nicely. When we
add new jobs we should make sure to set to false first so that we
can vet the output before publishing it.
Change-Id: Ia4f759b82a5fff6e36e4284c11281254c0d5627d
For our rollout, we need to be able to run this without actually
running the up.
Also, split out startup tasks so that we can run them from a
dedicated start playbook by themselves.
Change-Id: I08d994e496fbd8d5adbfa1ce344b0ae52f46535c
Sister change for Ia5caff34d3fafaffc459e7572a4eef6bd94422ea and
removing earlier references to the mirror server in preparation for
building and adding the new one.
Change-Id: I7d506be85326835d5e77a0c9c461f2d457b1dfd3
This adds a simple role to install Zookeeper.
Add an option to nodepool-base to use this role to install Zookeeper.
Use this in the nodepool-builder gate testing where we are just
validating that the nodepool-builder container starts and is ready to
accept connections. It needs a zookeeper to talk to, even though it
is not going to do anything.
Change-Id: I4ae89a51e454be4ee53ad4e04407162aaa8d9f9a
When testing our system-config configuration we don't actually add zuul to
the docker group. This means the zuul user cannot access the docker
socket. This then breaks docker container log collection. Address this
by becoming root when collecting logs.
Change-Id: Ic0232f7ef458cdd07fb0853f97f2dc22ce137c71
Currently we don't set a contact email with our accounts. This is an
optional feature, but would be helpful for things like [1] where we
would be notified of certificates affected by bugs, etc.
Setup the email address in the acme.sh config which will apply with
any new accounts created. To update all the existing hosts, we see if
the account email is added/modified in the config *and* if we have
existing account details; if so we need a manual update call.
For anyone who might be poking here, we also add a note on sharing an
account based on some broadly agreed upon discussion in IRC.
[1] https://community.letsencrypt.org/t/revoking-certain-certificates-on-march-4/114864
Change-Id: Ib4dc3e179010419a1b18f355d13b62c6cc4bc7e8
We need to use bazelisk to build gerrit so that we can properly
track bazel versions in the job. Use the roles developed for
gerrit-review to do that, then simplify the dockerfile to have
it simply copy the war into the target image.
Also add polymer-bridges.
Depends-On: https://review.opendev.org/709256
Change-Id: I7c13df51d3b8c117bcc9aab9caad59687471d622
This is a new cloud provided via citycloud that will add resources
capable of running Airship jobs. The goal is to use this as a stepping
stone to having Airship jobs run on our generic CI resources. This cloud
will provide both generic and larger resources to support this.
Change-Id: I63fd9023bc11f1382424c8906dc306cee5b3f58d
As a follow-on to Ie37abb4fd3eb3342b66ade52ab65024c420d7264 remove the
linaro credentials that were related to the (now removed) linaro-cn1
cloud.
Change-Id: Ia1e8dd3732164708c2e9fd82509e350829c438ba
This was missed when converting the registry server over to LE in
production. We need to test it this way too.
Change-Id: Ic2a05ebeae6991b69c000d5269165a45a0c72d38
This change switches the post bits to use a new centralized
role to collect all container logs.
Depends-On: https://review.opendev.org/701867
Change-Id: I9e982b37518c22e6d5358f7604ebc7f56b0626e3
While we're in there - fix a misspelling.
Remove auth.restTokenPrivateKey from config file. It hasn't been
used since 2.6: https://gerrit-review.googlesource.com/c/gerrit/+/70770
Change-Id: I94405cf870d57780b86f30c2bddb573ff15c05bc
NOTE: We should update storyboard-dev to be driven by
letsencrypt first, otherwise we need to plumb in the
self-signed cert, which gets weird with needing to
import it for java which in this case is in the container
image, meaning we either need to bind-mount java certs in
or build it in to the image.
Change-Id: Ida9dd15ca8262925c54579660fe9c16e2b573907
For gate testing we need the smaller AFS cache size applied to
everything that might install openafs, not just the mirror nodes.
Move the definition to the afs-client group.
Change-Id: Id27efd2f12f5ac3f351f65fa1ae513624a53df90
This is the first step in managing the opendev.org cert with LE. We
modify gitea01.opendev.org only to request the cert so that if this
breaks the other 7 giteas can continue to serve opendev.org. When we are
happy with the results we can merge the followup change to update the
other 7 giteas.
Depends-On: https://review.opendev.org/694182
Change-Id: I9587b8c2896975aa0148cc3d9b37f325a0be8970
This runs gerrit in a container on review-dev01 using podman.
Remove an unused web_server.py file that was left over from copying
things from puppet to ansible.
Change-Id: I399d3cf8471bc8063022b0db0ff81718b2ee2941
We'll use this to test the checks plugin.
We have to add jgit as a repo because it's a submodule now.
Change-Id: Ic7e9ad0265e136a9ac6b1147998f6eb5ee398180
A few things have changed and we need to fix them in one go.
Use mirror for installing docker for buildset-registry
While, we need to make this more systemic, that's hanging off of the
mirror rework. For now, since we know all of these jobs are debian
based, just set the mirror location.
Replace use of zuul cloner with git clones
You can never be a prophet in your own hometown. This is now broken
because of the git cache rework, so just replace it.
Update libjemalloc library
python:slim is based on buster now, which has libjemalloc2 not
libjemalloc1.
Remove gerrit repo remote for submodules
A recent change to the base jobs to use prepare-workspace-git
broke the gerrit image builds by actually having the origin
remote be /dev/null as intended. This breaks submodules because
for a few of them where we don't have matching stable branches
the submodule relative path behavior is actually exactly what
we want.
Since we don't care about the remote otherwise, remove the
origin remote before doing the submodule update --init so that
the submodule will clone the refs from the zuul prepared repo.
Change-Id: Ieb5b6bc8711fe971ed3445c7c267306ac4616464
An upcoming change will add JWT authentication to the registry;
prepare for that by establishing a server-side secret for use
in signing the tokens.
Change-Id: Ibaa15dd0c4b0d797f01a1886186fdc021dc990fa
Use latest bazel
It seems 0.27 is now too old. This is what happens when I go on vacation
apparently.
Add in a hack to override the bazelversion. We'll remove this once
https://gerrit-review.googlesource.com/c/gerrit/+/237495 lands and
has been merged up.
Change-Id: Ib7a6d33ce8bf8498fd5cd09b25087dc09acb8df4
Setting this to system-config allows us to run the base tests as 3rd
party ci for projects like testinfra.
Change-Id: I2d15df154dcdc7c5da6c3326fbecec2146201164
We had some extra bazel options that don't seem to be necessary
anymore now that we are using upstream bazel options appropriately.
Retry the build a couple of times if it goes south, inside of the
build image. This should allow re-use of the cache the second time,
and if there is a temporary error, it should pick up and move
forward.
Change-Id: I5f304acb21fd3a4d40701fc0414ae0c424c838e5
This introduces two new roles for managing the backup-server and hosts
that we wish to back up.
Firstly the "backup" role runs on hosts we wish to backup. This
generates and configures a separate ssh key for running bup and
installs the appropriate cron job to run the backup daily.
The "backup-server" job runs on the backup server (or, indeed
servers). It creates users for each backup host, accepts the remote
keys mentioned above and initialises bup. It is then ready to receive
backups from the remote hosts.
This eliminates a fairly long-standing requirement for manual setup of
the backup server users and keys; this section is removed from the
documentation.
testinfra coverage is added.
Change-Id: I9bf74df351e056791ed817180436617048224d2c
Our goal is upgrading to 3.0. To do that we need to upgrade to 2.15, then
to 2.16, then to 3.0. Build all of the images so that we can do that.
2.16 and 3.0 also use bazel, so just use one copy of the Dockerfile for
all three and let zuul check out the repos to the right versions.
Depends-On: https://review.opendev.org/673147
Depends-On: https://review.opendev.org/672320
Change-Id: I35bd278e0c70c871fa44d005c60a987d1d8e3cdc
Add new IP addresses to inventory for the rebuild, but don't
reactivate it in the haproxy pools yet.
Note this switches the gitea testing to use a host called gitea99 so
that it doesn't conflict with our changes to the production hosts.
Change-Id: I9779e16cca423bcf514dd3a8d9f14e91d43f1ca3
This takes a similar approach to the extant ansible_cron_install_cron
variable to disable the cron job for the cloud launcher when running
under CI.
If your CI jobs happen to be running when the cron job decides to fire,
you end up with a harmless but confusing failed run of the cloud
launcher (that has tried to contact real clouds) in the ARA results.
Use the "disbaled" flag to ensure the cron job doesn't run. Using
"disabled" means we can still check that the job was installed via
testinfra however.
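Sketch of the pattern (job details and variable name illustrative):

  - name: Install cloud launcher cron job
    cron:
      name: run-cloud-launcher
      hour: "*/4"
      job: /opt/system-config/run_cloud_launcher.sh
      # still installed (so testinfra can assert it exists) but
      # commented out in the crontab when running under CI
      disabled: "{{ cloud_launcher_disable_cron | default(false) }}"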
Convert ansible_cron_install_cron to a similar method using disabled,
document the variable in the README and add a test for the run_all.sh
script in crontab too.
Change-Id: If4911a5fa4116130c39b5a9717d610867ada7eb1
Zuul now includes an ansible_python_interpreter hostvar in every
host in its inventory. It defaults to python2. The write-inventory
role, which takes the Zuul inventory and makes an inventory for
the fake bridge server in the gate passes that through. Because it's
in /etc/ansible/inventory.yaml, it overrides any settings which may
arrive via group vars, but this is the way we set the interpreter
for all the hosts on bridge (we do not do so in the actual inventory
file).
To correct this, tell write-inventory to strip the
ansible_python_interpreter variable when it writes out the new
inventory. This restores the behavior to match what happens on
the real bridge host. One instance of setting the interpreter
for the fake "trusty" host used in base platform tests is moved to
a hostvars file to match the rest of the real hosts.
Change-Id: I60f0acb64e7b90ed8af266f21f2114fd598f4a3c
This adds a periodic job to copy logs to a mirror volume, and export
it via the usual mirror http.
I have precreated the log volume; just as a R/W volume because this is
expected to be very low volume access.
Change-Id: I67870f6d439af2d2a63a5048ef52cecff3e75275
Keytabs are slightly longer than what is being tested; up to 100 bytes
or so. This means the encoded data breaks over lines, which means you
need to be more careful about quoting.
Update the testing to a longer keytab (100 bytes of random data) and
fix up the quoting. Also enable no_log to avoid putting key
material into the logs.
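The handling ends up looking roughly like this (variable and path
names illustrative):

  - name: Install service keytab
    no_log: true
    copy:
      # the base64 value in hostvars may wrap over multiple lines,
      # so it must be quoted/folded carefully before decoding
      content: "{{ example_keytab_b64 | b64decode }}"
      dest: /etc/example.keytab
      owner: root
      mode: "0400"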
Change-Id: I73c391a2ebd2c962dc9a422f9d44265160210852
This move was prompted by wishing to expose the mirror update logs for
the rsync updates so that debugging problems does not require a root
user (note: not actually done in this change; will be a follow-on).
Rather than start hacking at puppet, the rsync mirror scripts make a
nice delineation point for starting an Ansible-first/Bionic update.
Most magic is included in the scripts, so there is not much more to do
than copy them. The host uses the existing kerberos and openafs roles
and copies the key material into place (to be added before merge).
Note the scripts are removed from the extant puppet so we don't have
two updates happening simultaneously. This will also require a manual
clean to remove the cron jobs as a once-off when merging.
The other part of mirror-update is the reprepro based scripts for the
various debuntu repositories. They are left as future work for now.
Testing is added to ensure dependencies and scripts are all in place.
Change-Id: I525ac18b55f0e11b0a541b51fa97ee5d6512bf70
Donnyd has kindly offered us access to fortnebula's test cloud. This
adds clouds.yaml entries to bridge and nodepool so that we can take
advantage of these resources.
Change-Id: I4ebc261c6f548aca0b3f37dc9b60ffac08029e67
This is an intermediate step to having both kafs and openafs testing
in the gate; this just makes it clear which host is which.
Change-Id: I8cd006227ed47ad5f2c5eec664083477dd7ba397
In a follow-on change (I9bf74df351e056791ed817180436617048224d2c) I
want to use #noqa to ignore an ansible-lint rule on a task; however
empirical testing shows that it doesn't work with 3.5.1. Upgrading to
4.1.0 it seems whatever was wrong has been fixed.
This, however, requires upgrading to 4.1.0.
I've been through the errors ... the comments inline I think justify
what has been turned off. The two legitimate variable space issues I
have rolled into this change; all other hits were false positives as
described.
Change-Id: I7752648aa2d1728749390cf4f38459c1032c0877