How we got here - I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3 split the
log collection into a post-run job so we always collect logs, even if
the main run times out. We then realised in
Ic18c89ecaf144a69e82cbe9eeed2641894af71fb that the log timestamp fact
doesn't persist across playbook runs and it's not totally clear how
getting it from hostvars interacts with dynamic inventory.
Thus take an approach that doesn't rely on passing variables; this
simply pulls the time from the stamp we put on the first line of the
log file. We then use that to rename the stored file, which should
correspond more closely with the time the Zuul job actually started.
To further remove confusion when looking at a lot of logs, reset the
timestamps to this time as well.
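A rough sketch of the approach, with illustrative task and path names
rather than the exact role implementation:

  - name: Read the timestamp stamped on the log's first line
    command: head -1 /var/log/ansible/service-foo.yaml.log
    register: first_line

  - name: Rename the stored log and reset its times to match
    shell: |
      # assume the first field of that line parses as a date touch -d accepts
      stamp=$(echo '{{ first_line.stdout }}' | awk '{print $1}')
      mv /var/log/ansible/service-foo.yaml.log \
         /var/log/ansible/service-foo.yaml.log.${stamp}
      touch -d "${stamp}" /var/log/ansible/service-foo.yaml.log.${stamp}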
Change-Id: I7a115c75286e03b09ac3b8982ff0bd01037d34dd
Update the Gerrit upgrade job to check for new on-disk h2 cache files.
We discovered well after the fact that Gerrit 3.5 added new (large)
cache files to disk that would've been good to be aware of prior to the
upgrade. This change will check for new files and produce a message if
they exist.
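A sketch of the check, assuming Gerrit's on-disk caches live under the
site's cache/ directory as <name>.h2.db files; the paths and the
"before" listing here are illustrative:

  - name: Warn about h2 cache files that appeared with the new version
    shell: |
      # /tmp/h2-caches-before.txt is assumed captured before the upgrade
      diff /tmp/h2-caches-before.txt \
           <(cd /home/gerrit2/review_site/cache && ls -1 *.h2.db | sort) \
        || echo "NOTICE: new h2 cache files found after upgrade"
    args:
      executable: /bin/bash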
Change-Id: I4b52f95dd4b23636c0360c9960d84bbed1a5b2d4
When this moved with I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3 we lost
access to the variable set as a fact; regenerate it. In a future
change we can look at strategies to share this with the start
timestamp (not totally simple as it is across playbooks on a
dynamically added host).
Change-Id: Ic18c89ecaf144a69e82cbe9eeed2641894af71fb
I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3 added this to ensure that we
always collect logs. However, since this doesn't have bridge
dynamically defined in the playbook, it doesn't run any of the steps.
On the plus side, it doesn't error either.
Change-Id: I97beecbc48c83b9dea661a61e21e0d0d29ca4733
If the production playbook times out, we don't get any logs collected
with the run. By moving the log collection into a post-run step, we
should always get something copied to help us diagnose what is going
wrong.
Change-Id: I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3
ansible_date_time is actually the cached fact time that has little
bearing on the actual time this is running [1] -- which is what you
want to see when, for example, tracing backwards to see why some runs
are randomly timing out.
[1] https://docs.ansible.com/ansible/latest/user_guide/playbooks_vars_facts.html#ansible-facts
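To illustrate the difference, a minimal example (not from the change
itself): the first value reflects when facts were gathered or cached,
while the pipe lookup is evaluated when the task actually runs.

  - debug:
      msg:
        - "cached fact:  {{ ansible_date_time.iso8601 }}"
        - "current time: {{ lookup('pipe', 'date -u +%Y-%m-%dT%H:%M:%SZ') }}"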
Change-Id: I8b5559178e29f8604edf6a42507322fc928afb21
Move the paste testing server to paste99 to distinguish it in testing
from the actual production paste service. Since we have certificates
set up now, we can directly test against "paste99.opendev.org",
removing the insecure flags to various calls.
Change-Id: Ifd5e270604102806736dffa86dff2bf8b23799c5
To make testing more like production, copy the OpenDev CA into the
haproxy container configuration directory during Zuul runs. We then
update the testing configuration to use SSL checking like production
does with this cert.
Change-Id: I1292bc1aa4948c8120dada0f0fd7dfc7ca619afd
Some of our testing makes use of secure communication between testing
nodes; e.g. testing a load-balancer pass-through. Other parts
"loop-back" but require flags like "curl --insecure" because the
self-signed certificates aren't trusted.
To make testing more realistic, create a CA that is distributed and
trusted by all testing nodes early in the Zuul playbook. This then
allows us to sign local certificates created by the letsencrypt
playbooks with this trusted CA and have realistic peer-to-peer secure
communications.
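A condensed sketch of the CA bootstrap with openssl; the paths and
subject are illustrative, and the real roles are more careful:

  - name: Create the shared test CA once
    shell: |
      mkdir -p /etc/opendev-ca
      openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
        -subj "/CN=opendev-test-ca" \
        -keyout /etc/opendev-ca/ca.key -out /etc/opendev-ca/ca.crt
    run_once: true

  - name: Trust the CA on every test node
    copy:
      src: /etc/opendev-ca/ca.crt
      dest: /usr/local/share/ca-certificates/opendev-test-ca.crt

  - name: Update the system trust store
    command: update-ca-certificates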
The other thing this does is reworks the letsencrypt self-signed cert
path to correctly setup SAN records for the host. This also improves
the "realism" of our testing environment. This is so realistic that
it requires fixing the gitea playbook :). The Apache service proxying
gitea currently has to be overridden in testing to use "localhost",
because that is all the old certificate covered; we can now just proxy
to the hostname directly for both testing and production.
Change-Id: I3d49a7b683462a076263127018ec6a0f16735c94
We have moved to a situation where we proxy requests to gitea (3000)
via Apache listening on 3081 -- this is useful for layer 7 filtering
like matching on user-agents.
It seems like we missed some of this configuration in our
load-balancer testing. Update the https forward on the load-balancer
to port 3081 on the gitea test host.
Also, remove the explicit port opening in the testing group_vars; for
some reason this was not opening port 3080 (http). This will just use
the production settings when we don't override it.
Change-Id: Ic5690ed893b909a7e6b4074a1e5cd71ab0683ab4
This adds upgrade testing from our current Gerrit version (3.5) to the
likely future version of our next upgrade (3.6).
To do so we have to refactor the gerrit testing because the 3.5 to 3.6
upgrade requires we run a command against 3.5. The previous upgrade
system assumed the old version could be left alone and jumped straight
into the upgrade, finally testing the end state. Now we have split up the
gerrit bootstrapping and gerrit testing so that normal gerrit testing
and upgrade testing can run these different tasks at different points in
the gerrit deployment process.
Now the upgrade tests use the bootstrapping playbook to create users,
projects, and changes on the old version of gerrit before running the
copy-approvals command. Then after the upgrade we run the test assertion
portion of the job.
Change-Id: Id58b27e6f717f794a8ef7a048eec7fbb3bc52af6
We previously auto-updated nodepool builders but not launchers when new
container images were present. This created confusion over what versions
of nodepool opendev is running. Use the same behavior for both services
now and auto-restart them both.
There is a small chance that we can pull in an update that breaks things,
so we run serially to avoid the most egregious instances of this
scenario.
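The shape of the deployment loop is roughly as follows; the host group
and compose paths are assumptions, not the exact playbook:

  - hosts: nodepool-launcher
    serial: 1
    tasks:
      - name: Pull new images and restart the launcher, one host at a time
        shell: docker-compose pull && docker-compose up -d
        args:
          chdir: /etc/nodepool-launcher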
Change-Id: Ifc3ca375553527f9a72e4bb1bdb617523a3f269e
This updates the gerrit configuration to deploy 3.5 in production.
For details of the upgrade process see:
https://etherpad.opendev.org/p/gerrit-upgrade-3.5
Change-Id: I50c9c444ef9f798c97e5ba3dd426cc4d1f9446c1
This section runs as root, but the system-config repo is cloned as the
zuul user. This causes a problem for tox: when it installs, it calls
out to git, which no longer operates in a directory owned by another
user [1].
[1] 8959555cee
Change-Id: I5c67208c025d29435dcc40c5eeb3b3aa8e5c4d5d
Because "." is a field separator for graphite, we're incorrectly
nesting the results.
A better idea seems to be to store these stats under the job name.
That's going to be more helpful when looking up in Zuul build results
anyway.
Follow-on to I90dfb7a25cb5ab08403c89ef59ea21972cf2aae2
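Illustratively, a stat keyed by job name stays a single graphite path
component, whereas a playbook file name like "service-zuul.yaml" splits
on its "." into an extra level:

  stats_key: "infra_prod.{{ zuul.job }}.elapsed"

(zuul.job here is illustrative; the nested run would need the job name
passed through to it.)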
Change-Id: Icbb57fd23d8b90f52bc7a0ea5fa80f389ab3892e
We used to track the runtime with the old cron-based system
(I299c0ab5dc3dea4841e560d8fb95b8f3e7df89f2) and had a dashboard view,
which was often helpful to see at a glance what might be going wrong.
Restore this for Zuul CD by simply sending the nested-Ansible task
time-delta and status to graphite. bridge.openstack.org is still
allowed to send stats to graphite from this prior work, so no ports
need to be opened.
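A rough sketch of the reporting step using the plain statsd UDP text
protocol; the key, host, port, and the job_name/elapsed_ms variables
are illustrative, not the exact implementation:

  - name: Report task runtime to graphite
    shell: >
      echo "infra_prod.{{ job_name }}.elapsed:{{ elapsed_ms }}|ms"
      | nc -u -w1 graphite.opendev.org 8125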
Change-Id: I90dfb7a25cb5ab08403c89ef59ea21972cf2aae2
As found in Ie5d55b2a2d96a78b34d23cc6fbac62900a23fc37, the default for
this is to issue "OPTIONS /" which is kind of a weird request. The
Zuul hosts currently seem to return the main page content in response
to an OPTIONS request, which probably isn't right.
Make this more robust by just using a "HEAD /" request.
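The relevant haproxy directive, shown as it might appear in a templated
config fragment (the variable name is illustrative):

  haproxy_listener_options: |
    option httpchk HEAD /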
Change-Id: Ibbd32ae744af9c33aedd087a8146195844814b3f
Apparently the check-ssl option only modifies check behavior, but
does not actually turn it on. The check option also needs to be set
in order to activate checks of the server. See §5.2 of the haproxy
docs for details:
https://git.haproxy.org/?p=haproxy-2.5.git;a=blob;f=doc/configuration.txt;h=e3949d1eebe171920c451b4cad1d5fcd07d0bfb5;hb=HEAD#l14396
Turn it on for all of our balance_zuul_https server entries.
Also set this on the gitea01 server entry in balance_git_https, so
we can make sure it's still seen as "up" once this change takes
effect. A follow-up change will turn it on for the other
balance_git_https servers out of an abundance of caution around that
service.
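Roughly the resulting server entry, with a placeholder name and
address; "check" is what enables health checking at all, while
"check-ssl" only changes how the check is performed:

  balance_zuul_https_servers: |
    server zuul01 203.0.113.10:443 check check-ssl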
Change-Id: I4018507f6e0ee1b5c30139de301e09b3ec6fc494
Switch the port 80 and 443 endpoints over to doing http checks instead
of tcp checks. This ensures that both apache and the zuul-web backend
are functional before balancing to them.
The fingergw remains a tcp check.
Change-Id: Iabe2d7822c9ef7e4514b9a0eb627f15b93ad48e2
Change I5b9f9dd53eb896bb542652e8175c570877842584 introduced this tee
to capture and encrypt the logs. However, we should make sure to fail
if the ansible runs fail. Switch on pipefail, which will exit with an
error if the earlier parts of the pipeline fail. Also make sure we
run under bash.
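The task then looks roughly like this; the playbook and log paths are
illustrative. Without pipefail, the pipeline's exit status is tee's,
so a failed ansible-playbook run would look successful:

  - name: Run the playbook and capture the log
    shell: |
      set -o pipefail
      ansible-playbook -v playbooks/service-foo.yaml 2>&1 \
        | tee /var/log/ansible/service-foo.yaml.log
    args:
      executable: /bin/bash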
Change-Id: I2c4cb9aec3d4f8bb5bb93e2d2c20168dc64e78cb
- The extra "/" in the URL makes the download fail; remove it.
- The old download python script would output the root on the first
  line, then relative URLs -- hence the loop was starting from 1.
  This should be 0 here, as we just output the raw URLs.
- Fix a typo in the build UUID output.
Change-Id: I8ff2a38b3117ddcb0d197fe39f2c168b35ab372b
I didn't consider permissions on the production machine; since we run
Ansible as root the extant path can't access the logs.
By copying the logfile to encrypt to a staging area we can leave
everything else alone for now. Upon reflection it seems like a better
idea to do this in an ephemeral location anyway and not leave anything
behind. We move the cleanup into an always block too to ensure this.
Bump the codesearch playbook to trigger the prod job with these
changes.
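The resulting structure is roughly as below, with illustrative paths:

  - block:
      - name: Create an ephemeral staging area
        tempfile:
          state: directory
        register: _staging
      - name: Copy the root-owned log somewhere we can work on it
        copy:
          src: /var/log/ansible/service-codesearch.yaml.log
          dest: "{{ _staging.path }}/playbook.log"
          remote_src: true
      # ... encrypt {{ _staging.path }}/playbook.log here ...
    always:
      - name: Clean up the staging area
        file:
          path: "{{ _staging.path }}"
          state: absent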
Change-Id: I47f63df04d58b7a87bce445da0c0bdcb80edc8f9
This fails if the variable isn't defined; because we limited
I9bd4ed0880596968000b1f153c31df849cd7fa8d to just one job to start,
the others fail with a missing definition.
Change-Id: I74b31f51494e7264e2a68f333943b143842f9a99
Based on the changes in I5b9f9dd53eb896bb542652e8175c570877842584,
enable returning encrypted log artifacts for the codesearch production
job, as an initial test.
Change-Id: I9bd4ed0880596968000b1f153c31df849cd7fa8d
Our production jobs currently only put their logging locally on the
bastion host. This means that to help maintain a production system,
you effectively need full access to the bastion host to debug any
misbehaviour.
We've long discussed publishing these Ansible runs as public logs, or
via a reporting system (ARA, etc.) but, despite our best efforts with
no_log and similar, we cannot be 100% sure that secret values will not
leak.
This is the infrastructure for an in-between solution, where we
publish the production run logs encrypted to specific GPG public keys.
Here we are capturing and encrypting the logs of the
system-config-run-* jobs, and providing a small download script to
automatically grab and decrypt the log files. Obviously this is
just to exercise the encryption/log-download path for these jobs, as
the logs are public.
Once this has landed, I will propose similar for the production jobs
(because these run in the post pipeline, this takes a bit more fiddling
and doesn't run in CI). The variables will be set up in such a way that if
someone wishes to help maintain a production system, they can add
their public-key and then add themselves to the particular
infra-prod-* job they wish to view the logs for.
It is planned that the extant operators will be in the default list;
however this is still useful over the status quo -- instead of having
to search through the log history on the bastion host when debugging a
failed run, they can simply view the logs from the failing build in
Zuul directly.
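Conceptually, the encryption step looks like the following; the
variable names are illustrative rather than the exact role
implementation:

  - name: Encrypt the captured log to each configured public key
    shell: >
      gpg --batch --trust-model always
      {% for key in log_recipient_keys %}--recipient {{ key }} {% endfor %}
      --output {{ staging_dir }}/playbook.log.gpg
      --encrypt {{ staging_dir }}/playbook.log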
Depends-On: https://review.opendev.org/c/zuul/zuul-jobs/+/828818/
Change-Id: I5b9f9dd53eb896bb542652e8175c570877842584
Previously we were only checking that Apache accepts TCP connections to
determine if Gitea is up or down on a backend. This is insufficient
because Gitea itself may be down while Apache is up. In this situation
TCP connections to Apache will function, but if we make an HTTP request
we should get back an error.
To check that both Apache and Gitea are working properly we switch to
using http checks instead. Then if Gitea is down Apache can return a 500
and the Gitea backend will be removed from the pool. Similarly, if
Apache is non-functional the check will fail to connect via TCP.
Note we don't verify ssl certs for simplicity as checking these in
testing is not straightforward. We didn't have verification with the old
tcp checks so this isn't a regression, but it does represent something
we could try to improve in the future.
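The check described above looks roughly like this, with placeholder
names and address rather than the exact production template:

  balance_git_https_backend: |
    option httpchk GET /
    server gitea01 203.0.113.20:3081 check check-ssl verify none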
Change-Id: Id47a1f9028c7575e8fbbd10fabfc9730095cb541
This reenables Gerrit upgrade testing but tests the 3.4 to 3.5 upgrade
now. Note this may need some work to get happy once we have 3.5
images, which is why we've split it out into a separate change.
Change-Id: Ibbbd3f98ac2df8d99d4ffda57df59f4a47da3cd3
The sql connection is no longer supported; we need to use "database"
instead. The corresponding hostvars change has already been made
on bridge.
Change-Id: Ibcac56568f263bd50b2be43baa26c8c514c5272b
The actual upgrade will be performed manually, but this change will be
used to update the docker-compose.yaml file.
Note that if we land this change prior to the upgrade, the
manage-projects commands will be updated to use the 3.4 image while
gerrit 3.3 is possibly still running. I don't expect this to be a
problem as manage-projects operates via network protocols.
Change-Id: I5775f4518ec48ac984b70820ebd2e645213e702a
It appears that simply setting stdin to an empty string is
insufficient to make newlist calls from Ansible correctly look like
they're coming from a non-interactive shell. As it turns out, newer
versions of the command include a -a (--automate) option which does
exactly what we want: sends list admin notifications on creation
without prompting for manual confirmation.
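For example, the creation task can now be non-interactive; the list
details here are placeholders:

  - name: Create the mailing list, notifying the admin automatically
    command: newlist -a mylist listadmin@example.org listpassword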
Drop the test-time addition of -q to quell list admin notifications,
as we now block outbound 25/tcp from nodes in our deploy tests. That
-q addition had masked a testing gap: the behavior in production was
repeatedly broken by newlist processes hanging awaiting user input,
even though we never experienced it in testing.
Change-Id: I550ea802929235d55750c4d99c7d9beec28260f0
Our deployment tests don't need to send E-mail messages. More to the
point, they may perform actions which would like to send E-mail
messages. Make sure, at the network level, they'll be prevented from
doing so. Also allow all connections to egress from the loopback
interface, so that services like mailman can connect to the Exim MTA
on localhost.
Add new rolevars for egress rules to support this, and also fix up
some missing related vars in the iptables role's documentation.
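The new rolevars look roughly like this; the variable name and exact
rules are assumptions based on the description above:

  iptables_egress_rules:
    - '-o lo -j ACCEPT'
    - '-p tcp --dport 25 -j REJECT'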
Change-Id: If4acd2d3d543933ed1e00156cc83fe3a270612bd
This adds a zuul-client config file, as well as a convenience script
to execute the docker container, to the schedulers.
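Conceptually the wrapper is just a small shim around the container;
the image name and options here are assumptions:

  - name: Install a zuul-client wrapper script
    copy:
      dest: /usr/local/bin/zuul-client
      mode: '0755'
      content: |
        #!/bin/bash
        exec docker run --rm -it zuul/zuul-client "$@"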
Change-Id: Ief167c6b7f0407f5eaebecde552e8d91eb3d4ab9
This used to be called "bridge", but was then renamed with
Ia7c8dd0e32b2c4aaa674061037be5ab66d9a3581 to install-ansible to be
clearer.
It is true that this is installing Ansible, but as part of our
reworking for parallel jobs this is also the synchronisation point
where we should be deploying the system-config code to run for the
buildset.
Thus naming this "bootstrap-bridge" should hopefully be clearer again
about what's going on.
I've added a note to the job calling out its difference from the
infra-prod-service-bridge job, to hopefully also avoid some of the
initial confusion.
Change-Id: I4db1c883f237de5986edb4dc4c64860390cc8e22
This adds a keycloak server so we can start experimenting with it.
It's based on the docker-compose file Matthieu made for Zuul
(see https://review.opendev.org/819745)
We should be able to configure a realm and federate with openstackid
and other providers as described in the opendev auth spec. However,
I am unable to test federation with openstackid due to its inability to
configure an oauth app at "localhost". Therefore, we will need an
actual deployed system to test it. This should allow us to do so.
It will also allow us to connect realms to the newly available
Zuul admin API on opendev.
It should be possible to configure the realm the way we want, then
export its configuration into a JSON file and then have our playbooks
or the docker-compose file import it. That would allow us to drive
change to the configuration of the system through code review. Because
of the above limitation with openstackid, I think we should regard the
current implementation as experimental. Once we have a realm
configuration that we like (which we will create using the GUI), we
can choose to either continue to maintain the config with the GUI and
appropriate file backups, or switch to a gitops model based on an
export.
My understanding is that all the data (realm configuration and
sessions) is kept in an H2 database. This is probably sufficient for
now and even for production use with Zuul, but we should probably
switch to mariadb before any heavy (e.g. gerrit) production use.
This is a partial implementation of https://docs.opendev.org/opendev/infra-specs/latest/specs/central-auth.html
We can re-deploy with a new domain when it exists.
Change-Id: I2e069b1b220dbd3e0a5754ac094c2b296c141753
Co-Authored-By: Matthieu Huin <mhuin@redhat.com>
This will allow us to issue internally generated auth tokens so
that we can use the zuul CLI to perform actions against the REST
API.
Change-Id: I09cafa2e820f5d0e7fa9ada00b9622de093242c7
This makes the haproxy role more generic so we can run another (or
potentially even more) haproxy instance(s) to manage other services.
The config file is moved to a variable for the haproxy role. The
gitea specific config is then installed for the gitea-lb service by a
new gitea-lb role.
statsd reporting is made optional with an argument. This
enables/disables the service in the docker compose.
Role documentation is updated.
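A consumer of the generic role then looks roughly like this; the
variable names follow the description above, but the exact values are
assumptions:

  - name: Deploy the gitea load balancer via the generic role
    include_role:
      name: haproxy
    vars:
      haproxy_config_template: gitea-lb/haproxy.cfg.j2
      haproxy_statsd_enabled: true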
Needed-By: https://review.opendev.org/678159
Change-Id: I3506ebbed9dda17d910001e71b17a865eba4225d
The current opendev-infra-prod-base job sets up the executor to log
into bridge AND copies in Zuul's checkout of system-config to
/home/zuul/src.
This presents an issue for parallel operation, as every production job
clones system-config on top of the others.
Since they all operate in the same buildset, we only need to clone
system-config from Zuul once, and then all jobs can share that repo.
This adds a new job "infra-prod-setup-src" which does this. It is a
dependency of the base job so should run first.
All other jobs now inherit from opendev-infra-prod-setup-keys, which
only sets up the executor for logging into bridge.
Change-Id: I19db98fcec5715c33b62c9c9ba5234fd55700fd8
Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/807807
The dependent change moves this into the common infra-prod-base job so
we don't have to do this in here.
Change-Id: I444d2844fe7c7560088c7ef9112893da1496ae62
Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/818189
The known_host key is written out by the parent infra-prod-base job in
the run-production-playbook.yaml step [1]. We don't need to do this
here again.
[1] 2c194e5cbf/playbooks/zuul/run-production-playbook.yaml (L1)
Change-Id: I514132b2dbc20ac321a79ca2eb6d4c8b11c4296d
This is a re-implementation of
I195ebee548071b0b89bd5bf64b251595271178ca that puts 9-stream in a
separate AFS volume.
(Note the automated volume name "mirror.centos-stream" comes just
short of the limit)
Change-Id: I483c2982a6931e7d6fc97ab82f7750b72d2ef265
Gerrit 3.4 deprecates HTML-based plugins, so the old theme doesn't
work. I have reworked this into a javascript plugin.
This should look the same, although I've achieved things in different
ways.
This doesn't register light and dark variants; since
background-primary-color is white, by setting the
header-background-color to this we get white behind the header bar,
and it correctly switches to the default black(ish) when in dark mode
(currently it seems the header doesn't obey dark mode, so this is an
improvement).
I'm not sure what's going on with the extant header-border-image,
which is a linear gradient all of the same color. I modified this down to
1px (same as default) and made it fade in-and-out of the logo colour,
just for fun.
Change-Id: Ia2e32731c1cfe97639de2ec0e7660c7ed583e045
Previously we had set up the test gerrit instance to use the same
hostname as production: review02.opendev.org. This causes some confusion
as we have to override settings specifically for testing like a reduced
heap size, but then also copy settings from the prod host vars as we
override the host vars entirely. Using a new hostname allows us to use a
different set of host vars with unique values reducing confusion.
Change-Id: I4b95bbe1bde29228164a66f2d3b648062423e294
Previously we had a test-specific group vars file for the review Ansible
group. This provided junk secrets to our test installations of Gerrit;
we then relied on the review02.opendev.org production host vars file to
set values that are public.
Unfortunately, this meant we were using the production heapLimit value,
which is far too large for our test instances, leading to the occasional
failure:
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 9596567552 bytes for committing reserved memory.
We cannot set the heapLimit in the group var file because the hostvar
file overrides those values. To fix this we need to replace the test
specific group var contents with a test specific host var file instead.
To avoid repeating ourselves we also create a new review.yaml group_vars
file to capture common settings between testing and prod. Note we should
look at combining this new file with the gerrit.yaml group_vars.
On the testing side of things we set the heapLimit to 6GB, we change the
serverid value to prevent any unexpected notedb confusion, and we remove
replication config.
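The test-only host vars then look something like this; the names and
values are illustrative:

  gerrit_heap_limit: 6g
  gerrit_serverid: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
  gerrit_replication: []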
Change-Id: Id8ec5cae967cc38acf79ecf18d3a0faac3a9c4b3
The default channel name in the ptgbot role defaults did not
correctly specify the leading hash it requires; the test jobs also
seem to need it set in the testing-specific eavesdrop group vars.
This shifts our Gerrit upgrade testing ahead to testing 3.3 to 3.4
upgrades as we have upgraded to 3.3 at this point.
Change-Id: Ibb45113dd50f294a2692c65f19f63f83c96a3c11
This bumps the gerrit image up to our 3.3 image. Followup changes will
shift upgrade testing to test 3.3 to 3.4 upgrades, clean up no longer
needed 3.2 images, and start building 3.4 images.
Change-Id: Id0f544846946d4c50737a54ceb909a0a686a594e
Currently we connect to the LE staging environment with acme.sh during
CI to get the DNS-01 tokens (but we never follow-through and actually
generate the certificate, as we have nowhere to publish the tokens).
We've known for a while that LE staging isn't really meant to be used
by CI like this, and recent instability has made the issue pronounced.
This modifies the driver script to generate fake tokens which work to
ensure all the DNS processing, etc. is happening correctly.
This is behind a flag, however, so the letsencrypt job still does the
real calls. I think it is worth that job actually calling acme.sh to
validate this path; it shouldn't be required too often.
Change-Id: I7c0b471a0661aa311aaa861fd2a0d47b07e45a72
Instead of using the opendev.org/... logo file, host a copy from
gerrit's static location and use that. This isolates us from changes
to the way gitea serves its static assets.
Change-Id: I8ffb47e636a59e5ecc3919cc7a16d93de3eae08d
Copy static files directly into the container image instead of
managing them dynamically with Ansible.
Change-Id: I0ebe40ad2a97e87b00137af7c93a3ffa84929a2e
We now depend on the reverse proxy not only for abuse mitigation but
also for serving .well-known files with specific CORS headers. To
reduce complexity and avoid traps in the future, make it non-optional.
Change-Id: I54760cb0907483eee6dd9707bfda88b205fa0fed
We create a (currently test-only) playbook that upgrades gerrit. This
job then runs through project creation, renaming, and testinfra testing
on the upgraded gerrit version.
Future improvements should consider loading state on the old gerrit
install before we upgrade that can be asserted as well.
Change-Id: I364037232cf0e6f3fa150f4dbb736ef27d1be3f8
Etherpad startup says:
[2021-08-12 16:08:55.872] [WARN] console - Declaring the sessionKey
in the settings.json is deprecated. This value is auto-generated
now. Please remove the setting from the file. -- If you are seeing
this error after restarting using the Admin User Interface then you
can ignore this message.
So I guess we can remove this.
Change-Id: I5a8da8afe8b128224fa1bc89d5ba06fff16ca29b
We are now using the mariadb jdbc connector in production and no longer
need to include the mysql legacy connector in our images. We also don't
need support for h2 or mysql as testing and prod are all using the
mariadb connector and local database.
Note this is a separate change to ensure everything is happy with the
mariadb connector before we remove the fallback mysql connector from our
images.
Change-Id: I982d3c3c026a5351bff567ce7fbb32798718ec1b
This tests that we can rename both the project and the org the project
lives in. Should just add a bit more robustness to our testing.
Change-Id: I0914e864c787b1dba175e0fabf6ab2648a554d16
Previously we were only managing root's known_hosts via ansible but even
then this wasn't happening because the gerrit_self_hostkey var wasn't
set anywhere. On top of that we need to manage multiple known_hosts
because gerrit must recognize itself and all of the gitea servers.
Update the code to take a dict of host key values and add each entry to
known_hosts for both the root and gerrit2 user.
We remove keyscans from tests to ensure that this update is actually
working.
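A sketch of the dict-driven approach; the variable name, keys, and
paths are assumptions rather than the exact role:

  - name: Add each host key for both root and gerrit2
    vars:
      gerrit_known_hosts:
        "[review.opendev.org]:29418": "ssh-rsa AAAA..."
        "gitea01.opendev.org": "ssh-ed25519 AAAA..."
    known_hosts:
      path: "{{ item.1 }}/.ssh/known_hosts"
      name: "{{ item.0.key }}"
      key: "{{ item.0.key }} {{ item.0.value }}"
    loop: "{{ gerrit_known_hosts | dict2items
              | product(['/root', '/home/gerrit2']) | list }}"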
Change-Id: If64c34322f64c1fb63bf2ebdcc04355fff6ebba2
This runs the new matrix-eavesdrop bot on the eavesdrop server.
It will write logs out to the limnoria logs directory, which is mounted
inside the container.
Change-Id: I867eec692f63099b295a37a028ee096c24109a2e
It would be useful to test our rename playbook against gitea and gerrit
when we make changes to these related playbooks, roles, and docker
images. To do this we need to converge our test and production setups
for gerrit a bit more. We create an openstack-project-creator account in
the test gerrit to match prod and we have rename_repos.yaml talk to
localhost for gerrit ssh commands.
With that done we can run the rename_repos.yaml playbook from
test-gitea.yaml and test-gerrit.yaml to help ensure the playbook
functions as expected against these services.
Co-Authored-By: Ian Wienand <iwienand@redhat.com>
Change-Id: I49ffaf86828e87705da303f40ad4a86be030c709
The extant variable name is never set, so this never writes anything
out. Move it to a dictionary value. Use stub values for testing; this
way we don't need the "when:".
Additionally remove an unused old template file.
Change-Id: Id96fde79e28f309aa13e16bdda29f004c3c69c4b