The letsencrypt_certs variable defined here in the "static" group file
is overwritten by the host variable, so it is not doing anything (and
we don't have a logs.openstack.org any more as it is all in object
storage); remove it.
Jammy nodes appear to lack the /etc/apt/sources.list.d dir by default.
Ensure it exists in the install-docker role before we attempt to
install a deb repo config to that directory for docker packages.
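A minimal sketch of such a guard task (the task name and file attributes are assumptions for illustration, not the exact role contents):

```yaml
# install-docker: make sure the directory exists before we write the
# docker repo definition into it (Jammy images may not ship it).
- name: Ensure /etc/apt/sources.list.d exists
  become: yes
  ansible.builtin.file:
    path: /etc/apt/sources.list.d
    state: directory
    owner: root
    group: root
    mode: '0755'
```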
The most recent version of the grafana-oss:latest container seems to be
a beta version with some issues, or maybe we need to adapt our
deployment. Until we do this, pin the container to the latest known
good version.
How we got here - I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3 split the
log collection into a post-run job so we always collect logs, even if
the main run times out. We then realised in
Ic18c89ecaf144a69e82cbe9eeed2641894af71fb that the log timestamp fact
doesn't persist across playbook runs and it's not totally clear how
getting it from hostvars interacts with dynamic inventory.
Thus take an approach that doesn't rely on passing variables; this
simply pulls the time from the stamp we put on the first line of the
log file. We then use that to rename the stored file, which should
correspond more closely with the time the Zuul job actually started.
To further remove confusion when looking at a lot of logs, reset the
timestamps to this time as well.
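The rename described above can be sketched roughly as follows (the file name and timestamp format are illustrative assumptions, not the real playbook paths):

```shell
# Sketch: recover the start time from the stamp we put on the first
# line of the collected log, then rename the stored copy so its name
# matches when the Zuul job actually started.
log=$(mktemp -d)/service-bridge.yaml.log
printf '2022-06-14 03:12:45 | Running infra-prod playbook\n' > "$log"

# The first two fields of line one are the date and time of the run.
stamp=$(head -1 "$log" | awk '{print $1 "_" $2}')
renamed="${log%.log}.${stamp}.log"
mv "$log" "$renamed"

# Reset the stored file's timestamp to the same moment so directory
# listings line up with the job start time.
touch -d "$(echo "$stamp" | tr '_' ' ')" "$renamed"
```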
The earlier problems identified with using mod_substitute have been
narrowed down to the new PEP 691 JSON simple API responses from
Warehouse, which are returned as a single line of data. The
currently largest known project index response we've been diagnosing
this problem with is only 1524169 characters in length, but there
are undoubtedly others and they will only continue to grow with
time. The main index is also already over the new 5m limit we set
(nearly double it), and while we don't currently process it with
mod_substitute, we shouldn't make it harder to do so if we need to
later.
We've been getting the following error for some pages we're proxying
AH01328: Line too long, URI /pypi/simple/grpcio/,
While we suspect PyPI or its Fastly CDN may have served some unusual
contents for the affected package indices, the content gets cached
and then mod_substitute trips over the result because it (as of
2.3.15) enforces a maximum line length of one megabyte:
Override that default to "5m" per the example in Apache's
mod_substitute documentation.
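A hedged sketch of the relevant Apache configuration; SubstituteMaxLineLength is the real directive, but the location and substitution pattern here are illustrative, not our exact vhost:

```apache
<Location /pypi/simple/>
    ProxyPass https://pypi.org/simple/
    # Rewrite file URLs in the (single-line) PEP 691 JSON responses.
    AddOutputFilterByType SUBSTITUTE application/vnd.pypi.simple.v1+json
    Substitute "s|https://files.pythonhosted.org/|/pypifiles/|n"
    # mod_substitute's default cap is one megabyte; raise it.
    SubstituteMaxLineLength 5m
</Location>
```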
Update the Gerrit upgrade job to check for new on disk h2 cache files.
We discovered well after the fact that Gerrit 3.5 added new (large)
cache files to disk that would've been good to be aware of prior to the
upgrade. This change will check for new files and produce a message if
any are found.
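The check can be sketched like this (the cache directory and file names are illustrative stand-ins, not Gerrit's actual cache contents):

```shell
# Sketch: compare the on-disk H2 cache directory before and after the
# upgrade and report any files that appeared.
workdir=$(mktemp -d)
cache="$workdir/cache"            # stand-in for the Gerrit site cache dir
mkdir "$cache"
touch "$cache/accounts.h2.db"     # existed before the upgrade
ls "$cache" | sort > "$workdir/before"

touch "$cache/change_notes.h2.db" # appears after the upgrade
ls "$cache" | sort > "$workdir/after"

# comm -13 prints lines only in the second file, i.e. the new caches.
new_files=$(comm -13 "$workdir/before" "$workdir/after")
if [ -n "$new_files" ]; then
    echo "New Gerrit cache files found: $new_files"
fi
```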
kernel.org has been rejecting rsync attempts with an over-capacity
message for several days now. Switch to the facebook mirror which
seems to be working for 8-stream.
This reverts commit 21c6dc02b5.
Everything appears to be working with Ansible 2.9, which does seem to
suggest reverting this will result in jobs timing out again. We will
monitor this, and I76ba278d1ffecbd00886531b4554d7aed21c43df is a
potential fix for this.
When this moved with I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3 we lost
access to the variable set as a fact; regenerate it. In a future
change we can look at strategies to share this with the start
timestamp (not totally simple as it is across playbooks on a
dynamically added host).
We've been seeing ansible post-run playbook timeouts in our infra-prod
jobs. The only major thing that has changed recently is the default
update to ansible 5 for these jobs. Force them back to 2.9 to see if the
problem goes away.
Albin Vass has noted that there are possibly glibc + debian bullseye +
ansible 5 problems that may be causing this. If we determine 2.9 is
happy then this is the likely cause.
I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3 added this to ensure that we
always collect logs. However, since this doesn't have bridge
dynamically defined in the playbook, it doesn't run any of the steps.
On the plus side, it doesn't error either.
If the production playbook times out, we don't get any logs collected
with the run. By moving the log collection into a post-run step, we
should always get something copied to help us diagnose what is going
wrong.
These files got moved around and refactored to better support testing of
the Gerrit 3.5 to 3.6 upgrade path. Make sure we trigger the test jobs
when these files are updated.
We had two patches that we were carrying locally via iwienands' fork:
Both appear to have made it into upstream. Let's go ahead and install
directly from the source. We check out the most recent tag of master,
which seems to be how they checkpoint things. Their most recent proper
release tags are more than a decade old. They have decent CI though so I
expect checking out the checkpoint tag will work fine.
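The checkout step can be sketched as below; the throwaway repo here just demonstrates the technique, the real clone would be of the upstream project:

```shell
# Sketch: pick the most recent tag reachable from the branch tip,
# i.e. the upstream "checkpoint", and check it out.
repo=$(mktemp -d)
cd "$repo" && git init -q .
git -c user.email=ci@example.org -c user.name=ci \
    commit -q --allow-empty -m 'work'
git tag checkpoint-2022.01
git -c user.email=ci@example.org -c user.name=ci \
    commit -q --allow-empty -m 'more work'
git tag checkpoint-2022.06

# Most recent tag reachable from HEAD.
tag=$(git describe --tags --abbrev=0 HEAD)
git checkout -q "$tag"
```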
This is a bugfix release that gitea suggests we update to for important
fixes.
Changelog can be found at:
One thing I note is the inclusion of support for git safe.directory in
newer git versions. Our bullseye git version is too old to support this,
but we also configure consistent users so this should be a non-issue
for us.
haproxy only logs to /dev/log; this means all our access logs get
mixed into syslog. This makes it impossible to pick out anything in
syslog that might be interesting (and vice-versa, means you have to
filter out things if analysing just the haproxy logs).
It seems like the standard way to deal with this is to have rsyslogd
listen on a separate socket, and then point haproxy to that. So this
configures rsyslogd to create /var/run/dev/log and maps that into the
container as /dev/log (i.e. we don't have to reconfigure the container
at all).
We then capture this socket's logs to /var/log/haproxy.log, and install
log rotation for it.
Additionally we collect this log from our tests.
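The rsyslogd side can be sketched roughly like this (the drop-in file name is an assumption; the socket path matches the description above):

```
# /etc/rsyslog.d/49-haproxy.conf (sketch)
# Create an extra syslog socket, which is bind-mounted into the
# container as /dev/log.
module(load="imuxsock")
input(type="imuxsock" Socket="/var/run/dev/log" CreatePath="on")

# Send haproxy's messages to their own file and stop processing so
# they no longer end up mixed into /var/log/syslog.
if $programname == 'haproxy' then /var/log/haproxy.log
& stop
```

On the container side the docker-compose volume mapping would then be something like `/var/run/dev/log:/dev/log`.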
This explicitly tests connection through the load-balancer to the
gitea backend to ensure correct operation.
Additionally, it adds a check of the haproxy output to make sure the
back-ends are active (that's the srv_op_state field, c.f. )
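The state check can be sketched against sample "show servers state" output as it comes from haproxy's admin socket (normally fetched with something like `echo 'show servers state' | socat stdio /var/run/haproxy.sock`; the backend names here are illustrative):

```shell
# Sample admin-socket output: a version line, a header line, then one
# row per backend server.
state='1
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state
3 balance_git_https 1 gitea01.opendev.org 10.0.0.1 2 0
3 balance_git_https 2 gitea02.opendev.org 10.0.0.2 2 0'

# srv_op_state is the sixth field; value 2 means the server is running.
echo "$state" | awk 'NR > 2 && $6 != 2 { bad = 1 } END { exit bad }' \
    && echo 'all back-ends active'
```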
Move the paste testing server to paste99 to distinguish it in testing
from the actual production paste service. Since we have certificates
setup now, we can directly test against "paste99.opendev.org",
removing the insecure flags to various calls.
To make testing more like production, copy the OpenDev CA into the
haproxy container configuration directory during Zuul runs. We then
update the testing configuration to use SSL checking like production
does with this cert.
Some of our testing makes use of secure communication between testing
nodes; e.g. testing a load-balancer pass-through. Other parts
"loop-back" but require flags like "curl --insecure" because the
self-signed certificates aren't trusted.
To make testing more realistic, create a CA that is distributed and
trusted by all testing nodes early in the Zuul playbook. This then
allows us to sign local certificates created by the letsencrypt
playbooks with this trusted CA and have realistic peer-to-peer secure
communication.
The other thing this does is reworks the letsencrypt self-signed cert
path to correctly setup SAN records for the host. This also improves
the "realism" of our testing environment. This is so realistic that
it requires fixing the gitea playbook :). The Apache service proxying
gitea currently has to override in testing to "localhost" because that
is all the old certificate covered; we can now just proxy to the
hostname directly for testing and production.
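The CA-and-SAN approach can be sketched with openssl as below; the hostnames and file names are illustrative, not what the playbooks actually generate:

```shell
# Sketch: create a testing CA, then sign a host certificate carrying a
# SAN so peers that trust the CA can verify the real hostname instead
# of overriding checks to localhost.
dir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -subj '/CN=opendev-test-ca' \
    -keyout "$dir/ca.key" -out "$dir/ca.crt" 2>/dev/null
openssl req -newkey rsa:2048 -nodes \
    -subj '/CN=gitea99.opendev.org' \
    -keyout "$dir/host.key" -out "$dir/host.csr" 2>/dev/null
printf 'subjectAltName=DNS:gitea99.opendev.org\n' > "$dir/san.ext"
openssl x509 -req -days 1 -in "$dir/host.csr" \
    -CA "$dir/ca.crt" -CAkey "$dir/ca.key" -CAcreateserial \
    -extfile "$dir/san.ext" -out "$dir/host.crt" 2>/dev/null

# A node that trusts ca.crt can now verify the host certificate.
openssl verify -CAfile "$dir/ca.crt" "$dir/host.crt"
```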
A missed detail of the HTTPS config migration,
/usr/lib/mailman/Mailman/Defaults.py explicitly sets this:
PUBLIC_ARCHIVE_URL = 'http://%(hostname)s/pipermail/%(listname)s/'
Override that setting to https:// so that the archive URL embedded
in E-mail headers will no longer unnecessarily rely on our Apache
redirect. Once merged and deployed, fix_url.py will need to be rerun
for all the lists on both servers in order for this update to take
effect.
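A minimal sketch of the override, assuming the Debian layout where Mailman 2 site overrides live in mm_cfg.py:

```
# /etc/mailman/mm_cfg.py (site overrides; path is the usual Debian one)
PUBLIC_ARCHIVE_URL = 'https://%(hostname)s/pipermail/%(listname)s/'
```

Mailman fills the `%(hostname)s` and `%(listname)s` fields exactly as it does for the http:// default in Defaults.py.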
openEuler 20.03 LTS SP2 went out of date in May 2022, and the newest
LTS version is 22.03 LTS, which will be maintained until March 2024.
This patch adds the 22.03-LTS mirror.