The most recent version of the grafana-oss:latest container seems to
be a beta with some issues, or maybe we need to adapt our deployment.
Until we work that out, pin the container to the latest known working
version.
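As a sketch, the pin amounts to replacing the floating tag with an
explicit one in the compose file (the version shown is illustrative,
not necessarily the one we pin to):

  services:
    grafana:
      image: docker.io/grafana/grafana-oss:9.0.2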
Change-Id: Id50bf3121f3009f36f0f9961cf5211053410a576
How we got here: I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3 split the
log collection into a post-run job so we always collect logs, even if
the main run times out. We then realised in
Ic18c89ecaf144a69e82cbe9eeed2641894af71fb that the log timestamp fact
doesn't persist across playbook runs, and it's not totally clear how
getting it from hostvars interacts with dynamic inventory.
Thus take an approach that doesn't rely on passing variables; this
simply pulls the time from the stamp we put on the first line of the
log file. We then use that to rename the stored file, which should
correspond more closely with the time the Zuul job actually started.
To further remove confusion when looking at a lot of logs, reset the
timestamps to this time as well.
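Roughly, the post-run log handling now looks like this (log path and
stamp format are illustrative):

  # read the timestamp we wrote on the first line of the log
  TS=$(head -1 service.log | awk '{print $1}')
  # rename the stored copy so it matches the job start time
  mv service.log service.${TS}.log
  # and reset the file timestamps to the same time
  touch -d "${TS}" service.${TS}.log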
Change-Id: I7a115c75286e03b09ac3b8982ff0bd01037d34dd
The earlier problems identified with using mod_substitute have been
narrowed down to the new PEP 691 JSON simple API responses from
Warehouse, which are returned as a single line of data. The largest
known project index response we've been diagnosing this problem with
is currently only 1524169 characters long, but there are undoubtedly
others and they will only continue to grow with time. The main index
is also already over the new 5m limit we set (nearly double it), and
while we don't currently process it with mod_substitute, we shouldn't
make it harder to do so if we need to later.
Change-Id: Ib32acd48e5166780841695784c55793d014b3580
Reflect changes to mirror vhost configs immediately in their running
Apache services by notifying a new reload handler.
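In Ansible terms, that is a notify on the vhost template plus a
matching handler, along these lines (names illustrative):

  # task (in the role)
  - name: Install mirror vhost
    template:
      src: mirror.vhost.j2
      dest: /etc/apache2/sites-available/mirror.conf
    notify: mirror apache reload

  # handler
  - name: mirror apache reload
    service:
      name: apache2
      state: reloaded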
Change-Id: Ib3c9560781116f94b0fdfc56dfa5df3a1af74113
We've been getting the following error for some pages we're proxying
today:
AH01328: Line too long, URI /pypi/simple/grpcio/,
While we suspect PyPI or its Fastly CDN may have served some unusual
contents for the affected package indices, the content gets cached
and then mod_substitute trips over the result because it (as of
2.3.15) enforces a maximum line length of one megabyte:
https://bz.apache.org/bugzilla/show_bug.cgi?id=56176
Override that default to "5m" per the example in Apache's
documentation:
https://httpd.apache.org/docs/2.4/mod/mod_substitute.html
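Concretely, placed alongside the existing Substitute rules in the
affected vhost:

  SubstituteMaxLineLength 5m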
Change-Id: I5351f0465287f695fb2f1957062182fd3bf6c226
Update the Gerrit upgrade job to check for new on-disk h2 cache files.
We discovered well after the fact that Gerrit 3.5 added new (large)
cache files to disk that would've been good to be aware of prior to the
upgrade. This change will check for new files and produce a message if
they exist.
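Conceptually, the check is along these lines (paths and file names
illustrative):

  # list the h2 cache files present after the upgrade and complain
  # about any that were not there before it
  find review_site/cache -name '*.h2.db' | sort > after.txt
  new=$(comm -13 before.txt after.txt)
  [ -n "$new" ] && echo "New h2 cache files: $new"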
Change-Id: I4b52f95dd4b23636c0360c9960d84bbed1a5b2d4
kernel.org has been rejecting rsync attempts with an over-capacity
message for several days now. Switch to the Facebook mirror, which
seems to be working for 8-stream.
Change-Id: I98de9dd827a3c78a023b677da854089593d5a454
This reverts commit 21c6dc02b5.
Everything appears to be working with Ansible 2.9, which does seem to
suggest that reverting this will result in jobs timing out again. We
will monitor this, and I76ba278d1ffecbd00886531b4554d7aed21c43df is a
potential fix for this.
Change-Id: Id741d037040bde050abefa4ad7888ea508b484f6
When this moved with I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3 we lost
access to the variable set as a fact; regenerate it. In a future
change we can look at strategies to share this with the start
timestamp (not totally simple as it is across playbooks on a
dynamically added host).
Change-Id: Ic18c89ecaf144a69e82cbe9eeed2641894af71fb
We've been seeing ansible post-run playbook timeouts in our infra-prod
jobs. The only major thing that has changed recently is the default
update to Ansible 5 for these jobs. Force them back to 2.9 to see if
the problem goes away.
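In the Zuul job definition this is a one-line pin (job name
illustrative):

  - job:
      name: infra-prod-base
      ansible-version: 2.9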
Albin Vass has noted that there are possibly glibc + Debian Bullseye +
Ansible 5 problems that may be causing this. If we determine 2.9 is
happy, then this is the likely cause.
Change-Id: Ibd40e15756077d1c64dba933ec0dff6dc0aac374
I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3 added this to ensure that we
always collect logs. However, since this doesn't have bridge
dynamically defined in the playbook, it doesn't run any of the steps.
On the plus side, it doesn't error either.
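For reference, the playbook only operates on bridge after a step like
this adds it to the inventory at runtime (names illustrative):

  - hosts: localhost
    tasks:
      - name: Add bridge to the inventory
        add_host:
          name: bridge.openstack.org
          groups: bridge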
Change-Id: I97beecbc48c83b9dea661a61e21e0d0d29ca4733
If the production playbook times out, we don't get any logs collected
with the run. By moving the log collection into a post-run step, we
should always get something copied to help us diagnose what is going
wrong.
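In the job definition that means moving the collection playbook from
the run list to post-run, roughly (paths illustrative):

  - job:
      name: infra-prod-playbook
      run: playbooks/zuul/run-production-playbook.yml
      post-run: playbooks/zuul/run-production-playbook-post.yml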
Change-Id: I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3
These files got moved around and refactored to better support testing of
the Gerrit 3.5 to 3.6 upgrade path. Make sure we trigger the test jobs
when these files are updated.
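That is, extending the jobs' files matchers along these lines (job
name and paths illustrative):

  - job:
      name: system-config-run-review-3.5
      files:
        - playbooks/roles/gerrit/.*
        - testinfra/test_gerrit.py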
Change-Id: I5a520e8a8a7c794a761279d4fb98c23e5d25f0ad
ansible_date_time is actually the cached fact time, which has little
bearing on the actual time this is running [1] -- the latter being
what you want to see when, for example, tracing backwards to see why
some runs are randomly timing out.
[1] https://docs.ansible.com/ansible/latest/user_guide/playbooks_vars_facts.html#ansible-facts
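A sketch of grabbing the real wallclock time instead of the cached
fact (format illustrative):

  - name: Get current timestamp
    command: date -u '+%Y-%m-%dT%H:%M:%S'
    register: timestamp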
Change-Id: I8b5559178e29f8604edf6a42507322fc928afb21
We had two patches that we were carrying locally via iwienands' fork:
https://github.com/ProgVal/Limnoria/pull/1464
https://github.com/ProgVal/Limnoria/pull/1473
Both appear to have made it into upstream. Let's go ahead and install
directly from the source. We check out the most recent tag on master,
which seems to be how they checkpoint things. Their most recent proper
release tags are more than a decade old. They have decent CI though,
so I expect checking out the checkpoint tag will work fine.
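i.e. roughly (the tag name here is illustrative):

  pip install git+https://github.com/ProgVal/Limnoria@master-2022-06-13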
Change-Id: I9fcf17a148a27c2bbdd119961e9df5b38bd6b396
This is a bugfix release that gitea suggests we update to for important
fixes.
Changelog can be found at:
https://github.com/go-gitea/gitea/blob/v1.16.9/CHANGELOG.md
One thing I note is the inclusion of support for git safe.directory in
newer git versions. Our bullseye git version is too old to support
this, but we also configure consistent users, so this should be a
non-issue for us.
Change-Id: I8c3e4e5eead13eeb72bee3ae6c8b89081cdc5cf0
haproxy only logs to /dev/log; this means all our access logs get
mixed into syslog. This makes it impossible to pick out anything in
syslog that might be interesting (and vice-versa, means you have to
filter out things if analysing just the haproxy logs).
It seems like the standard way to deal with this is to have rsyslogd
listen on a separate socket, and then point haproxy to that. So this
configures rsyslogd to create /var/run/dev/log and maps that into the
container as /dev/log (i.e. we don't have to reconfigure the container
at all).
We then capture this socket's logs to /var/log/haproxy.log, and
install rotation for it.
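The rsyslogd side is roughly this (sketch; the imuxsock module is
typically already loaded by the stock configuration):

  # listen on an extra syslog socket for haproxy to log to
  input(type="imuxsock" Socket="/var/run/dev/log" CreatePath="on")

  # file haproxy messages separately and stop processing them
  if $programname == 'haproxy' then /var/log/haproxy.log
  & stop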
Additionally we collect this log from our tests.
Change-Id: I32948793df7fd9b990c948730349b24361a8f307
Move the paste testing server to paste99 to distinguish it in testing
from the actual production paste service. Since we have certificates
set up now, we can test directly against "paste99.opendev.org",
removing the insecure flags from various calls.
Change-Id: Ifd5e270604102806736dffa86dff2bf8b23799c5
To make testing more like production, copy the OpenDev CA into the
haproxy container configuration directory during Zuul runs. We then
update the testing configuration to use SSL checking like production
does with this cert.
Change-Id: I1292bc1aa4948c8120dada0f0fd7dfc7ca619afd
Some of our testing makes use of secure communication between testing
nodes; e.g. testing a load-balancer pass-through. Other parts
"loop-back" but require flags like "curl --insecure" because the
self-signed certificates aren't trusted.
To make testing more realistic, create a CA that is distributed and
trusted by all testing nodes early in the Zuul playbook. This then
allows us to sign local certificates created by the letsencrypt
playbooks with this trusted CA and have realistic peer-to-peer secure
communications.
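A minimal sketch of the flow with openssl (file names illustrative):

  # create the testing CA once, early in the Zuul run
  openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
      -subj '/CN=opendev-test-ca' -keyout ca.key -out ca.crt

  # sign a host's CSR with it, including a SAN for the host
  openssl x509 -req -in host.csr -CA ca.crt -CAkey ca.key \
      -CAcreateserial -out host.crt \
      -extfile <(echo 'subjectAltName=DNS:host.opendev.org')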
The other thing this does is rework the letsencrypt self-signed cert
path to correctly set up SAN records for the host. This also improves
the "realism" of our testing environment. This is so realistic that
it requires fixing the gitea playbook :). The Apache service proxying
gitea currently has to override the name in testing to "localhost",
because that is all the old certificate covered; we can now just proxy
to the hostname directly in both testing and production.
Change-Id: I3d49a7b683462a076263127018ec6a0f16735c94
A missed detail of the HTTPS config migration,
/usr/lib/mailman/Mailman/Defaults.py explicitly sets this:
PUBLIC_ARCHIVE_URL = 'http://%(hostname)s/pipermail/%(listname)s/'
Override that setting to https:// so that the archive URL embedded
in E-mail headers will no longer unnecessarily rely on our Apache
redirect. Once merged and deployed, fix_url.py will need to be rerun
for all the lists on both servers in order for this update to take
effect.
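i.e. in mm_cfg.py:

  PUBLIC_ARCHIVE_URL = 'https://%(hostname)s/pipermail/%(listname)s/'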
Change-Id: Ie4a6e04a2ef0de1db7336a2607059a2ad42665c2
openEuler 20.03 LTS SP2 went out of date in May 2022, and the newest
LTS version is 22.03 LTS, which will be maintained until March 2024.
This patch adds the 22.03-LTS mirror.
We have moved to a situation where we proxy requests to gitea (3000)
via Apache listening on 3081 -- this is useful for layer 7 filtering
like matching on user-agents.
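As an illustration of the sort of layer 7 rule this enables (pattern
illustrative):

  # reject a problematic crawler before it reaches gitea
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} "badbot" [NC]
  RewriteRule . - [F,L]
  ProxyPass / http://localhost:3000/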
It seems like we missed some of this configuration in our
load-balancer testing. Update the https forward on the load-balancer
to port 3081 on the gitea test host.
Also, remove the explicit port opening in the testing group_vars; for
some reason this was not opening port 3080 (http). This will just use
the production settings when we don't override it.
Change-Id: Ic5690ed893b909a7e6b4074a1e5cd71ab0683ab4
I494a21911a2279228e57ff8d2b731b06a1573438 didn't promote the gerrit
images, so 3.6 remains untagged. Update the stamp to trigger this.
Change-Id: I48c5a5d69fc31bb81f220566bc4360b762a51d63
For the past six months, all our mailing list sites have supported
HTTPS without incident. The main downside to the current
implementation is that Mailman itself writes some URLs with an
explicit scheme, causing people submitting forms from pages served
over HTTPS to get warnings because the forms are posting to plain
HTTP URLs for the same site. In order to correct this, we need to
tell Mailman to put https:// instead of http:// into these, but
doing so essentially eliminates any reason for us to continue
serving content over plain HTTP anyway.
Configure the default URL scheme of all our Mailman sites to use
HTTPS now, and set up permanent redirects from HTTP to HTTPS, per
the examples in the project's documentation:
https://wiki.list.org/DOC/4.27%20Securing%20Mailman%27s%20web%20GUI%20by%20using%20Secure%20HTTP-SSL%20%28HTTPS%29
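Concretely, that amounts to settings along these lines (sketch):

  # mm_cfg.py
  DEFAULT_URL_PATTERN = 'https://%(hostname)s/cgi-bin/mailman/'

plus a permanent redirect in each port 80 vhost:

  Redirect permanent / https://lists.opendev.org/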
Also update our testinfra functions to validate the blanket
redirects and perform all other testing over HTTPS.
Once this merges, the fix_url script will need to be run manually
against all lists for the current sites, as noted in that document.
Change-Id: I366bc915685fb47ef723f29d16211a2550e02e34
This moves the gitea partial clone test from our setup playbook into
testinfra/test_gitea.py. We should avoid asserting too much state and
behavior in the Ansible, as that makes the split between testinfra and
Ansible more confusing. To address this, we move the behavior check
into testinfra, where it belongs.
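A sketch of what the testinfra version of the check looks like (names
and URL illustrative):

  def test_gitea_partial_clone(host):
      # a blobless partial clone exercises gitea's filter support
      cmd = host.run(
          'git clone --filter=blob:none '
          'https://localhost:3081/opendev/system-config /tmp/sc')
      assert cmd.rc == 0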
Change-Id: I6a649bc380f850425c51e9b4632c798a23ab0e0e