In thinking harder about the bootstrap process, it struck me that our
"bastion" group conflates two separate ideas, which becomes confusing
because they share a name.

We have the testing and production paths that need to find a single
bridge node so they can run their nested Ansible. We've recently
merged changes to the setup playbooks to not hard-code the bridge node
and they now use groups["bastion"] to find the bastion host -- but
this group is actually orthogonal to the group of the same name
defined in inventory/service/groups.yaml.

The testing and production paths are running on the executor, and, as
mentioned, need to know the bridge node to log into. For the testing
path this is happening via the group created in the job definition
from zuul.d/system-config-run.yaml. For the production jobs, this
group is populated via the add-bastion-host role which dynamically
adds the bridge host and group.
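As a rough sketch (the task shown is illustrative, not the exact
contents of the role), the production path's dynamic add looks
something like:

```yaml
# Hypothetical sketch of what add-bastion-host does: dynamically add
# the bridge node to the running inventory and place it in the group
# the production playbooks look up.
- name: Add bastion host to inventory
  add_host:
    name: bridge.openstack.org
    groups: bastion
```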

Only the *nested* Ansible running on the bastion host reads
s-c:inventory/service/groups.yaml. None of the nested-ansible
playbooks need to target only the currently active bastion host. For
example, we can define as many bridge nodes as we like in the
inventory and run service-bridge.yaml against them. It won't matter
because the production jobs know the host that is the currently active
bridge as described above.

So, instead of using the same group name in two contexts, rename the
testing/production group to "prod_bastion". groups["prod_bastion"]
will be the host that the testing/production jobs use as the bastion
host -- references are updated in this change (i.e. the two places
this group is defined -- the group name in the system-config-run jobs,
and add-bastion-host for production).

We can then return the "bastion" group match to bridge*.opendev.org
in inventory/service/groups.yaml.

This fixes a bootstrapping problem -- if you launch, say,
bridge03.opendev.org the launch node script will now apply the
base.yaml playbook against it, and correctly apply all variables from
the "bastion" group, which now matches this new host. This is what we
want so that, e.g., the zuul user and keys are correctly populated.

The other thing we can do here is change the testing path
"prod_bastion" hostname to "bridge99.opendev.org". By doing this we
ensure we're not hard-coding the production bridge host anywhere
(if both testing and production were called bridge01.opendev.org,
testing could hide problems). This is a big advantage when we want to
rotate
the production bridge host, as we can be certain there's no hidden
hard-coding of the old hostname.
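On the testing side, the renamed group might look roughly like this in
the job's nodeset (a sketch; the node label and exact layout of
zuul.d/system-config-run.yaml are assumptions):

```yaml
# Hypothetical nodeset fragment for a system-config-run job: the
# bastion node is bridge99.opendev.org and lands in prod_bastion.
nodeset:
  nodes:
    - name: bridge99.opendev.org
      label: ubuntu-jammy
  groups:
    - name: prod_bastion
      nodes:
        - bridge99.opendev.org
```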

Following on from Iffb462371939989b03e5d6ac6c5df63aa7708513, instead
of directly referring to a hostname when adding the bastion host to
the inventory for the production playbooks, this finds it from the
first element of the "bastion" group.
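Sketched as a task (names and layout illustrative, not the exact role
contents):

```yaml
# Hypothetical: derive the bastion hostname from the first element of
# the "bastion" group rather than hard-coding it.
- name: Add bastion host to inventory
  add_host:
    name: "{{ groups['bastion'][0] }}"
    groups: bastion
```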

As we do this twice, for the run and post playbooks, abstract it into
a role.

The host value is currently "bridge.openstack.org" -- as is the
existing hard-coding -- thus this is intended to be a no-op change.
It is setting the foundation to make replacing the bastion host a
simpler process in the future.

This was introduced by Ifbb5b8acb1f231812905cf9643bfec6fbbd08324. The
flag is actually "disabled". Zuul documentation has been updated
accordingly.

By setting this variable (added in the dependent change) Zuul's
shell/command override will not write out streaming spool files in
/tmp. In our case, port 19885 is firewalled off to these hosts, so
they will never be used for streaming results.

Our infra-prod jobs use a strftime format string to update log file
modification times. This format string had a stray '%' in it, leading
to:

  "Error while obtaining timestamp for time 2022-08-04T18:02:59 using
  format %Y-%m%-%dT%H:%M:%S: '-' is a bad directive in format
  '%Y-%m%-%dT%H:%M:%S'"

Fix that by removing the extra '%'.
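The failure is easy to reproduce with Python's datetime.strptime,
which uses the same directive syntax (a minimal sketch, not the actual
job code):

```python
from datetime import datetime

ts = "2022-08-04T18:02:59"

# Broken format string: the stray '%' before '-d' creates the invalid
# directive '%-', which the parser rejects.
bad_fmt = "%Y-%m%-%dT%H:%M:%S"
try:
    datetime.strptime(ts, bad_fmt)
except ValueError as err:
    print(err)  # '-' is a bad directive in format '%Y-%m%-%dT%H:%M:%S'

# With the extra '%' removed, parsing succeeds.
good_fmt = "%Y-%m-%dT%H:%M:%S"
print(datetime.strptime(ts, good_fmt).isoformat())  # 2022-08-04T18:02:59
```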

How we got here: I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3 split the
log collection into a post-run job so we always collect logs, even if
the main run times out. We then realised in
Ic18c89ecaf144a69e82cbe9eeed2641894af71fb that the log timestamp fact
doesn't persist across playbook runs and it's not totally clear how
getting it from hostvars interacts with dynamic inventory.
Thus take an approach that doesn't rely on passing variables; this
simply pulls the time from the stamp we put on the first line of the
log file. We then use that to rename the stored file, which should
correspond more closely with the time the Zuul job actually started.
To further remove confusion when looking at a lot of logs, reset the
timestamps to this time as well.
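A hedged sketch of the approach, assuming the first line of the log
begins with a '%Y-%m-%dT%H:%M:%S' stamp (the function name and log
layout are hypothetical, not the playbook's actual code):

```python
import os
from datetime import datetime, timezone

def rename_log_with_start_time(path):
    """Read the timestamp from the first line of a log file, rename
    the file to include it, and reset the file's mtime to match."""
    with open(path) as f:
        # e.g. first line: "2022-08-04T18:02:59 | PLAY ..." (assumed layout)
        stamp_text = f.readline().split()[0]
    stamp = datetime.strptime(stamp_text, "%Y-%m-%dT%H:%M:%S")
    new_path = "%s.%s" % (path, stamp_text)
    os.rename(path, new_path)
    # Reset atime/mtime so directory listings line up with job start.
    epoch = stamp.replace(tzinfo=timezone.utc).timestamp()
    os.utime(new_path, (epoch, epoch))
    return new_path
```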

When this moved with I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3 we lost
access to the variable set as a fact; regenerate it. In a future
change we can look at strategies to share this with the start
timestamp (not totally simple, as it is set across playbooks on a
dynamically added host).

I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3 added this to ensure that we
always collect logs. However, since this doesn't have bridge
dynamically defined in the playbook, it doesn't run any of the steps.
On the plus side, it doesn't error either.

If the production playbook times out, we don't get any logs collected
with the run. By moving the log collection into a post-run step, we
should always get something copied to help us diagnose what is going
wrong.