In thinking harder about the bootstrap process, it struck me that the
"bastion" group we have is two separate ideas that become a bit
confusing because they share a name.
We have the testing and production paths that need to find a single
bridge node so they can run their nested Ansible. We've recently
merged changes to the setup playbooks to not hard-code the bridge node
and they now use groups["bastion"][0] to find the bastion host -- but
this group is actually orthogonal to the group of the same name
defined in inventory/service/groups.yaml.
The testing and production paths run on the executor and, as
mentioned, need to know the bridge node to log into. For the testing
path this happens via the group created in the job definition in
zuul.d/system-config-run.yaml. For the production jobs, this
group is populated via the add-bastion-host role which dynamically
adds the bridge host and group.
Only the *nested* Ansible running on the bastion host reads
s-c:inventory/service/groups.yaml. None of the nested-ansible
playbooks need to target only the currently active bastion host. For
example, we can define as many bridge nodes as we like in the
inventory and run service-bridge.yaml against them. It won't matter,
because the production jobs know which host is the currently active
bridge, as described above.
So, instead of using the same group name in two contexts, rename the
testing/production group to "prod_bastion". groups["prod_bastion"][0]
will be the host that the testing/production jobs use as the bastion
host -- references are updated in this change (i.e. the two places
this group is defined -- the group name in the system-config-run jobs,
and add-bastion-host for production).
We then can return the "bastion" group match to bridge*.opendev.org in
inventory/service/groups.yaml.
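For reference, the entry ends up looking something like this (sketch
only; the exact syntax follows the existing entries in that file):
  bastion:
    - bridge*.opendev.org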
This fixes a bootstrapping problem -- if you launch, say,
bridge03.opendev.org the launch node script will now apply the
base.yaml playbook against it, and correctly apply all variables from
the "bastion" group which now matches this new host. This is what we
want to ensure, e.g. the zuul user and keys are correctly populated.
The other thing we can do here is change the testing path
"prod_bastion" hostname to "bridge99.opendev.org". By doing this we
ensure we're not hard-coding the production bridge host anywhere
(if both testing and production were called bridge01.opendev.org we
could hide problems). This is a big advantage when we want to rotate
the production bridge host, as we can be certain there are no hidden
dependencies.
Change-Id: I137ab824b9a09ccb067b8d5f0bb2896192291883
Following on from Iffb462371939989b03e5d6ac6c5df63aa7708513, instead
of directly referring to a hostname when adding the bastion host to
the inventory for the production playbooks, this finds it from the
first element of the "bastion" group.
As we do this twice for the run and post playbooks, abstract it into a
role.
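The heart of the role is roughly the following sketch (connection
variables and exact layout elided):
  - name: Add the bastion host to the running inventory
    add_host:
      name: "{{ groups['bastion'][0] }}"   # previously a hard-coded hostname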
The host value is currently "bridge.openstack.org" -- as is the
existing hard-coding -- thus this is intended to be a no-op change.
It is setting the foundation to make replacing the bastion host a
simpler process in the future.
Change-Id: I286796ebd71173019a627f8fe8d9a25d0bfc575a
This was introduced by Ifbb5b8acb1f231812905cf9643bfec6fbbd08324. The
flag is actually "disabled". Zuul documentation has been updated with
Ib45ec943d4b227ba254354d116440aa521fb6b9e.
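The end result on these hosts is a host/group variable along these
lines (the variable name is assumed here from the Zuul change above):
  zuul_console_disabled: true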
Change-Id: Ie0a0d8f4ae137dc12f4c13f901096ee39d9a088e
By setting this variable (added in the dependent change) Zuul's
shell/command override will not write out streaming spool files in
/tmp. In our case, port 19885 is firewalled off to these hosts, so
they will never be used for streaming results.
Change-Id: Ifbb5b8acb1f231812905cf9643bfec6fbbd08324
Depends-On: https://review.opendev.org/855309
How we got here - I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3 split the
log collection into a post-run job so we always collect logs, even if
the main run times out. We then realised in
Ic18c89ecaf144a69e82cbe9eeed2641894af71fb that the log timestamp fact
doesn't persist across playbook runs and it's not totally clear how
getting it from hostvars interacts with dynamic inventory.
Thus take an approach that doesn't rely on passing variables; this
simply pulls the time from the stamp we put on the first line of the
log file. We then use that to rename the stored file, which should
correspond more closely with the time the Zuul job actually started.
To further remove confusion when looking at a lot of logs, reset the
timestamps to this time as well.
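Roughly the idea, as a sketch (log path and timestamp format are
illustrative):
  - name: Rename the stored log using the timestamp on its first line
    shell: |
      log=/var/log/ansible/service-example.yaml.log
      ts=$(head -1 "$log" | awk '{print $1" "$2}')   # e.g. "2022-10-20 01:02:03,123"
      stamp=$(date -d "${ts%,*}" +%Y-%m-%dT%H:%M:%S)
      mv "$log" "${log%.log}.$stamp.log"
      touch -d "${ts%,*}" "${log%.log}.$stamp.log"   # reset mtime to the same time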
Change-Id: I7a115c75286e03b09ac3b8982ff0bd01037d34dd
If the production playbook times out, we don't get any logs collected
with the run. By moving the log collection into a post-run step, we
should always get something copied to help us diagnose what is going
wrong.
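In Zuul job terms this just means hanging the collection off a
post-run playbook, something like (job and playbook names
illustrative):
  - job:
      name: infra-prod-base
      run: playbooks/zuul/run-production-playbook.yaml
      post-run: playbooks/zuul/run-production-playbook-post.yaml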
Change-Id: I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3
ansible_date_time is actually the cached fact time, which has little
bearing on the actual time this is running [1] -- the latter is what you
want to see when, for example, tracing backwards to see why some runs
are randomly timing out.
[1] https://docs.ansible.com/ansible/latest/user_guide/playbooks_vars_facts.html#ansible-facts
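If we want the real wall-clock time we have to ask for it at
execution time, e.g. something like (the fact name is illustrative):
  - name: Record the actual wall-clock time rather than the cached fact
    set_fact:
      # ansible_date_time.iso8601 would reflect when facts were gathered
      run_timestamp: "{{ lookup('pipe', 'date -u +%Y-%m-%dT%H:%M:%S') }}"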
Change-Id: I8b5559178e29f8604edf6a42507322fc928afb21
Because "." is a field separator for graphite, we're incorrectly
nesting the results.
A better idea seems to be to store these stats under the job name.
That's going to be more helpful when looking up in Zuul build results
anyway.
Follow-on to I90dfb7a25cb5ab08403c89ef59ea21972cf2aae2
Change-Id: Icbb57fd23d8b90f52bc7a0ea5fa80f389ab3892e
We used to track the runtime with the old cron-based system
(I299c0ab5dc3dea4841e560d8fb95b8f3e7df89f2) and had a dashboard view,
which was often helpful to see at a glance what might be going wrong.
Restore this for Zuul CD by simply sending the nested-Ansible task
time-delta and status to graphite. bridge.openstack.org is still
allowed to send stats to graphite from this prior work, so no ports
need to be opened.
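As a sketch of the reporting step (metric path, endpoint and tooling
here are assumptions, not the landed code; graphite's plaintext
protocol takes "<path> <value> <timestamp>" on port 2003):
  - name: Report nested-Ansible runtime to graphite
    vars:
      elapsed_seconds: 42          # placeholder for the measured time-delta
    shell: |
      echo "infra_prod.example_job.runtime {{ elapsed_seconds }} $(date +%s)" \
        | nc -q 1 graphite.opendev.org 2003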
Change-Id: I90dfb7a25cb5ab08403c89ef59ea21972cf2aae2
I didn't consider permissions on the production machine; since we run
Ansible as root the extant path can't access the logs.
By copying the logfile to be encrypted into a staging area we can leave
everything else alone for now. Upon reflection it seems like a better
idea to do this in an ephemeral location anyway and not leave anything
behind. We move the cleanup into an always block too to ensure this.
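Roughly, the shape is (paths illustrative; the encrypt-and-return
tasks sit inside the block):
  - block:
      - name: Create an ephemeral staging area
        file:
          path: /tmp/prod-log-staging
          state: directory
      - name: Copy the log somewhere we can read and encrypt it
        command: cp /var/log/ansible/service-example.yaml.log /tmp/prod-log-staging/
    always:
      - name: Clean up the staging area whether or not that worked
        file:
          path: /tmp/prod-log-staging
          state: absent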
Bump the codesearch playbook to trigger the prod job with these
changes.
Change-Id: I47f63df04d58b7a87bce445da0c0bdcb80edc8f9
This fails if the variable isn't defined; because we limited
I9bd4ed0880596968000b1f153c31df849cd7fa8d to just one job to start,
the others fail with a missing definition.
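The fix is the usual guard with a default, roughly (task file and
variable names here are stand-ins):
  - name: Only return encrypted logs where the job asks for it
    include_tasks: encrypt-logs.yaml
    when: encrypt_logs | default(false)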
Change-Id: I74b31f51494e7264e2a68f333943b143842f9a99
Based on the changes in I5b9f9dd53eb896bb542652e8175c570877842584,
enable returning encrypted log artifacts for the codesearch production
job, as an initial test.
Change-Id: I9bd4ed0880596968000b1f153c31df849cd7fa8d
The dependent change moves this into the common infra-prod-base job so
we don't have to do this in here.
Change-Id: I444d2844fe7c7560088c7ef9112893da1496ae62
Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/818189
The known_hosts key is written out by the parent infra-prod-base job in
the run-production-playbook.yaml step [1]. We don't need to do this
here again.
[1] 2c194e5cbf/playbooks/zuul/run-production-playbook.yaml (L1)
Change-Id: I514132b2dbc20ac321a79ca2eb6d4c8b11c4296d
We need to add the host (and possibly the ssh host key, so it's here too) in
this playbook because the add_host from the base-jobs side is only
applicable to the playbook running in base-jobs. When we start our
playbook here that state is lost. Simple fix, just add_host it again.
Change-Id: Iee60d04f0232500be745a7a8ca0eac4a6202063d
This uses a new base job which handles pushing the git repos on to
bridge since that must now happen in a trusted playbook.
Depends-On: https://review.opendev.org/742934
Change-Id: Ie6d0668f83af801c0c0e920b676f2f49e19c59f6
The intent of the periodic jobs is to run with latest master. If
they get enqueued and then other patches land, they'll still run with
the zuul ref from when they were enqueued. That's not
what we want for prod, as it can lead to running old versions of
config.
We don't usually like doing this, but in this case, rather than
making us remember to add a flag every time a prod job gets added
to a periodic pipeline, how's about we just calculate it.
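i.e. derive it from the pipeline the build is running in, along the
lines of (the fact name is illustrative):
  - name: Use latest master for periodic builds, the speculative ref otherwise
    set_fact:
      prod_git_ref: "{{ 'master' if 'periodic' in zuul.pipeline else zuul.ref }}"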
Change-Id: Ib999731fe132b1e9f197e51d74066fa75cb6c69b
Pass the ansible_host variable explicitly to mirror-workspace-git-repos
because for some reason it's confused and getting localhost.
Change-Id: I8a30b98a6eef168d11d4d580de359546ee1da252
Added this in the last patch without a specific need to. But
then we're getting an error. Because of course we are.
Change-Id: I5c982af2e1ba09a78162b2786e31f541247fce21
The mirror-workspace-git role expects things like ansible_port to
be set, but we're not producing them in our add_host command.
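i.e. the add_host needs to carry the connection details too, roughly
(values illustrative):
  - name: Add the bastion host with full connection details
    add_host:
      name: bridge.openstack.org
      ansible_host: bridge.openstack.org
      ansible_port: 22
      ansible_user: zuul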
Change-Id: Ib80062736e91f8d1471a42edecdebb449f073927
We use project-config for gerrit, gitea and nodepool config. That's
cool, because we can clone that from zuul too and make sure that each
prod run uses the contents of the patch in question.
Introduce a flag file that can be touched in /home/zuulcd that will
block zuul from running prod playbooks. By default, if the file is
there, zuul will wait for an hour before giving up.
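As a sketch of that guard (the flag file name is assumed):
  - name: Wait for any manual-disable flag file to be removed
    wait_for:
      path: /home/zuulcd/DISABLE-ANSIBLE
      state: absent
      timeout: 3600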
Rename zuulcd to zuul
To better align prod and test, name the zuul user zuul.
Change-Id: I83c38c9c430218059579f3763e02d6b9f40c7b89
We don't have cached repos, and our repos aren't so big
that we want to care about the git push difference.
Also, don't do delete: true like prepare-workspace does,
because deleting and then re-pushing project-config depending
on the job would be costly.
Change-Id: I4c7bbc797f9f81878424b7bf2b7e83ec756de108
Instead of running from system-config, run from the zuul prepared
git repo state. We already have a mutex of one, so we'll never
be fighting. This lets us land stacks of changes and be sure they
will always use the correct git state.
As a todo, we should update manage-projects to do the same with
project-config.
Change-Id: I358554e344e12561d1f3063e0724f6b61d1f15a7
We are writing to /var/log/ansible which needs root perms. This was not
done and the writes failed. Fix that.
Change-Id: Ibe93519f2f549e85f0e238a210999c6281f42ce6
This updates prod playbook jobs to curate a set of logs on bridge if we
aren't publishing them to zuul. This way we have history on the bastion
server.
Change-Id: I73889754155298a8554ddc17bb413ae7764b9eae
More importantly, put the log collection in an always
section of the block, otherwise we won't get logs if a
playbook fails, which is pretty much exactly when we
want to get logs.
Change-Id: Ia8e581e522f75a5f5945bc2143eec63b93381a94
We need to log to a file and then collect it to the log output on
zuul. Default to true so that steady-state reads nicely. When we
add new jobs we should make sure to set it to false first so that we
can vet the output before publishing it.
Change-Id: Ia4f759b82a5fff6e36e4284c11281254c0d5627d
We don't have python2 on bridge.o.o, so force python3.
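One way to do that is to pin the interpreter for the host, e.g.
(exact location of the setting aside):
  ansible_python_interpreter: python3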
Change-Id: Ie8eb68007c0854329cf3757e577ebcbfd40ed8aa
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
We want to trigger ansible runs on bridge.o.o from zuul jobs. First
iteration of this tried to log in as root but this is not allowed by our
ssh config. That config seems reasonable so we add a zuul user instead
which we can ssh in as then run things as root from zuul jobs. This
makes use of our existing user management system.
Change-Id: I257ebb6ffbade4eb645a08d3602a7024069e60b3
This new job is a parent job allowing us to CD from Zuul via
bridge.openstack.org. Using Zuul project ssh keys we add_host bridge.o.o
to our running inventory on the executor, then run ansible against
bridge.o.o to execute an ansible playbook in
bridge.openstack.org:/opt/system-config/playbooks.
Change-Id: I5cd2dcc53ac480459a22d9e19ef38af78a9e90f7