The previous version quite reliably causes worker deadlocks within 10
minutes running:
# 100 times:
- import_playbook: integration/async/runner_one_job.yml
# 100 times:
- import_playbook: integration/module_utils/adjacent_to_playbook.yml
via .ci/soak/mitogen.sh with PLAYBOOK= set to the above playbook.
Attaching to the worker with gdb reveals it in an instruction
immediately following a futex() call, which likely returned EINTR due to
attaching gdb. Examining the pthread_mutex_t state reveals it to be
completely unlocked.
pthread_mutex_t on Linux should have zero trouble living in shmem, so
it's not clear how this deadlock is happening. Meanwhile POSIX
semaphores are explicitly designed for cross-process use and have a
completely different internal implementation, so try those instead. 1
hour of soaking reveals no deadlock.
This is about avoiding managing a lockable temporary file on disk to
contain our counter, and somehow communicating a reference to it into
subprocesses (despite the subprocess module closing inherited fds, etc),
somehow deleting it reliably at exit, and somehow avoiding concurrent
Ansible runs stepping on the same file. For now ctypes is still less
pain.
A final possibility would be to abandon a shared counter and instead
pick a CPU based on the hash of e.g. the new child's process ID. That
would likely balance equally well, and might be worth exploring when
making this code work on BSD.
It's no longer necessary, since connection attempts are no longer truly
blocking. When CTRL+C is hit in the top-level process, broker will begin
shutdown, which will cancel all pending connection attempts, causing
pool threads to wake. The pool can't block during shutdown anymore.
"self.initialized = False" slipped in a few days ago, on second thoughts
that flag is not needed at all, by simply rearranging ClassicWorkerModel
to have a regular constructor.
This hierarchy is still squishy, it needs more love. Remaining
MuxProcess class attributes should eliminated.
While catching every possible case where "open file limit exceeded" is
not possible, we can at least increase the soft limit to the available
hard limit without any user effort.
Do this in Ansible top-level process, even though we probably only need
it in the MuxProcess. It seems there is no reason this could hurt
Previously we exitted without calling waitpid(), which meant the
top-level process struct rusage did not reflect the resource usage
consumed by the multiplexer processes.
Existing benchmarks are made using perf so this never created a problem,
but it could be confusing to others using the "time" command, and also
allows logging the final exit status of the process.
Move all details of broker/router setup out of connection.py, instead
deferring it to a WorkerModel class exported by process.py via
get_worker_model(). The running strategy can override the configured
worker model via _get_worker_model().
ClassicWorkerModel is installed by default, which implements the
extension's existing process model.
Add optional support for the third party setproctitle module, so
children have pretty names in ps output.
Add optional support for per-CPU multiplexers to classic runs.
This relies on the previous commit resetting global variables.
Update clean_shutdown() to handle duplicate calls, due to tests
repeatedly installing it.
Not clear what the intention is here. Either need to ferret it out of
some other location, or just stop preloading the connection class in the
top-level process.