- don't create a new connection during reset if no existing connection
exists
- strip off last hop in connection stack if PlayContext.become is True.
- log a debug message if reset cannot find an existing connection
It used to be set by on_action_run() from task_vars, but this doesn't
work for meta: reset_connection. That meant MITOGEN_CPU_COUNT>1 would
pick the wrong mux to reset the connection on.
Without this, MuxProcess will start dying too early, before Ansible /
TaskQueueManager.cleanup() has a chance to wait on worker processes.
That would allow WorkerProcess to see ECONNREFUSED from the MuxProcess
socket much more easily.
This 4GB limit was already set for MuxProcess and inherited by all
descendents including the context running on the target host, but it was
not applied to the WorkerProcess router.
That explains why the error from the ticket is being raised by the
router within the WorkerProcess rather than the router on the original
target.
The undocumented 'tmp' parameter controls whether _execute_module()
would delete anything on 2.3, so mimic that. This means
_execute_remove_stat() calls will not blow away the temp directory,
which broke the unarchive plugin.
The previous version quite reliably causes worker deadlocks within 10
minutes running:
# 100 times:
- import_playbook: integration/async/runner_one_job.yml
# 100 times:
- import_playbook: integration/module_utils/adjacent_to_playbook.yml
via .ci/soak/mitogen.sh with PLAYBOOK= set to the above playbook.
Attaching to the worker with gdb reveals it in an instruction
immediately following a futex() call, which likely returned EINTR due to
attaching gdb. Examining the pthread_mutex_t state reveals it to be
completely unlocked.
pthread_mutex_t on Linux should have zero trouble living in shmem, so
it's not clear how this deadlock is happening. Meanwhile POSIX
semaphores are explicitly designed for cross-process use and have a
completely different internal implementation, so try those instead. 1
hour of soaking reveals no deadlock.
This is about avoiding managing a lockable temporary file on disk to
contain our counter, and somehow communicating a reference to it into
subprocesses (despite the subprocess module closing inherited fds, etc),
somehow deleting it reliably at exit, and somehow avoiding concurrent
Ansible runs stepping on the same file. For now ctypes is still less
pain.
A final possibility would be to abandon a shared counter and instead
pick a CPU based on the hash of e.g. the new child's process ID. That
would likely balance equally well, and might be worth exploring when
making this code work on BSD.
It's no longer necessary, since connection attempts are no longer truly
blocking. When CTRL+C is hit in the top-level process, broker will begin
shutdown, which will cancel all pending connection attempts, causing
pool threads to wake. The pool can't block during shutdown anymore.