Without this, MuxProcess will start dying too early, before Ansible /
TaskQueueManager.cleanup() has a chance to wait on worker processes.
That would allow WorkerProcess to see ECONNREFUSED from the MuxProcess
socket much more easily.
* origin/dmw:
issue #615: ensure 4GB max_message_size is configured for task workers.
issue #615: update Changelog.
issue #615: route a dead message to recipients when no reply is expected
issue #615: fetch_file() might be called with AnsibleUnicode.
issue #615: redirect 'fetch' action to 'mitogen_fetch'.
issue #615: extricate slurp brainwrong from mitogen_fetch
issue #615: ansible: import Ansible fetch.py action plug-in
issue #533: include object identity of Stream in repr()
docs: lots more changelog
issue #595: add buildah to docs and changelog.
docs: a few more internals.rst additions
This 4GB limit was already set for MuxProcess and inherited by all
descendants, including the context running on the target host, but it was
not applied to the WorkerProcess router.
That explains why the error from the ticket is being raised by the
router within the WorkerProcess rather than the router on the original
target.
* origin/dmw:
ci: update to Ansible 2.8.3
tests: another random string changed in 2.8.3
tests: fix sudo_flags_failure for Ansible 2.8.3
ci: fix procps command line format warning
The undocumented 'tmp' parameter controls whether _execute_module()
would delete anything on 2.3, so mimic that. This means
_execute_remote_stat() calls will not blow away the temp directory,
which broke the unarchive plugin.
* origin/dmw:
issue #613: must await 'exit' and 'disconnect' in wait=False test
Import LGTM config to disable some stuff
Fix up another handful of LGTM errors.
tests: work around AnsibleModule.run_command() race.
docs: mention another __main__ safeguard
docs: tweaks
formatting error
docs: make Sphinx install soft fail on Python 2.
issue #598: allow disabling preempt in terraform
issue #598: update Changelog.
* origin/dmw:
issue #605: update Changelog.
issue #605: ansible: share a sem_t instead of a pthread_mutex_t
issue #613: add tests for all the weird shutdown methods
Add mitogen.core.now() and use it everywhere; closes #614.
docs: move decorator docs into core.py and use autodecorator
preamble_size: make it work on Python 3.
docs: upgrade Sphinx to 2.1.2, require Python 3 to build docs.
docs: fix Sphinx warnings, add LogHandler, more docstrings
docs: tidy up some Changelog text
The previous version quite reliably caused worker deadlocks within 10
minutes when running:
    # 100 times:
    - import_playbook: integration/async/runner_one_job.yml
    # 100 times:
    - import_playbook: integration/module_utils/adjacent_to_playbook.yml
via .ci/soak/mitogen.sh with PLAYBOOK= set to the above playbook.
Attaching to the worker with gdb reveals it in an instruction
immediately following a futex() call, which likely returned EINTR due to
attaching gdb. Examining the pthread_mutex_t state reveals it to be
completely unlocked.
pthread_mutex_t on Linux should have zero trouble living in shmem, so
it's not clear how this deadlock is happening. Meanwhile POSIX
semaphores are explicitly designed for cross-process use and have a
completely different internal implementation, so try those instead. One
hour of soaking reveals no deadlock.
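As a rough illustration of the approach described above, the following is a minimal sketch of a counter guarded by a process-shared POSIX semaphore living in anonymous shared memory, driven through ctypes. It is not the actual Mitogen code: the SharedCounter class, the SEM_SIZE constant and the memory layout are assumptions made for the example, and it is only expected to work on Linux.

```python
import ctypes
import errno
import mmap

# Load the C library symbols already linked into the interpreter.
libc = ctypes.CDLL(None, use_errno=True)
libc.sem_init.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_uint]
libc.sem_init.restype = ctypes.c_int
libc.sem_wait.argtypes = [ctypes.c_void_p]
libc.sem_wait.restype = ctypes.c_int
libc.sem_post.argtypes = [ctypes.c_void_p]
libc.sem_post.restype = ctypes.c_int

# Assumption: a generous upper bound on sizeof(sem_t); the counter is
# stored immediately after this region.
SEM_SIZE = 64


class SharedCounter(object):
    """Hypothetical counter shared with children created via fork()."""

    def __init__(self):
        # Anonymous mapping; shared with forked children on Unix.
        self.mem = mmap.mmap(-1, mmap.PAGESIZE)
        self.buf = (ctypes.c_char * mmap.PAGESIZE).from_buffer(self.mem)
        self.sem = ctypes.addressof(self.buf)
        self.value = ctypes.c_uint32.from_buffer(self.mem, SEM_SIZE)
        # pshared=1 makes the semaphore usable across processes;
        # the initial value of 1 makes it behave like a lock.
        if libc.sem_init(self.sem, 1, 1) != 0:
            raise OSError(ctypes.get_errno(), 'sem_init failed')

    def next(self):
        """Return the current count and increment it atomically."""
        while libc.sem_wait(self.sem) != 0:
            err = ctypes.get_errno()
            if err != errno.EINTR:
                raise OSError(err, 'sem_wait failed')
        try:
            n = self.value.value
            self.value.value = n + 1
            return n
        finally:
            libc.sem_post(self.sem)
```

A parent would create one SharedCounter before forking workers, and each worker would call next() to obtain a distinct, increasing number.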
This is about avoiding the need to manage a lockable temporary file on
disk to hold our counter, somehow communicating a reference to it to
subprocesses (despite the subprocess module closing inherited fds, etc.),
somehow deleting it reliably at exit, and somehow avoiding concurrent
Ansible runs stepping on the same file. For now ctypes is still less
pain.
A final possibility would be to abandon a shared counter and instead
pick a CPU based on the hash of e.g. the new child's process ID. That
would likely balance equally well, and might be worth exploring when
making this code work on BSD.
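A minimal sketch of that last idea, assuming the set of usable CPUs is already known: the pick_cpu() helper below is hypothetical, and a plain modulo stands in for the hash, which spreads sequentially allocated PIDs round-robin across the CPUs.

```python
def pick_cpu(pid, cpus):
    """Map a process ID onto one of the available CPUs (hypothetical)."""
    cpus = sorted(cpus)
    # Sequential PIDs land on successive CPUs, with no shared state.
    return cpus[pid % len(cpus)]
```

On Linux a new child could then pin itself with something like os.sched_setaffinity(0, {pick_cpu(os.getpid(), os.sched_getaffinity(0))}); a BSD port would need a different affinity call.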