522 Commits (6bdcbff7b2284d783ef4ede85cb656d8f35fabf1)

Author SHA1 Message Date
David Wilson f78a5f08c6 issue #605: ansible: share a sem_t instead of a pthread_mutex_t
The previous version quite reliably causes worker deadlocks within 10
minutes running:

    # 100 times:
    - import_playbook: integration/async/runner_one_job.yml
    # 100 times:
    - import_playbook: integration/module_utils/adjacent_to_playbook.yml

via .ci/soak/mitogen.sh with PLAYBOOK= set to the above playbook.

Attaching to the worker with gdb reveals it in an instruction
immediately following a futex() call, which likely returned EINTR due to
attaching gdb. Examining the pthread_mutex_t state reveals it to be
completely unlocked.

pthread_mutex_t on Linux should have zero trouble living in shmem, so
it's not clear how this deadlock is happening. Meanwhile POSIX
semaphores are explicitly designed for cross-process use and have a
completely different internal implementation, so try those instead. 1
hour of soaking reveals no deadlock.

This is about avoiding managing a lockable temporary file on disk to
contain our counter, and somehow communicating a reference to it into
subprocesses (despite the subprocess module closing inherited fds, etc),
somehow deleting it reliably at exit, and somehow avoiding concurrent
Ansible runs stepping on the same file. For now ctypes is still less
pain.

A final possibility would be to abandon a shared counter and instead
pick a CPU based on the hash of e.g. the new child's process ID. That
would likely balance equally well, and might be worth exploring when
making this code work on BSD.
5 years ago
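The last alternative in the message above, dropping the shared counter and deriving the CPU from the new child's process ID, might look roughly like the sketch below. The name `pick_cpu` and the plain modulo are assumptions made for illustration; nothing here is taken from the project's code.

```python
import os


def pick_cpu(pid, cpu_count):
    """Map a process ID onto a CPU index in the range [0, cpu_count)."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    # PIDs are often handed out sequentially, so a plain modulo tends to
    # spread consecutive children across the available CPUs without any
    # state shared between processes.
    return pid % cpu_count


if __name__ == "__main__":
    count = os.cpu_count() or 1
    print("pid %d -> cpu %d" % (os.getpid(), pick_cpu(os.getpid(), count)))
```

Because each child computes its own CPU, no semaphore, mutex or lock file is needed, at the cost of a less even balance than a true counter would give.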
David Wilson 5af6c9b26f issue #615: use FileService for target->controller file transfers 5 years ago
David Wilson 6f12980611 [linear2] merge fallout: re-enable _send_module_forwards(). 5 years ago
David Wilson 5298e87548 Split out and make readable more log messages across both packages 5 years ago
David Wilson 0f23a90d50 ansible: log affinity assignments 5 years ago
David Wilson 4f051a38a7 ansible: improve docstring 5 years ago
David Wilson 5811909c8d [linear2] simplify _listener_for_name() 5 years ago
David Wilson c68dbdd569 ansible: stop relying on SIGTERM to shut down service pool
It's no longer necessary, since connection attempts are no longer truly
blocking. When CTRL+C is hit in the top-level process, broker will begin
shutdown, which will cancel all pending connection attempts, causing
pool threads to wake. The pool can't block during shutdown anymore.
5 years ago
David Wilson f4ca926b21 ansible: cleanup various docstrings 5 years ago
David Wilson edde251d58 issue #549: ansible: reduce risk by capping RLIM_INFINITY 5 years ago
David Wilson d408caccf5 issue #573: guard against a forked top-level Ansible process
See comment.
5 years ago
David Wilson 3ceac2c9ed [linear2] simplify ClassicWorkerModel and fix repeat initialization
"self.initialized = False" slipped in a few days ago, on second thoughts
that flag is not needed at all, by simply rearranging ClassicWorkerModel
to have a regular constructor.

This hierarchy is still squishy and needs more love. Remaining
MuxProcess class attributes should be eliminated.
5 years ago
David Wilson 395b03a77d issue #549: fix setrlimit() crash and hard-wire OS X default
OS X advertised unlimited, but really it means kern.maxfilesperproc.
5 years ago
David Wilson 33bceb6eb4 issue #602: recover task_vars for synchronize and meta: reset_connection 5 years ago
David Wilson 6b4bcf4fe0 ansible: remove cutpasted docstring 5 years ago
David Wilson 619f4dee07 [linear2] merge fallout: restore optimization from #491 / 7b129e857 5 years ago
David Wilson e4321f81a0 issue #600: /etc/environment may be non-ASCII in an unknown encoding 5 years ago
David Wilson 75d179e4b9 remove unused imports flagged by lgtm 5 years ago
David Wilson c80fddd487 [linear2]: merge fallout flagged by LGTM 5 years ago
David Wilson eeb7150f24 issue #549: increase open file limit automatically if possible
While it is not possible to catch every case of "open file limit exceeded",
we can at least raise the soft limit to the available hard limit without
any user effort.

Do this in the Ansible top-level process, even though we probably only need
it in the MuxProcess. There seems to be no reason this could hurt.
5 years ago
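A minimal sketch of raising the soft open-file limit to the hard limit, as the message above describes, using the standard `resource` module. The helper names and the `cap` value used when the hard limit is reported as unlimited are assumptions for illustration, not the values the project uses.

```python
import resource


def target_soft_limit(soft, hard, cap=10240):
    """Return the soft limit to request for the given (soft, hard) pair.

    `cap` is a hypothetical placeholder applied when the hard limit is
    reported as unlimited, since some systems advertise "unlimited"
    without honouring it.
    """
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard == resource.RLIM_INFINITY:
        return max(soft, cap)  # never lower an existing soft limit
    return hard


def raise_open_file_limit():
    """Try to raise RLIMIT_NOFILE; return the soft limit now in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    wanted = target_soft_limit(soft, hard)
    if wanted != soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            # The kernel rejected the request; keep the old limit.
            return soft
    return wanted
```

Keeping the calculation in a pure function makes the capping behaviour easy to check without touching the limits of the running process.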
David Wilson acab26d796 ansible: improve process.py docs 5 years ago
David Wilson 4dfbe82e76 tests: hide ugly error during Ansible tests 5 years ago
David Wilson 108015aa22 ansible: gracefully handle failure to connect to MuxProcess
It's possible to hit an ugly exception during an early CTRL+C.
5 years ago
David Wilson bf1f3682aa ansible: pin per-CPU muxes to their corresponding CPU
This slightly breaks the old scheme, in that CPU 1 may now end up with a
mux and the top-level process pinned to it.
5 years ago
David Wilson dc9f4e89e6 ansible: reap mux processes on shut down
Previously we exited without calling waitpid(), which meant the
top-level process struct rusage did not reflect the resource usage
consumed by the multiplexer processes.

Existing benchmarks are made using perf so this never created a problem,
but it could be confusing to others using the "time" command. Reaping
also allows logging the final exit status of the process.
5 years ago
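The effect described above can be observed with a small sketch: a child's CPU time is only added to the parent's `RUSAGE_CHILDREN` totals once the child has been waited for. This example uses `subprocess` for brevity instead of the fork-based multiplexer processes in the commit, and `run_and_reap` is a hypothetical helper name.

```python
import resource
import subprocess
import sys


def run_and_reap(argv):
    """Start a child, reap it, and return (exit_status, child_cpu_seconds)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    proc = subprocess.Popen(argv)
    # wait() reaps the child via waitpid(); only after it returns is the
    # child's usage included in the parent's RUSAGE_CHILDREN totals.
    status = proc.wait()
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime + after.ru_stime) - (before.ru_utime + before.ru_stime)
    return status, cpu


if __name__ == "__main__":
    status, cpu = run_and_reap([sys.executable, "-c", "sum(range(10**6))"])
    print("exit status %d, child CPU %.3fs" % (status, cpu))
```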
David Wilson 136dee1fb4 [linear2] more merge fallout, fix Connection._mitogen_reset(mode=) 5 years ago
David Wilson a9755d4ad0 [linear2] update mitogen_get_stack for new _build_stack() return value 5 years ago
David Wilson 1fca0b7a94 [linear2] fix MuxProcess test fixture and some merge fallout 5 years ago
David Wilson 0f63ca4c68 Make setting affinity optional. 5 years ago
David Wilson 9035884c77 ansible: abstract worker process model.
Move all details of broker/router setup out of connection.py, instead
deferring it to a WorkerModel class exported by process.py via
get_worker_model(). The running strategy can override the configured
worker model via _get_worker_model().

ClassicWorkerModel is installed by default, which implements the
extension's existing process model.

Add optional support for the third party setproctitle module, so
children have pretty names in ps output.

Add optional support for per-CPU multiplexers to classic runs.
5 years ago
David Wilson 402dba4197 module_finder: pass raw file to compile()
Newer Ansibles have e.g. UTF-8 present in apt.py.
5 years ago
David Wilson 1aceacf89e [stream-refactor] replace old detach_popen() reference 5 years ago
David Wilson 300f8b2ff9 ansible: fixturize creation of MuxProcess
This relies on the previous commit resetting global variables.

Update clean_shutdown() to handle duplicate calls, due to tests
repeatedly installing it.
5 years ago
David Wilson 26b6333787 [stream-refactor] fix unix.Listener construction 5 years ago
Jordan Webb 1a02a86331 Add buildah transport 6 years ago
David Wilson 7ae926b325 ansible: prevent tempfile.mkstemp() leaks.
This avoids a leak present in Ansible 2.7.0..current HEAD, and all
similar leaks.

See ansible/ansible#57327.
6 years ago
David Wilson 3620fce071 issue #593: expose configurables for SSH keepalive and increase the default 6 years ago
David Wilson 0b7fd3f290 issue #591: ansible: restore CWD prior to AnsibleModule initialization. 6 years ago
David Wilson 4f23f0bec1 issue #590: update comment to indicate the hack is permanent 6 years ago
David Wilson 1a92995a24 issue #590: include nasty workaround for sys.modules junk 6 years ago
David Wilson 92b4724010 issue #587: consistent become_exe() behaviour for older Ansibles. 6 years ago
David Wilson f35194fe0f issue #587: mitogen_doas should not become_exe for doas_path
Looks like this has always been wrong - when used as a connection
method, PlayContext.become_method/become_exe may hold totally unrelated
data.
6 years ago
David Wilson c1c8d5c31e issue #587: 2.8 PlayContext lacks sudo_flags attribute.
This is a huge bodge.
6 years ago
David Wilson e11b251c75 issue #587: 2.8 PluginLoader.get() introduced new collection_list kwarg 6 years ago
David Wilson 46dde95962 issue #587: 2.8 PlayContext.connection no longer contains connection name
Not clear what the intention is here. Either need to ferret it out of
some other location, or just stop preloading the connection class in the
top-level process.
6 years ago
David Wilson 4a614c3950 issue #587: bump max Ansible version 6 years ago
David Wilson f105a81e20 ansible: descriptive version check during startup. 6 years ago
David Wilson f30a4c05c8 issue #581: expose mitogen_mask_remote_name variable. 6 years ago
David Wilson 65deb3feac issue #575: fix exception text rendering 6 years ago
David Wilson 34fb9da1be issue #570: add firewalld to always-fork list for now. 6 years ago
David Wilson 3ff6123483 issue #557: support correct cpu_set_t size 6 years ago
David Wilson 2bd0bbd4df issue #555: ansible: workaround ancient reload(sys) hack.
This is the most minimal change for what might be a relatively minor
edge case. The alternative is replacing reload(), but let's not do that yet.

Closes #555
6 years ago
David Wilson 6309774be2 issue #554: fix Ansible 2.4 compatibility 6 years ago
David Wilson 7743e57ff3 issue #554: track and remove multiple make_tmp_path() calls. 6 years ago
David Wilson 7dacb68eeb issue #552: include process identity in log messages. 6 years ago
David Wilson 26e6194d0a issue #548: always treat transport=smart as 'ssh' for mitogen_via=.
The idea behind transport=smart is to select between paramiko and
OpenSSH given the availability of connection multiplexing and/or OSX
kernel bugs. We need to make no such choice.
6 years ago
David Wilson 458a4faa97 ansible: create stub __init__.py for sdist.
This went into the 0.2.5 sdist tarball but it's not checked in.
6 years ago
David Wilson 8f9c67daf1 ansible: refactor affinity class and add abstract tests. 6 years ago
David Wilson 0f30808234 ansible: quiesce boto logger; closes #541. 6 years ago
David Wilson 7fd0d34910 tests/ansible: Spec.port() test & mitogen_via= fix.
ansible_ssh_port was not respected.
6 years ago
David Wilson 1f77d24bec Update copyright year everywhere. 6 years ago
David Wilson b5b23e8f3d tests/ansible: Spec.become_pass() test. 6 years ago
David Wilson ae5a471e31 issue #539: disable logger propagation. 6 years ago
David Wilson 1c955a9876 ansible: capture stderr stream of async tasks. Closes #540. 6 years ago
David Wilson 7ff4e6694c issue #536: rework how 2.3-compatible simplejson is served
Regardless of the version of simplejson loaded in the master, load up
the ModuleResponder cache with our 2.4-compatible version.

To cope with simplejson being loaded due to modules like ec2_group that
try to import it before importing 'json', also update target.py to
remove it from the whitelist if a local 'json' module import succeeds.
6 years ago
David Wilson 8ae6ca1d5b tests/ansible: Spec.become_method() test & mitogen_via= fix.
ansible_become_method hostvar was not taken into account.
6 years ago
David Wilson d1cadf8ac8 tests/ansible: Spec.password() test, document interactive pw limitation. 6 years ago
David Wilson 21ad299d7b tests/ansible: Spec.remote_user() test & mitogen_via= fix.
ansible_ssh_user precedence was incorrect.
6 years ago
David Wilson 748f5f675d tests/ansible: Spec.remote_addr() test & mitogen_via= fix.
ansible_ssh_host was not respected.
6 years ago
David Wilson e1df98168c issue #536: add mitogen_via= tests too. 6 years ago
David Wilson 604b418412 ansible: fix a crash on 2.3 when mitogen_via= host is missing. 6 years ago
David Wilson 001e3fee86 issue #536: restore correct Python interpreter selection behaviour. 6 years ago
David Wilson 05b1ccb658 ansible: stash PID files in CWD if requested for debugging. 6 years ago
David Wilson eb67fbe9d2 ansible: double the default pool size.
Tempted to push this up to 64, but let's do it incrementally just in
case.
6 years ago
David Wilson b89e53fd70 ansible: raise error with correct exception type. 6 years ago
David Wilson 0e193c223c issue #508: master: minify all Mitogen/ansible_mitogen sources.
Minify-safe files are marked with a magical "# !mitogen: minify_safe"
comment anywhere in the file, which activates the minifier. The result
is naturally cached by ModuleResponder, therefore lru_cache is gone too.

Given:

    import os, mitogen
    @mitogen.main()
    def main(router):
        c = router.ssh(hostname='k3')
        c.call(os.getpid)
        router.sudo(via=c)

SSH footprint drops from 56.2 KiB to 42.75 KiB (-23.9%)
Ansible "shell: hostname" drops 149.26 KiB to 117.42 KiB (-21.3%)
6 years ago
David Wilson 7badb4a25b ansible: hacky parser to allow bools to be specified on command line 6 years ago
David Wilson b499fbe29b ansible: add mitogen_ssh_compression variable. 6 years ago
David Wilson a2ae4ed696 SyntaxError. 6 years ago
David Wilson a9d48a8fdc ansible: don't pin controller if <4 cores. 6 years ago
David Wilson 4531338b12 ansible: document and make affinity stuff portable to non-Linux
Portable in the sense that it does nothing on non-Linux platforms, at least for now.
6 years ago
David Wilson de5c050707 ansible: fix affinity.py test failure on 2 cores. 6 years ago
David Wilson 00ae90b2b2 ansible: preheat PluginLoader caches before fork.
This has been broken for some time, but somehow it has become noticeable
on recent Ansible.

loop-100-tasks.yml before:
      15.532724001 seconds time elapsed
       8.453850000 seconds user
       5.808627000 seconds sys

loop-100-tasks.yml after:
       8.991635735 seconds time elapsed
       5.059232000 seconds user
       2.578842000 seconds sys
6 years ago
David Wilson 7b129e8576 ansible: use Poller for WorkerProcess; closes #491. 6 years ago
David Wilson c6d5aa29ba ansible: new multiplexer/workers configuration
Following on from 152effc26c9a5918cb7ead7a97fe7fa7f81b6764,

* Pin mux to CPU 0
* Pin top-level process to CPU 1
* Pin workers sequentially to CPU 2..n

Nets a 19.5% improvement on issue_140__thread_pileup.yml when targeting
64 Docker containers on the same 8 core/16 thread machine.

Before (prior to last scheme, no affinity at all):

    2294528.731458      task-clock (msec)         #    6.443 CPUs utilized
        10,429,745      context-switches          #    0.005 M/sec
         2,049,618      cpu-migrations            #    0.893 K/sec
         8,258,952      page-faults               #    0.004 M/sec
 5,532,719,253,824      cycles                    #    2.411 GHz                      (83.35%)
 3,267,471,616,230      instructions              #    0.59  insn per cycle
                                                  #    1.22  stalled cycles per insn  (83.35%)
   662,006,455,943      branches                  #  288.515 M/sec                    (83.33%)
    39,453,895,977      branch-misses             #    5.96% of all branches          (83.37%)

     356.148064576 seconds time elapsed

After:

    2226463.958975      task-clock (msec)         #    7.784 CPUs utilized
         9,831,466      context-switches          #    0.004 M/sec
           180,065      cpu-migrations            #    0.081 K/sec
         5,082,278      page-faults               #    0.002 M/sec
 5,592,548,587,259      cycles                    #    2.512 GHz                      (83.35%)
 3,135,038,855,414      instructions              #    0.56  insn per cycle
                                                  #    1.32  stalled cycles per insn  (83.32%)
   636,397,509,232      branches                  #  285.833 M/sec                    (83.30%)
    39,135,441,790      branch-misses             #    6.15% of all branches          (83.35%)

     286.036681644 seconds time elapsed
6 years ago
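The pinning scheme listed in the message above (mux on CPU 0, top-level process on CPU 1, workers on CPU 2..n) can be sketched as a small planning function. The names `plan_affinity` and `pin_current_process`, the wrap-around once workers outnumber the remaining CPUs, and the refusal to plan on machines with fewer than three CPUs are assumptions for illustration.

```python
import os


def plan_affinity(cpu_count, worker_count):
    """Return (mux_cpu, top_level_cpu, worker_cpus), or None if too few CPUs."""
    if cpu_count < 3:
        return None  # assumption: not enough CPUs to separate the roles
    spare = cpu_count - 2
    # Workers take CPUs 2..n in order, wrapping around when exhausted.
    workers = [2 + (i % spare) for i in range(worker_count)]
    return 0, 1, workers


def pin_current_process(cpu):
    """Pin the calling process to one CPU where the platform supports it."""
    if hasattr(os, "sched_setaffinity"):  # Linux only
        os.sched_setaffinity(0, {cpu})
        return True
    return False  # do nothing elsewhere


if __name__ == "__main__":
    print(plan_affinity(os.cpu_count() or 1, 4))
```

Separating the plan from the system call keeps the policy testable on any machine, while the actual pinning stays a no-op on platforms without `sched_setaffinity`.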
David Wilson 1b909e8697 ansible: pin connection multiplexer to a single core
Nets a reliable 8% improvement in issue_140__thread_pileup.yml when
targeting 64 Docker containers on the same 8 core/16 thread machine.

Before:
    2294528.731458      task-clock (msec)         #    6.443 CPUs utilized
        10,429,745      context-switches          #    0.005 M/sec
         2,049,618      cpu-migrations            #    0.893 K/sec
         8,258,952      page-faults               #    0.004 M/sec
 5,532,719,253,824      cycles                    #    2.411 GHz                      (83.35%)
 4,001,276,805,120      stalled-cycles-frontend   #   72.32% frontend cycles idle     (83.30%)
 2,024,159,442,463      stalled-cycles-backend    #   36.59% backend cycles idle      (66.65%)
 3,267,471,616,230      instructions              #    0.59  insn per cycle
                                                  #    1.22  stalled cycles per insn  (83.35%)
   662,006,455,943      branches                  #  288.515 M/sec                    (83.33%)
    39,453,895,977      branch-misses             #    5.96% of all branches          (83.37%)

     356.148064576 seconds time elapsed

After:
    2208247.938562      task-clock (msec)         #    6.735 CPUs utilized
         8,489,840      context-switches          #    0.004 M/sec
         1,432,967      cpu-migrations            #    0.649 K/sec
         7,508,957      page-faults               #    0.003 M/sec
 5,477,293,750,357      cycles                    #    2.480 GHz                      (83.31%)
 3,984,360,350,811      stalled-cycles-frontend   #   72.74% frontend cycles idle     (83.32%)
 1,976,646,418,711      stalled-cycles-backend    #   36.09% backend cycles idle      (66.64%)
 3,196,197,480,792      instructions              #    0.58  insn per cycle
                                                  #    1.25  stalled cycles per insn  (83.36%)
   648,247,332,967      branches                  #  293.557 M/sec                    (83.35%)
    39,004,881,070      branch-misses             #    6.02% of all branches          (83.37%)

     327.876903668 seconds time elapsed
6 years ago
David Wilson e587396e70 ansible: hook strategy and worker processes into profiler 6 years ago
David Wilson 84944a9a61 ansible: ensure MuxProcess MITOGEN_PROFILING results reach disk.
This has been broken for quite some time.
6 years ago
David Wilson 954f874085 issue #527: catch new-style module tracebacks like vanilla. 6 years ago
David Wilson a1121c5a84 issue #499: respect C.BECOME_ALLOW_SAME_USER. 6 years ago
David Wilson be6ab52fe1 issue #488: fix shutdown damage caused in 6ca2677de5
os._exit() subverted calm shutdown, meaning unix.Listener never had a
chance to clean up its socket.

Move unix.Listener socket cleanup into its class so it is automatic
during shutdown, rather than cutpasted for each consumer.

Disable the watcher thread in the MuxProcess, since it is useless there.

Add .sock extension to /tmp/mitogen_unix_*, so we can write a test.
6 years ago
David Wilson 38a553d42d issue #490: prevent double close() destroying unrelated Connection. 6 years ago
David Wilson e7fe95af88 issue #477: fix sudo_args selection. 6 years ago
David Wilson 599da0689a issue #477 / ansible: avoid a race in async job startup.
Ansible 2.3/Python 2.4 work revealed there is no guarantee a slow target
will have written the initial job status file out before a fast
controller makes an initial check for it. Therefore, provide AsyncRunner
with a sender it should send a message to when the initial job file has
been written.

As a bonus, also catch and report exceptions happening early in
AsyncRunner, rather than leaving them to end up in -vvv output.
6 years ago
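The race fix described above replaces "check for the file and hope it exists" with an explicit notification once the initial job file is on disk. The sketch below shows that pattern with a thread and a `queue.Queue` standing in for the target and the sender; the real change passes a Mitogen sender to AsyncRunner, and all names here are hypothetical.

```python
import json
import os
import queue
import tempfile
import threading


def start_job(job_path, started):
    """Stand-in for the slow target: write the status file, then notify."""
    with open(job_path, "w") as fp:
        json.dump({"started": 1, "finished": 0}, fp)
    # Only signal after the file is fully written and closed.
    started.put(job_path)


def launch_and_wait(timeout=5.0):
    """Stand-in for the fast controller: wait for the signal, then read."""
    with tempfile.TemporaryDirectory() as tmpdir:
        job_path = os.path.join(tmpdir, "job.json")
        started = queue.Queue()
        worker = threading.Thread(target=start_job, args=(job_path, started))
        worker.start()
        # Block on the notification instead of polling for the file, so a
        # slow writer can never be overtaken by the first status check.
        started.get(timeout=timeout)
        with open(job_path) as fp:
            status = json.load(fp)
        worker.join()
        return status


if __name__ == "__main__":
    print(launch_and_wait())
```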
David Wilson 0175052099 issue #477: fix source of become_flags on 2.3. 6 years ago
David Wilson 97f3cfe4f4 issue #477: target.file_exists() wrapper.
os.path.exists physical module name varies across major Python versions.
6 years ago
David Wilson 8f5b65f7ec issue #477: introduce subprocess isolation.
Since Python 2.4 fork is so defective, we must use subprocesses for
mitogen_task_isolation=fork. This has plenty of upside, since the long
term goal is to dump forking altogether. This allows a gentle
introduction of its replacement.
6 years ago
David Wilson b9924683ac ansible: docstring fixes. 6 years ago
David Wilson 75f53faf8c issue #477: shlex.split() in 2.4 required bytes input. 6 years ago
David Wilson dc1d4251e3 ansible: synchronize module needs '.docker_cmd' attr for Docker plugin. 6 years ago