For "ansible -m setup" over a 25ms link, avoids 65 roundtrips and
reduces runtime from 5.7s to 4.1s (-28%).
For "ansible -m setup" over a simulated 250 ms link, reduces runtime
from m27.015s to 0m8.254s (-69%).
This may come back to bite later, but in the meantime it avoids shipping
up to 12KiB of junk metadata for every single task invocation.
For detachment (aka. async), we must ensure the target has two types of
preloads completed (modules and module_utils files) before detaching.
The OpenShift installer modifies /etc/resolv.conf then tests the new
resolver configuration, however, there was no mechanism to reload
resolv.conf in our reuseable interpreter.
https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/openshift_web_console/tasks/install.yml#L137
This inserts an explicit call to res_init() for every new style
invocation, with an approximate cost of ~1usec on Linux since glibc
verifies resolv.conf has changed before reloading it.
There is little to be done for users of the thread-safe resolver APIs,
their state is hidden from us. If bugs like that manifest, whack-a-mole
style 'del sys.modules[thatmod]' patches may suffice.
Traced git log all the way back to beginning of time, and checked
Ansible versions starting Jan 2016. Zero clue where this came from, but
the convention suggests it came from Ansible at some point.
While adding support for non-new style module types, NewStyleRunner
began writing modules to a temporary file, and sys.argv was patched to
actually include the script filename. The argv change was never required
to fix any particular bug, and a search of the standard modules reveals
no argv users. Update argv[0] to be '', like an interactive interpreter
would have.
While fixing #210, new style runner began setting __file__ to the
temporary file path in order to allow apt.py to discover the Ansiballz
temporary directory. 5 out of 1,516 standard modules follow this
pattern, but in each case, none actually attempt to access __file__,
they just call dirname on it. Therefore do not write the contents of
file, simply set it to the path as it would exist, within a real
temporary directory.
Finally move temporary directory creation out of runner and into target.
Now a single directory exists for the duration of a run, and is emptied
by runner.py as necessary after each task invocation.
This could be further extended to stop rewriting non-new-style modules
in a with_items loop, but that's another step.
Finally the last bullet point in the documentation almost isn't a lie
again.
Ideally it would be possible to specify a callback function, but this is
not possible for proxied connections. So simply provide the 3 most
useful modes, defaulting to the most secure.
Closes#127. Closes#134.
mitogen/master.py:
Annotate forwarded log entries with their original source, logger
name, and message.
ansible:
mark stderr in red with -vvv
Tempting to make this appaer 100% of the time, but some crappy
bashrcs may cause lots of junk to be printed.
This implements the first edition of Connection Delegation, where
delegating connection establishment is initially single-threaded.
ansible_mitogen/strategy.py:
ansible_mitogen/plugins/connection/*:
Begin splitting connection.Connection into subclasses, exposing them
directly as "mitogen_ssh", "mitogen_local", etc. connection types.
This is far from removing strategy.py, but it's a tiny start.
ansible_mitogen/connection.py:
* config_from_play_context() and config_from_host_vars() build up a
huge dictionary containing either more or less PlayContext contents,
or our best attempt at reconstructing a host's connection config
from its hostvars, where that config is not the current
WorkerProcess target.
They both produce the same format with the same keys, allowing
remaining code to have a single input format.
These dicts contain fields named after how Ansible refers to them,
e.g. "sudo_exe".
* _config_from_via() parses a basic connection specification like
"username@inventory_name" into one of the aforementioned dicts.
* _stack_from_config() produces a list of dicts describing the order
in which (Mitogen) connections should be established, such that each
element is proxied via= the previous element. The dicts produced by
this function use Mitogen keyword arguments, the former di.
These dicts contain fields named after how Mitogen refers to them,
e.g. "sudo_path".
* Pass the stack to ContextService, which is responsible for actual
setup of the full chain.
ansible_mitogen/services.py:
Teach get() to walk the supplied stack, establishing each connection
in turn, creating refounts for it before continuing.
TODO: refcounting is broken in a variety of cases.
This commit only uses it for the target.get_file() helper, which is only
used for transferring modules. The next commit wires it into the
Connection.transfer_file() API, which is the method the copy module
uses.
The module name comes from YAML via Jinja2.. it's always Unicode. Mixing
it into a temporary directory name produces a Unicode tempdir name,
which ends up in sys.argv via TemporaryArgv.
This is a partial fix, there are still at least 2 cases needing covered:
- In-progress connections must have CallError or similar sent to any
waiters
- Once connection delegation exists, it is possible for other worker
processes to be active (and in any step in the process), trying to
communicate with a context that we know can no longer be communicated
with. The solution to that isn't clear yet.
Additionally ensure root has /bin/bash shell in both Docker images.
And by "compatible" I mean "terrible". This does not implement async job
timeouts, but I'm not going to bother, upstream async implementation is
so buggy and inconsistent it resists even having its behaviour captured
in tests.
Now Connection.close() *must* be called in the worker, to ensure the
reference count for a context drops correctly.
Remove 'discriminator' for now, I'm not using it for testing any more
and it complicated this code.
This code is a car crash, it needs rewritten again. Ideally some/most of
this behaviour could live on services.DeduplicatingService somehow, but
I couldn't come up with a sensible design.
Elements of a with_items loop reuse one WorkerProcess to execute every
iteration, requiring us to reset Connection's idea of the connection on
each iteration, otherwise the tasks will erroneously execute in the
wrong context.
Closes#105.
References #155.
mitogen/service.py:
Refactor services to support individually exposed methods with
different security policies for each method.
- @mitogen.service.expose() to expose a method and set its policy
- @mitogen.service.arg_spec() to validate input.
- Require basic service message format to be a tuple of
`(method, kwargs)`, where kwargs is always a dict.
- Update DeduplicatingService to match the new scheme.
ansible_mitogen/connection.py:
- Rename 'method' to 'method_name' to disambiguate it from the
service.call()'s method= argument.
ansible_mitogen/planner.py:
- Generate an ID for every job, sync or not, and fetch job results
from JobResultService rather than via the initiating function
call's return value.
- Planner subclasses now get to select whether their Runner should
run in a forked process. The base implementation requests this if
the 'mitogen_isolation_mode=fork' task variable is present.
ansible_mitogen/runner.py:
Teach runners to deliver their result via JobResultService executing
in their indirect parent mux process.
ansible_mitogen/plugins/actions/mitogen_async_status.py:
Split the implementation up into methods, and more compatibly
emulate Ansible's existing output.
ansible_mitogen/process.py:
Mux processes now host JobResultService.
ansible_mitogen/services.py:
Update existing services to the new mitogen.service scheme, and
implement JobResultService:
* listen() method for synchronous jobs. planner.invoke() registers a
Sender with the service prior to invoking the job, then sleeps
waiting for the service to write the job result to the
corresponding Receiver.
* Non-blocking get() method for implementing mitogen_async_status
action.
* Child-accessible push() method for delivering task results.
ansible_mitogen/target.py:
New helpers for spawning a virginal subprocess on startup, from
which asynchronous and mitogen_task_isolation=fork jobs are forked.
Necessary to avoid a task inheriting potentially
polluted/monkey-patched parent environment, since remaining jobs
continue to run in the original child process.
docs/ansible.rst:
Add/merge/remove some behaviours/risks.
tests/ansible/integration:
New tests for forking/async.
Before:
$ ANSIBLE_STRATEGY=mitogen ansible -i derp, derp -m setup
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: (''.join(bits)[-300:],)
derp | FAILED! => {
"msg": "Unexpected failure during module execution.",
"stdout": ""
}
After:
$ ANSIBLE_STRATEGY=mitogen ansible -i derp, derp -m setup
derp | UNREACHABLE! => {
"changed": false,
"msg": "EOF on stream; last 300 bytes received: 'ssh: Could not resolve hostname derp: nodename nor servname provided, or not known\\r\\n'",
"unreachable": true
}
* Use identical logic to select when stdout/stderr are merged, so
'stdout', 'stdout_lines', 'stderr', 'stderr_lines' contain the same
output before/after the extension.
* When stdout/stderr are merged, synthesize carriage returns just like
the TTY layer.
* Mimic the SSH connection multiplexing message on stderr. Not really
for user code, but so compare_output_test.sh needs fewer fixups.
Rather than assume any structure about the Python code:
* Delete the exit_json/fail_json monkeypatches.
* Patch SystemExit rather than a magic monkeypatch-thrown exception
* Setup fake cStringIO stdin, stdout, stderr and return those along with
SystemExit exit status
* Setup _ANSIBLE_ARGS as we used to, since we still want to override
that with '{}' to prevent accidental import hangs, but also provide
the same string via sys.stdin.
* Compile the module bytecode once and re-execute it for every
invocation. May change this back again later, once some benchmarks are
done.
* Remove the fixups stuff for now, it's handled by ^ above.
Should support any "somewhat new style" Python module, including those
that just give up and dump stuff to stdout directly.
Refactor planner.py to look a lot more like runner.py. This 'structural
cutpaste' looks messy -- probably we can simplify this code, even though
it's pretty simple already.
* Add helpers.get_file() that calls back up into FileService as
necessary. This is a stopgap measure.
* Add logging to exec_args() to simplify debugging of binary runners.
It looks a lot like multiple calls to _make_tmp_path() will result in
multiple temporary directories on the remote machine, only the last of
which will be cleaned up.
We must be bug-for-bug compatible for now, so ignore the problem in the
meantime.
Implement Connection.__del__, which is almost certainly going to trigger
more bugs down the line, because the state of the Connection instance is
not guranteed during __del__. Meanwhile, it is temporarily needed for
deployed-today Ansibles that have a buggy synchronize action that does
not call Connection.close().
A better approach to this would be to virtualize the guts of Connection,
and move its management to one central place where we can guarantee
resource destruction happens reliably, but that may entail another
Ansible monkey-patch to give us such a reliable hook.
The strategy is reconstructed for every playbook that is included or
specified on the command line, therefore we can't store the global
Router there without losing all our SSH connections across playbooks.
Turns out Ansible can't be trusted to actually check the result
dictionary everywhere it expects one, so put the real exception text
into -vvv output too.
Ansible's PluginLoader makes up bullshit when it imports a module
(mostly because it has to make up something), therefore we ended up with
duplicate copies of ansible_mitogen loaded: one under
ansible.plugins.*.mitogen, and one under the canonical namespace.
Which broke isinstance().
It's used at least by the copy module, even though the result is still
mostly a no-op. _remote_chmod() doesn't accept octal mode, it accepts
symbolic mode. So implement a symbolic parser in helpers.py.
Still needs a ton of work to emulate argument handling, shell selection,
and output emulation in every case. Unsurprisingly, Ansible documents
none of this.
On Python 2.x, operations on pthread objects with a timeout set actually
cause internal polling. When polling fails to yield a positive result,
it quickly backs off to a 50ms loop, which results in a huge amount of
latency throughout.
Instead, give up using Queue.Queue.get(timeout=...) and replace it with
the UNIX self-pipe trick. Knocks another 45% off my.yml in the Ansible
examples directory against a local VM.
This has the potential to burn a *lot* of file descriptors, but hell,
it's not the 1940s any more, RAM is all but infinite. I can live with
that.
This gets things down to around 75ms per playbook step, still hunting
for additional sources of latency.