requests/packages.py just imports urllib3 normally, then makes up new
names for it. pkgutil can't cope with that, and returns the loader
(builtin) for the requests package. The built-in loader obviously can't
find_module() for "requests/packages/urllib3/contrib/pyopenssl" because
it doesn't exist on disk.
If thread A is about to wake as thread B is about to sleep, and A loses
the GIL at an inopportune moment, it was possible for two latches to
share the same socketpair, causing wakeups routed to the wrong latch.
The pair was returned to the 'idle sockets' list before .recv() had been
called. This manifested as TimeoutError() thrown rarely with many active
threads and the host is heavily loaded (such as Travis CI).
Add more documentation and stop writing single wake bytes. Instead the
recipient's identity is written instead, making it simpler to detect
future bugs.
On 2.7 it was "accidentally fine" because the buffer object the StringIO
was initialized from happened to look like ASCII, but in 2.6 either
UCS-2 or UCS-4 is used for that buffer, and so the result was junk.
Just use the io module everywhere if we can, falling back to pure-Python
StringIO for Python<2.6.
Previously it was possible for a thread to call Waker.defer() after
Broker has torns its Waker down, and the underlying file descriptor
reallocated by the OS to some other component.
This manifested as latches of a subsequent test invocation receiving the
waker byte (' ') rather than their expected byte '\x7f'.
This doesn't fix the problem, it just significantly reduces the chance
of it occurring. In future Side.write()/read()/close() must be
synchronized with a lock.
Previously the problem could be reliably triggered with:
while :; do
python tests/call_function_test.py -vf CallFunctionTest.{test_aborted_on_local_broker_shutdown,test_aborted_on_local_context_disconnect}
done
e81b3bd0652b5eb125eb224ceca281b9d540dd5e
The whitelist check must happen /after/ the other checks, otherwise we
unconditionally retunr self for crap like 'ansible.module_utils.json'.
looks like this was just as broken on 2.x, and suddenly we're
finding a bunch more legit Django deps. It seems anywhere
absolute_import appeared in 2.x, we skipped some imports.
The old hack on the master side we had is broken for some reason on 3.x.
Instead tweak the client to be more selective: if a request is for a
module within a package, the package must be loaded (in sys.modules),
and its __loader__ must be us. Previously if the module didn't exist in
sys.modules, we'd still try to fetch from the master, which doesn't
appear to ever make sense.
* ansible: use unicode_literals everywhere since it only needs to be
compatible back to 2.6.
* compat/collections.py: delete this entirely and rip out the parts of
functools that require it.
* Introduce serializable Kwargs dict subclass that translates keys to
Unicode on instantiation.
* enable_debug_logging() must set _v/_vv globals.
* cStringIO does not exist in 3.x.
* Treat IOLogger and LogForwarder input as latin-1.
* Avoid ResourceWarnings in first stage by explicitly closing fps.
* Fix preamble_size.py syntax errors.
This appears to be harmless, except for Python 2.6 on Linux/Travis,
where for some reason (some stdlib change?) simply opening the TTY is
insufficient.
The 'versioner.c' dodging check added in 0ef23d86 was wrong, since the
check occurred on the host machine, when the fix actually needs to apply
to the Darwin target.
Fixes ability to target OS X from a Red Hat controller, manifesting as
an error like:
D mitogen: mitogen.parent.TtyLogStream('local.2472'): 'python(mitogen:dmw@localhost.localdomain:2449): realpath couldn\'t resolve "/usr/bin/python(mitogen:dmw@localhost.localdomain:2449)"'
The "realpath couldn't resolve" error comes from versioner.c:
https://opensource.apple.com/source/perl/perl-104/versioner/versioner.c
It's possible for a message to arrive after .add_handler() but before
Latch construction.
This is papering over a bigger problem with service pool instantiation.
https://travis-ci.org/dw/mitogen/jobs/390409832#L2901
TASK [Spin up a few interpreters] **********************************************
changed: [target] => (item=1)
ERROR! [pid 5355] 14:47:50.224945 E mitogen.ctx.ssh.localhost:2201.sudo.mitogen__user2: mitogen: Router(Broker(0x7f1e93911450))._invoke(Message(19100, 19095, 19095, 110, 1005, '\x80\x02U\x1fmitogen.service.PushFileServiceq\x01U\x11store_and_f'..8955)): <bound method Receiver._on_receive of Receiver(Router(Broker(0x7f1e93911450)), 110)> crashed
Traceback (most recent call last):
File "<stdin>", line 1471, in _invoke
File "<stdin>", line 491, in _on_receive
AttributeError: 'Receiver' object has no attribute '_latch'
If PushService.store_and_forward() loses the race to arrive at a brand
new context first, and the context's main thread is already executing a
CALL_FUNCTION that is blocked on the result of PushService, deadlock
could occur in the old scheme.
Instead (for now) simply spam a thread for each incoming message, and
use the get_or_create_pool() lock to ensure things work out in the end.
This could potentially generate a huge number of threads given the wrong
app, but we'll fix that problem when it appears.
The very first task /must/ be clearing out logging locks, since
_at_fork() functions call LOG.debug() via Side.close(). Additionally,
the root logger is not included in loggerDict, so we must specify it
explicitly.
This is likely to break something, it was definitely needed at some
point, but I never put much effort into figuring out why. Meanwhile,
Python appears to make find_module('ansible.module_utils.facts.')
requests in some circumstances, which causes us to indicate the module
exists while this hack exists.
So remove it, and let's see what breaks.
For lack of a better place to keep the client function, make it a
classmethod of FileService itself for now.
The old _get_file() is removed in a subsequent commit.
This is like FileService but blocks until the file is pushed by a parent
context, with deduplicating behaviour at each level in the hierarchy. It
does not stream large files, so it is only suitable for small files like
Python modules.
Additionally add SerializedInvoker for use with PushFileService, which
ensures all method calls to a single service occur in sequence.
It's easy to call msg.reply() by accident on a message that never had
reply_to set, resulting in a "invalid handle" error message coming from
router. Instead log a more accurate message on the stack that actualy
caused the problem.
With epoll() there is only one kernel-side object per file descriptor,
which is why _control() is such a pain. Since we merge receive/transmit
watching into that single object, we must always test the mask for both
conditions when reading results.
Kqueue isn't/doesn't appear to be like this. The identity of a Kqueue
event is keyed on (fd, filter), and we register a separate event for
both transmit and receive, so the 'elif' in KqueuePoller.poll() does not
appear to need to change.
Previously, a FD marked for read+write would not indicate writeability
until it was no longer readable.
To support detach, we must be able to preload the target with every
module it will need prior to detachment. This implements the
intermediary part of the process (i.e. the Ansible fork parent) --
receiving LOAD_MODULE/FORWARD_MODULE pairs and ensuring they reach the
child.
Bootstrap would hang if (as of writing) a pipe sufficient to hold 42,006
bytes was not handed out by the kernel to the first stage. It was luck
that this didn't manifest before, as first stage could write the full
source and exit completely before reading begun.
It is not clear under which circumstances this could previously occur,
but at least since Linux 4.5, it can be triggered if
/proc/sys/fs/pipe-max-size is reduced from the default of 1MiB, which
can have the effect of capping the default pipe buffer size of 64KiB to
something lower.
Suspicion is that buffer pipe size could also be reduced under memory
pressure, as reference to busy machines appeared a few times in the bug
report.
Ideally it would be possible to specify a callback function, but this is
not possible for proxied connections. So simply provide the 3 most
useful modes, defaulting to the most secure.
Closes#127. Closes#134.
In Ansible, depending on when CTRL+C is triggered, if it occurs after
the connection multiplexer process has forked, and after it has in turn
forked the "connection: local" context and its corresponding "clean fork
parent", since all the broker processes still belong to Ansible's
terminal foreground process group, they are all capable of receiving
SIGINT in response to CTRL+C being pressed on that terminal.
This papers over the problem. Really we want those KeyboardInterrupts to
be logged, to call setsid() frmo the connection multiplexer process to
isolate it from the terminal foreground process group. That way its only
indication of top-level process shutdown is using the graceful
disconnect mechanism that already exists in process.py::worker_main().
_sockets only refers to the idle sockets list, it doesn't refer to every
socket currently in use by a Latch, for example, the 2*16 used by e.g.
Ansible's sleeping service pool.
The GIL could be lost between the check for an empty list and popping a
socket off the list. Previously _tls_init (per its name) used per-thread
storage, hence the bug.
machinectl does not support any sensible form of pipe to the child
process, so it is necessary to bypass it when talking to a systemd
container (see systemd/systemd#8850).
This can also form the basis for issue #223, where the post-fork
namespace switching dance required to connect to the Pythonless
container will be the same.
mitogen/master.py:
Annotate forwarded log entries with their original source, logger
name, and message.
ansible:
mark stderr in red with -vvv
Tempting to make this appaer 100% of the time, but some crappy
bashrcs may cause lots of junk to be printed.
This change blocks off 2 common scenarios where a race condition is
upgraded to a hang, when the library could internally do better.
* Since we don't know whether the receiver of a `reply_to` is expecting
a raw or pickled message, and since in the case of a raw reply, there
is no way to signal "dead" to the receiver, override the reply_to
field to explicitly mark a message as dead using a special handle.
This replaces the serialized _DEAD sentinel value with a slightly
neater interface, in the form of the reserved IS_DEAD handle, and
enables an important subsequent change: when a context cannot route a
message, it can send a generic 'dead' reply back towards the message
source, ensuring any sleeping thread is woken with ChannelError.
The use of this field could potentially be extended later on if
additional flags are needed, but for now this seems to suffice.
* Teach Router._invoke() to reply with a dead message when it receives a
message for an invalid local handle.
* Teach Router._async_route() to reply with a dead message when it
receives an unroutable message.
There is no guarantee on the ordering select() returns file descriptors.
So if, e.g. in the case of sudo_nonexistent.yml, sudo prints an error
to a single FD before exitting, there was previously no gurantee
iter_read() would read off the error before failing due to detecting
disconnect on any FD.
Now instead we keep reading while any non-disconnected FD exists.
Presently there is still no mechanism to add :attr:`tty_stream` to the
multiplexer after connection is successful, but for now it's not
expected that anything will be logged to it anyway.
Closes#148.
Now Connection.close() *must* be called in the worker, to ensure the
reference count for a context drops correctly.
Remove 'discriminator' for now, I'm not using it for testing any more
and it complicated this code.
This code is a car crash, it needs rewritten again. Ideally some/most of
this behaviour could live on services.DeduplicatingService somehow, but
I couldn't come up with a sensible design.
Lots of "invalid handle: ..., 102" messages started appearing during
exit recently because ordering changed slightly, and local handles were
sent _DEAD even though the broker loop was still progressing through
shutdown.
The "shutdown" event is too early to close handles: it is the start of
the grace period where streams and downstream contexts can finish up any
work and deliver buffered data, including FORWARD_LOG messages that
haven't arrived yet.
So instead,
- move the _DEAD logic to the "exit" event,
- get rid of Context.on_shutdown() entirely, it's been unused for over
a month,
- get rid of the "crash" event, since it always fires prior to "exit",
and its only use was to send _DEAD to local handles, which now happens
during exit anyway.
Closes#105.
References #155.
mitogen/service.py:
Refactor services to support individually exposed methods with
different security policies for each method.
- @mitogen.service.expose() to expose a method and set its policy
- @mitogen.service.arg_spec() to validate input.
- Require basic service message format to be a tuple of
`(method, kwargs)`, where kwargs is always a dict.
- Update DeduplicatingService to match the new scheme.
ansible_mitogen/connection.py:
- Rename 'method' to 'method_name' to disambiguate it from the
service.call()'s method= argument.
ansible_mitogen/planner.py:
- Generate an ID for every job, sync or not, and fetch job results
from JobResultService rather than via the initiating function
call's return value.
- Planner subclasses now get to select whether their Runner should
run in a forked process. The base implementation requests this if
the 'mitogen_isolation_mode=fork' task variable is present.
ansible_mitogen/runner.py:
Teach runners to deliver their result via JobResultService executing
in their indirect parent mux process.
ansible_mitogen/plugins/actions/mitogen_async_status.py:
Split the implementation up into methods, and more compatibly
emulate Ansible's existing output.
ansible_mitogen/process.py:
Mux processes now host JobResultService.
ansible_mitogen/services.py:
Update existing services to the new mitogen.service scheme, and
implement JobResultService:
* listen() method for synchronous jobs. planner.invoke() registers a
Sender with the service prior to invoking the job, then sleeps
waiting for the service to write the job result to the
corresponding Receiver.
* Non-blocking get() method for implementing mitogen_async_status
action.
* Child-accessible push() method for delivering task results.
ansible_mitogen/target.py:
New helpers for spawning a virginal subprocess on startup, from
which asynchronous and mitogen_task_isolation=fork jobs are forked.
Necessary to avoid a task inheriting potentially
polluted/monkey-patched parent environment, since remaining jobs
continue to run in the original child process.
docs/ansible.rst:
Add/merge/remove some behaviours/risks.
tests/ansible/integration:
New tests for forking/async.
These are generated by any child calling .context_by_id() without
previously knowing the context's name, such as Contexts set in
ExternalContext.master and ExternalContext.parent.
When a stream (such as unix.connect()) has its auth_id set to the
current context's, we should allow those requests too, since the request
is working with the privilege of the current context.
Benefits:
- More correct than re.sub()
- Better handling of trailing whitespace
- Recognises doc-strings regardless of quoting style
Limitations:
- Still not entirely correct
- Creates a syntax error when function/class body is only a docstring
- Doesn't handle indented docstrings yet
- Slower by 50x - 8-10 ms vs 0.2 ms for re.sub()
- Not much scope for improving this, tokenize is 100% pure Python
- Complex state machine, harder to understand
- Higher line count in parent.py
- Untested with Mitogen parent on Python 2.x and child on Python 2.x+y
No change
- Only requires Python stdlib modules
This code path is probably only necessary during development, but it
prevents tracebacks (etc.) getting written over the Stream socket, which
naturally causes corruption.
Instead keep whatever the parent has for stderr, manually write a
traceback there and hard exit.