This is needed to cope Ansible 2.3 doing weird stuff as usual. It serves
up __init__.py for ansible and ansible.module_utils as hard-coded
namespace packages, the real ansible/__init__.py on disk is not 2.4
compatible.
Making CallError inherit from object broke 'raise CallError()'.
Instead use pure-Python pickler on 2.4 (grmbl) and force it to emit
new-style-alike output for what is otherwise a classic class.
Remove needless complexity from _unpickle_call_error() that only worked
for new-style classes.
- don't try anything unless something really lives in sys.modules by
that name
- non-ASCII files are possible
- the unimportable thing might be an extension module, we don't want
that
For join_thread():
Exception in thread mitogen.master.join_thread_async:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/dmw/src/mitogen/mitogen/master.py", line 249, in _watch
watcher.on_join()
File "/home/dmw/src/mitogen/mitogen/master.py", line 816, in shutdown
super(Broker, self).shutdown()
File "/home/dmw/src/mitogen/mitogen/core.py", line 2741, in shutdown
self.defer(_shutdown)
File "/home/dmw/src/mitogen/mitogen/core.py", line 2142, in defer
raise Error(self.broker_shutdown_msg)
Error: An attempt was made to enqueue a message with a Broker that has already exitted. It is likely your program called Broker.shutdown() too early.
Allow messages to continue being queued during the shutdown period,
right up until the final loop iteration, even though this is racy, as
too many things depend on .defer() during exit right now.
This doesn't hurt the spirit of the check: it still catches the worst
situation where $user accidentally shut down Broker then tried to
continue using it.
Python at some point (at least since https://bugs.python.org/issue14605)
began populating sys.meta_path with its internal importer classes,
meaning that interpreters no longer start with an empty sys.meta_path.
Ideally it would only be called once, and in future maybe it can, but
right now we need to cope with these cases:
* Downstream parent notifies us of disconnection (DEL_ROUTE)
* We notify ourself of disconnection
* We notify ourself and so does downstream parent
It's case 3 that causes the error.
When Stream.connect() fails, have it just use on_disconnect(). Now there
is a single disconnect cleanup path.
Remove cutpasted DiagLogStream setup/destruction, and move it into the
base class (temporarily), and only manage the lifetime of its underlying
FD via Side.close(). This cures another EBADF failure.
The previous approach was crap since it left e.g. socketpair instances
lying around for GC with their underlying FD already closed, coupled
with FD number reuse, led to random madness when GC finally runs.
Using _lock we can know for certain whether the Broker has received a
wakeup byte yet. If it has, we can skip the wasted system call.
Now on_receive() can exactly read the single byte that can possibly
exist (modulo FD sharing bugs -- this could be improved on later)
Now poller is start enough to know a start_receive() during an iteration
does not cause events yielded by that iteration to associate with the
wrong descriptor.
These changes are tangentially related to the associated ticket, but
event versioning is still the underlying issue.
The user@host prefix in new-style OpenSSH messages unfortunately takes
the host part from ~/.ssh/config and friends. There is no way to know
which hostname will appear in this string without parsing the OpenSSH
config, nor which username will appear.
Instead just regex it.
Add SSH stub modes to print the new/old errors and add some simple
tests.
This extends the work done in b9112a9cbb
Receiving DEL_ROUTE without a corresponding ADD_ROUTE is now legit
behaviour, so don't print an error in this case.
Don't print an error for dropped messages if the reply_to indicates the
sender doesn't care about a response (dead and no_reply)
Earlier commit moved Stream.routes attribute into a private map
belonging to RouteMonitor, to make upgrades smoother. This adds a new
accessor method to RouteMonitor.
Now rather than simply propagate DEL_ROUTE upwards towards the parent,
we broadcast it downward to any stream that ever sent a message toward
any of the routes that have just become disconnected.
When unpickling a context, arrange for there to be a single instance
representing that context, managed by the corresponding router. This
context_by_id() was already in use by parent.py, it just needs to move
down.
This to eventually reach the point where a single Context exists that
needs 'disconnect' fired on it, so all sleeping receivers are definitely
woken.
(Pull #377)
Changes:
- additional_parameters -> extra_args
- Merge with kubectl changes from dmw branch
- Update docs
- Remove unused username class member
- Avoid mutable kubectl_args class member
- Use six.iteritems
This change allows the kubectl connector to support the same options as
Ansible's original connector.
The playbook sample comes with an example of a pod containing two containers
and checking that moving from one container to another, the version of Python
changes as expected.
OpenSSH 7.5 changed the text of the permission denied message. As a
result ssh_test.SshTest.test_password_required and test_pubkey_required
were failing on an Ubuntu 18.04 client, which ships OpenSSH 7.6.
Refs
- https://bugzilla.mindrot.org/show_bug.cgi?id=2720
In some scenarios, Ansible's worker seems to exit early, resulting in
EPIPE during .recv() or .send(). Log an error and gracefully disconnect
in that case.
The connection multiplexer can expect to not be scheduled at least until
every $forks worker processes has attempted a connection, so the backlog
must be able to hold every worker.
* Always enable the faulthandler module in the top-level process if it
is available.
* Make MITOGEN_DUMP_THREAD_STACKS interval configurable, to better
handle larger runs.
* Add docs subsection on diagnosing hangs.
Conflicts:
ansible_mitogen/process.py
Without this, an invocation like:
sudo ansible-playbook foo.yml
Where foo.yml uses setns, could inherit the HOME environment variable
from the external non-root user, which broke /usr/bin/mysql_upgrade and
plenty more.
There were two problems with detection and handling of class methods as call targets in Python 3:
* Methods no longer define `im_self` -- this is now only `__self__`
* The `types` module no longer defines a `ClassType`
The universally-compatible (v2.6+) solution was to switch to using the `inspect` module -- whose interface has been stable -- and to checking the method attribute `__self__`.
(It doesn't hurt that `inspect` checks are more brief and we now no longer need the `types` module here.)
Unclear whether or not this is a hack, or whether it should be the
default for more connection methods. When enabled, the exception text
thrown when bootstrap fails includes the stderr text, which is
apparently always useful.
Since BasicStream.close() invokes _stop_transmit() followed by
os.close(), and KqueuePoller._stop_transmit() defers the unsubscription
until the IO loop resumes, kqueue generates an error event for the
associated FD, even though the changelist includes an unsubscription
command for the FD.
We could fix this by deferring close() until after the IO loop has run
once (simply by calling .defer()), but that generates extra wakeups for
no real reason.
Instead simply notice the error event and log it, rather than treating
it as a legitimate event.
Another approach to fixing this would be to process
_stop_receive()/_stop_transmit() eagerly, however that entails making
more syscalls.
Closes#320.
requests/packages.py just imports urllib3 normally, then makes up new
names for it. pkgutil can't cope with that, and returns the loader
(builtin) for the requests package. The built-in loader obviously can't
find_module() for "requests/packages/urllib3/contrib/pyopenssl" because
it doesn't exist on disk.
If thread A is about to wake as thread B is about to sleep, and A loses
the GIL at an inopportune moment, it was possible for two latches to
share the same socketpair, causing wakeups routed to the wrong latch.
The pair was returned to the 'idle sockets' list before .recv() had been
called. This manifested as TimeoutError() thrown rarely with many active
threads and the host is heavily loaded (such as Travis CI).
Add more documentation and stop writing single wake bytes. Instead the
recipient's identity is written instead, making it simpler to detect
future bugs.
On 2.7 it was "accidentally fine" because the buffer object the StringIO
was initialized from happened to look like ASCII, but in 2.6 either
UCS-2 or UCS-4 is used for that buffer, and so the result was junk.
Just use the io module everywhere if we can, falling back to pure-Python
StringIO for Python<2.6.
Previously it was possible for a thread to call Waker.defer() after
Broker has torns its Waker down, and the underlying file descriptor
reallocated by the OS to some other component.
This manifested as latches of a subsequent test invocation receiving the
waker byte (' ') rather than their expected byte '\x7f'.
This doesn't fix the problem, it just significantly reduces the chance
of it occurring. In future Side.write()/read()/close() must be
synchronized with a lock.
Previously the problem could be reliably triggered with:
while :; do
python tests/call_function_test.py -vf CallFunctionTest.{test_aborted_on_local_broker_shutdown,test_aborted_on_local_context_disconnect}
done
e81b3bd0652b5eb125eb224ceca281b9d540dd5e
The whitelist check must happen /after/ the other checks, otherwise we
unconditionally retunr self for crap like 'ansible.module_utils.json'.
looks like this was just as broken on 2.x, and suddenly we're
finding a bunch more legit Django deps. It seems anywhere
absolute_import appeared in 2.x, we skipped some imports.
The old hack on the master side we had is broken for some reason on 3.x.
Instead tweak the client to be more selective: if a request is for a
module within a package, the package must be loaded (in sys.modules),
and its __loader__ must be us. Previously if the module didn't exist in
sys.modules, we'd still try to fetch from the master, which doesn't
appear to ever make sense.
* ansible: use unicode_literals everywhere since it only needs to be
compatible back to 2.6.
* compat/collections.py: delete this entirely and rip out the parts of
functools that require it.
* Introduce serializable Kwargs dict subclass that translates keys to
Unicode on instantiation.
* enable_debug_logging() must set _v/_vv globals.
* cStringIO does not exist in 3.x.
* Treat IOLogger and LogForwarder input as latin-1.
* Avoid ResourceWarnings in first stage by explicitly closing fps.
* Fix preamble_size.py syntax errors.
This appears to be harmless, except for Python 2.6 on Linux/Travis,
where for some reason (some stdlib change?) simply opening the TTY is
insufficient.
The 'versioner.c' dodging check added in 0ef23d86 was wrong, since the
check occurred on the host machine, when the fix actually needs to apply
to the Darwin target.
Fixes ability to target OS X from a Red Hat controller, manifesting as
an error like:
D mitogen: mitogen.parent.TtyLogStream('local.2472'): 'python(mitogen:dmw@localhost.localdomain:2449): realpath couldn\'t resolve "/usr/bin/python(mitogen:dmw@localhost.localdomain:2449)"'
The "realpath couldn't resolve" error comes from versioner.c:
https://opensource.apple.com/source/perl/perl-104/versioner/versioner.c
It's possible for a message to arrive after .add_handler() but before
Latch construction.
This is papering over a bigger problem with service pool instantiation.
https://travis-ci.org/dw/mitogen/jobs/390409832#L2901
TASK [Spin up a few interpreters] **********************************************
changed: [target] => (item=1)
ERROR! [pid 5355] 14:47:50.224945 E mitogen.ctx.ssh.localhost:2201.sudo.mitogen__user2: mitogen: Router(Broker(0x7f1e93911450))._invoke(Message(19100, 19095, 19095, 110, 1005, '\x80\x02U\x1fmitogen.service.PushFileServiceq\x01U\x11store_and_f'..8955)): <bound method Receiver._on_receive of Receiver(Router(Broker(0x7f1e93911450)), 110)> crashed
Traceback (most recent call last):
File "<stdin>", line 1471, in _invoke
File "<stdin>", line 491, in _on_receive
AttributeError: 'Receiver' object has no attribute '_latch'
If PushService.store_and_forward() loses the race to arrive at a brand
new context first, and the context's main thread is already executing a
CALL_FUNCTION that is blocked on the result of PushService, deadlock
could occur in the old scheme.
Instead (for now) simply spam a thread for each incoming message, and
use the get_or_create_pool() lock to ensure things work out in the end.
This could potentially generate a huge number of threads given the wrong
app, but we'll fix that problem when it appears.
The very first task /must/ be clearing out logging locks, since
_at_fork() functions call LOG.debug() via Side.close(). Additionally,
the root logger is not included in loggerDict, so we must specify it
explicitly.
This is likely to break something, it was definitely needed at some
point, but I never put much effort into figuring out why. Meanwhile,
Python appears to make find_module('ansible.module_utils.facts.')
requests in some circumstances, which causes us to indicate the module
exists while this hack exists.
So remove it, and let's see what breaks.
For lack of a better place to keep the client function, make it a
classmethod of FileService itself for now.
The old _get_file() is removed in a subsequent commit.
This is like FileService but blocks until the file is pushed by a parent
context, with deduplicating behaviour at each level in the hierarchy. It
does not stream large files, so it is only suitable for small files like
Python modules.
Additionally add SerializedInvoker for use with PushFileService, which
ensures all method calls to a single service occur in sequence.
It's easy to call msg.reply() by accident on a message that never had
reply_to set, resulting in a "invalid handle" error message coming from
router. Instead log a more accurate message on the stack that actualy
caused the problem.
With epoll() there is only one kernel-side object per file descriptor,
which is why _control() is such a pain. Since we merge receive/transmit
watching into that single object, we must always test the mask for both
conditions when reading results.
Kqueue isn't/doesn't appear to be like this. The identity of a Kqueue
event is keyed on (fd, filter), and we register a separate event for
both transmit and receive, so the 'elif' in KqueuePoller.poll() does not
appear to need to change.
Previously, a FD marked for read+write would not indicate writeability
until it was no longer readable.
To support detach, we must be able to preload the target with every
module it will need prior to detachment. This implements the
intermediary part of the process (i.e. the Ansible fork parent) --
receiving LOAD_MODULE/FORWARD_MODULE pairs and ensuring they reach the
child.
Bootstrap would hang if (as of writing) a pipe sufficient to hold 42,006
bytes was not handed out by the kernel to the first stage. It was luck
that this didn't manifest before, as first stage could write the full
source and exit completely before reading begun.
It is not clear under which circumstances this could previously occur,
but at least since Linux 4.5, it can be triggered if
/proc/sys/fs/pipe-max-size is reduced from the default of 1MiB, which
can have the effect of capping the default pipe buffer size of 64KiB to
something lower.
Suspicion is that buffer pipe size could also be reduced under memory
pressure, as reference to busy machines appeared a few times in the bug
report.
Ideally it would be possible to specify a callback function, but this is
not possible for proxied connections. So simply provide the 3 most
useful modes, defaulting to the most secure.
Closes#127. Closes#134.