If thread A is about to wake as thread B is about to sleep, and A loses
the GIL at an inopportune moment, it was possible for two latches to
share the same socketpair, causing wakeups routed to the wrong latch.
The pair was returned to the 'idle sockets' list before .recv() had been
called. This manifested as TimeoutError() thrown rarely with many active
threads and the host is heavily loaded (such as Travis CI).
Add more documentation and stop writing single wake bytes. Instead the
recipient's identity is written instead, making it simpler to detect
future bugs.
Previously it was possible for a thread to call Waker.defer() after
Broker has torns its Waker down, and the underlying file descriptor
reallocated by the OS to some other component.
This manifested as latches of a subsequent test invocation receiving the
waker byte (' ') rather than their expected byte '\x7f'.
This doesn't fix the problem, it just significantly reduces the chance
of it occurring. In future Side.write()/read()/close() must be
synchronized with a lock.
Previously the problem could be reliably triggered with:
while :; do
python tests/call_function_test.py -vf CallFunctionTest.{test_aborted_on_local_broker_shutdown,test_aborted_on_local_context_disconnect}
done
e81b3bd0652b5eb125eb224ceca281b9d540dd5e
The whitelist check must happen /after/ the other checks, otherwise we
unconditionally retunr self for crap like 'ansible.module_utils.json'.
The old hack on the master side we had is broken for some reason on 3.x.
Instead tweak the client to be more selective: if a request is for a
module within a package, the package must be loaded (in sys.modules),
and its __loader__ must be us. Previously if the module didn't exist in
sys.modules, we'd still try to fetch from the master, which doesn't
appear to ever make sense.
* ansible: use unicode_literals everywhere since it only needs to be
compatible back to 2.6.
* compat/collections.py: delete this entirely and rip out the parts of
functools that require it.
* Introduce serializable Kwargs dict subclass that translates keys to
Unicode on instantiation.
* enable_debug_logging() must set _v/_vv globals.
* cStringIO does not exist in 3.x.
* Treat IOLogger and LogForwarder input as latin-1.
* Avoid ResourceWarnings in first stage by explicitly closing fps.
* Fix preamble_size.py syntax errors.
It's possible for a message to arrive after .add_handler() but before
Latch construction.
This is papering over a bigger problem with service pool instantiation.
https://travis-ci.org/dw/mitogen/jobs/390409832#L2901
TASK [Spin up a few interpreters] **********************************************
changed: [target] => (item=1)
ERROR! [pid 5355] 14:47:50.224945 E mitogen.ctx.ssh.localhost:2201.sudo.mitogen__user2: mitogen: Router(Broker(0x7f1e93911450))._invoke(Message(19100, 19095, 19095, 110, 1005, '\x80\x02U\x1fmitogen.service.PushFileServiceq\x01U\x11store_and_f'..8955)): <bound method Receiver._on_receive of Receiver(Router(Broker(0x7f1e93911450)), 110)> crashed
Traceback (most recent call last):
File "<stdin>", line 1471, in _invoke
File "<stdin>", line 491, in _on_receive
AttributeError: 'Receiver' object has no attribute '_latch'
If PushService.store_and_forward() loses the race to arrive at a brand
new context first, and the context's main thread is already executing a
CALL_FUNCTION that is blocked on the result of PushService, deadlock
could occur in the old scheme.
Instead (for now) simply spam a thread for each incoming message, and
use the get_or_create_pool() lock to ensure things work out in the end.
This could potentially generate a huge number of threads given the wrong
app, but we'll fix that problem when it appears.
This is likely to break something, it was definitely needed at some
point, but I never put much effort into figuring out why. Meanwhile,
Python appears to make find_module('ansible.module_utils.facts.')
requests in some circumstances, which causes us to indicate the module
exists while this hack exists.
So remove it, and let's see what breaks.
It's easy to call msg.reply() by accident on a message that never had
reply_to set, resulting in a "invalid handle" error message coming from
router. Instead log a more accurate message on the stack that actualy
caused the problem.
To support detach, we must be able to preload the target with every
module it will need prior to detachment. This implements the
intermediary part of the process (i.e. the Ansible fork parent) --
receiving LOAD_MODULE/FORWARD_MODULE pairs and ensuring they reach the
child.
Bootstrap would hang if (as of writing) a pipe sufficient to hold 42,006
bytes was not handed out by the kernel to the first stage. It was luck
that this didn't manifest before, as first stage could write the full
source and exit completely before reading begun.
It is not clear under which circumstances this could previously occur,
but at least since Linux 4.5, it can be triggered if
/proc/sys/fs/pipe-max-size is reduced from the default of 1MiB, which
can have the effect of capping the default pipe buffer size of 64KiB to
something lower.
Suspicion is that buffer pipe size could also be reduced under memory
pressure, as reference to busy machines appeared a few times in the bug
report.
In Ansible, depending on when CTRL+C is triggered, if it occurs after
the connection multiplexer process has forked, and after it has in turn
forked the "connection: local" context and its corresponding "clean fork
parent", since all the broker processes still belong to Ansible's
terminal foreground process group, they are all capable of receiving
SIGINT in response to CTRL+C being pressed on that terminal.
This papers over the problem. Really we want those KeyboardInterrupts to
be logged, to call setsid() frmo the connection multiplexer process to
isolate it from the terminal foreground process group. That way its only
indication of top-level process shutdown is using the graceful
disconnect mechanism that already exists in process.py::worker_main().
_sockets only refers to the idle sockets list, it doesn't refer to every
socket currently in use by a Latch, for example, the 2*16 used by e.g.
Ansible's sleeping service pool.
The GIL could be lost between the check for an empty list and popping a
socket off the list. Previously _tls_init (per its name) used per-thread
storage, hence the bug.
machinectl does not support any sensible form of pipe to the child
process, so it is necessary to bypass it when talking to a systemd
container (see systemd/systemd#8850).
This can also form the basis for issue #223, where the post-fork
namespace switching dance required to connect to the Pythonless
container will be the same.
This change blocks off 2 common scenarios where a race condition is
upgraded to a hang, when the library could internally do better.
* Since we don't know whether the receiver of a `reply_to` is expecting
a raw or pickled message, and since in the case of a raw reply, there
is no way to signal "dead" to the receiver, override the reply_to
field to explicitly mark a message as dead using a special handle.
This replaces the serialized _DEAD sentinel value with a slightly
neater interface, in the form of the reserved IS_DEAD handle, and
enables an important subsequent change: when a context cannot route a
message, it can send a generic 'dead' reply back towards the message
source, ensuring any sleeping thread is woken with ChannelError.
The use of this field could potentially be extended later on if
additional flags are needed, but for now this seems to suffice.
* Teach Router._invoke() to reply with a dead message when it receives a
message for an invalid local handle.
* Teach Router._async_route() to reply with a dead message when it
receives an unroutable message.
Lots of "invalid handle: ..., 102" messages started appearing during
exit recently because ordering changed slightly, and local handles were
sent _DEAD even though the broker loop was still progressing through
shutdown.
The "shutdown" event is too early to close handles: it is the start of
the grace period where streams and downstream contexts can finish up any
work and deliver buffered data, including FORWARD_LOG messages that
haven't arrived yet.
So instead,
- move the _DEAD logic to the "exit" event,
- get rid of Context.on_shutdown() entirely, it's been unused for over
a month,
- get rid of the "crash" event, since it always fires prior to "exit",
and its only use was to send _DEAD to local handles, which now happens
during exit anyway.
These are generated by any child calling .context_by_id() without
previously knowing the context's name, such as Contexts set in
ExternalContext.master and ExternalContext.parent.
When a stream (such as unix.connect()) has its auth_id set to the
current context's, we should allow those requests too, since the request
is working with the privilege of the current context.
Move defer handling out of Broker and into Waker (where it belongs?).
Now the lock must only be taken if Waker was actually woken.
Knocks 400-item run_hostname_100_times from 10.62s to 10.05s (-5.3%).
14:50:04 E mitogen: mitogen.fork.Stream('fork.7431') crashed
Traceback (most recent call last):
File "/home/dmw/src/mitogen/mitogen/core.py", line 1287, in _call
func(self)
File "/home/dmw/src/mitogen/mitogen/core.py", line 758, in on_receive
return self.on_disconnect(broker)
File "/home/dmw/src/mitogen/mitogen/parent.py", line 370, in on_disconnect
super(Stream, self).on_disconnect(broker)
File "/home/dmw/src/mitogen/mitogen/core.py", line 721, in on_disconnect
fire(self, 'disconnect')
File "/home/dmw/src/mitogen/mitogen/core.py", line 162, in fire
return [func(*args, **kwargs) for func in signals.get(name, ())]
File "/home/dmw/src/mitogen/mitogen/core.py", line 1160, in <lambda>
listen(stream, 'disconnect', lambda: self.on_stream_disconnect(stream))
File "/home/dmw/src/mitogen/mitogen/core.py", line 1142, in on_stream_disconnect
for context in self._context_by_id.itervalues():
RuntimeError: dictionary changed size during iteration
TLS destructors are not called after fork, therefore we must explicitly
track a global list of free file descriptors, and arrange for that list
to explicitly be destroyed from fork.py.
nested_test was failing due to the recent change to centralize
O_CLOEXEC, since stdout and stderr were being marked as non-inheritable.
That meant child processes would start with no stdout/stderr, triggering
a race between Waker opening its pipes, and IoLogger dup2'ing its pipes
over the stdio handles.
Since the stdio handles were closed, Waker would receive one of them as
one end of its pipe, and consequently have it overwritten by IoLogger.
When IoLogger dups over the top of fd 2, it becomes possible for
Waker.on_read() to be called due to pipe's other end to be closed,
causing an OSError exception with errno EAGAIN to appear.
This eliminates Context.on_disconnect() and instead moves its
functionality to a signal wired up by ExternalContext.main().
It leaves mitogen.master.Context is in a better condition to move into
mitogen.parent where it belongs.
* Split setup_globals() from setup_package() and make package setup
optional (fork never needs it -- synthetic package already exists in
children and the real package exists in masters).
* Add main() parameter to allow passing in the existing Importer
instance. In forks from children, this means we inherit all the cached
module state along with the __loader__ used to import any existing
modules.
Not sure why this wasn't done before, seems it should have always been
this way, and can't see any reason it wasn't. Without it, many fds are
leaked into at least .local() children. Closes#163.
* IDs are allocated by the parent responsible for contructing a new
child, using ALLOCATE_ID to the master as necessary to allocate new ID
ranges.
* ADD_ROUTE is sent up the tree rather than down. This permits
construction of the new context to complete concurrent to parent
contexts learning about its existence. Since all streams are strictly
ordered, it's not possible for any parent to observe messages from the
new context prior to arrival of an ADD_ROUTE from the parent notifying
of its existence.
If the new context, for example, implements an Ansible async task, its
parent can start executing that without waiting for any synchronous
confirmation from any parent or the master.
* Since routes propagate up, it's no longer possible for a plain
non-parent child to ever receive ADD_ROUTE, so that code can be moved
out of core.py and into parent.py (-0.2kb compressed).
* Add a .routes attribute to parent.Stream, and respond to disconnection
signal on the stream by propagating DEL_ROUTE for any ADD_ROUTE ever
received from that stream.
* Centralize route management in a new parent.RouteMonitor class
* Don't need to sleep if queue>sleepers, can just pop the right queue
element and return it.
* If queue>sleeping and waking==sleeping, no mechanism existed to ensure
a thread newly added to sleeping would ever be woken. Above change
fixes that.
* Cannot trust select() return value, scheduler might sleep us
indefinitely while put() writes a byte.
* Sleeping threads didn't pop FIFO, they popped in whatever order
scheduler woke them up. Must recover index and use it to pick the pop
index.
- If latch.get() is called and the queue is empty, a thread is put to
sleep.
- If Latch.put() from another thread then appends an item to the queue and
wakes the sleeping thread, and
- If a subsequent Latch.put() from the same or another thread manages to
acquire `lock` before the sleeping thread is scheduled,
- The sleeping thread's wake socket would have multiple bytes written to
it.
Therefore create a new _pending variable to track the only item assigned
to each thread (keyed by its write socket), and remove the socket from
`sleeping` from within put.
Implication: the entire message remains buffered until its last byte is
transmitted. Not wasting time on it, as there are pieces of work like
issue #6 that might invalidate these problems on the transmit path
entirely.
Rather than slowly build up a Python string over time, we just store a
deque of chunks (which, in a later commit, will now be around 128KB
each), and track the total buffer size in a separate integer.
The tricky loop is there to ensure the header does not need to be sliced
off the full message (which may be huge, causing yet another spike and
copy), but rather only off the much smaller first 128kb-sized chunk
received.
There is one more problem with this code: the ''.join() causes RAM usage
to temporarily double, but that was true of the old solution too. Shall
wait for bug reports before fixing this, as it gets very ugly very fast.
There is no penalty for just passing as much data to the OS as possible,
it is not copied, and for a non-blocking socket, the OS will just keep
buffer as much as it can and tell us how much that was.
Also avoids a rather pointless string slice.
Reduces the number of IO loop iterations required to receive large
messages at a small cost to RAM usage.
Note that when calling read() with a large buffer value like this,
Python must zero-allocate that much RAM. In other words, for even a
single byte received, 128kb of RAM might need to be written.
Consequently CHUNK_SIZE is quite a sensitive value and this might need
further tuning.
There is some insane unidentifiable Mitogen context (the local context?)
that instantly crashes with a higher forks setting. It appears to be
harmless, but meanwhile this naturally shouldn't be happening.
This actually addresses multiple problems:
* Single-file programs were broken, since the fix introduced in
6931cc10c4 caused builtin_find_module()
to start indicating __main__ can always be loaded locally. That's
broken, and there might be more cases where the same problem will crop
up.
Since it was indicated __main__ could be loaded locally, the built-in
import machinery was allowed to attempt that (since we remove __main__
from sys.modules during bootstrap), which caused a safety check to
fire in the bowels of Python:
"Cannot re-init internal module %.200s"
* The check for presence of the whitelist was totally broken, since the
whitelist is never an empty list. Therefore 'self' was being returned
for every module, including extension modules like 'termios'.
I have hand-verified this does not break the fix for issue #113. I
looked at writing a test for that, but it requires a Docker container
(or similar) with an ancient version of Ansible installed. Will open a
separate ticket tracking this.
Might want to de-overload the meaning of whitelist in future, but in
the meantime it works fine for Ansible and I can't think of a
whitelisting use case that would break because of it.
Closes#114.
Amazed this one managed to scrape through for so long. Calling
__import__ from within find_module() was causing the target module, in
this case cookielib, to be loaded *then overwritten* by a subsequent
duplicate load higher in the stack.
The result is that cookielib was loaded twice, and, per usual Python
import semantics, a reference to the partially initialized first
cookielib was installed in sys.modules while its code executed.
At the end of cookielib on 2.x, it imports _LWPCookieJar, which in turn
imports the partially built cookielib from sys.modules, then subclasses
the CookieJar from /that/ module.
Everything is wonderful. Then the call returns back up into the import
mechanism which restarts the entire process -- only this time,
_LWPCookieJar is /not/ reinitialized, so the copy in sys.modules is
still left with types pointing at the old module!
So the duplicate import creates a new CookieJar which is not the base
class of LWPCookieJar. Tada! 3 hours debugging.
This is probably a performance fix in disguise, didn't realize things
were so broken. It may also be a regression elsewhere. Urgently need to
finish the tests.
For the 52 submodules of ansible.modules.system, this produced a 1602
byte pkg_present list. After stripping it becomes 406 bytes, and the
entire LOAD_MODULE size drops from 1988 bytes to 792 bytes (-60%).
For the 68 submodules of ansible.module_utils, 1902 bytes pkg_present
becomes 474 bytes (-75%), and LOAD_MODULE size drops from 2867 bytes to
1439 bytes (-49%).
In a simple test running Ansible's "setup" module followed by its "apt"
module, wire bytes sent drops from 140,357 to 135,531 (-3.4%).
It looks ugly as sin, but this nets about a 20% drop in user CPU time,
and close to 15% increase in throughput.
The average log call is around 10 opcodes, prefixing with '_v and' costs
an extra 2, but both are simple operations, and the remaining 10 are
skipped entirely when _v or _vv are False.
Turns out it is far too easy to burn through available file descriptors,
so try something else: self-pipes are per thread, and only temporarily
associated with a Lack that wishes to sleep.
Reduce pointless locking by giving Latch its own queue, and removing
Queue.Queue() use in some places.
Temporarily undo merging of of Waker and Latch, let's do this one step
at a time.
On Python 2.x, operations on pthread objects with a timeout set actually
cause internal polling. When polling fails to yield a positive result,
it quickly backs off to a 50ms loop, which results in a huge amount of
latency throughout.
Instead, give up using Queue.Queue.get(timeout=...) and replace it with
the UNIX self-pipe trick. Knocks another 45% off my.yml in the Ansible
examples directory against a local VM.
This has the potential to burn a *lot* of file descriptors, but hell,
it's not the 1940s any more, RAM is all but infinite. I can live with
that.
This gets things down to around 75ms per playbook step, still hunting
for additional sources of latency.
Fix a MyPy warning by only passing lists to select.select(). At least on
Python 2.x, select.select() was internally converting the sets to lists
anyway.
By the time lists become inefficient here, it is likely that
select.select() itself will also be inefficient, and need replaced with
.poll() or similar.
No discernible performance different when transferring django.db.models
to a local VM.
* SIGTERM safety net prevents profiler from writing results, so disable
it when profiling is active.
* fix warning corrupting stream when profiling=True
Previously we'd send just None in GET_MODULE reply, but now since there
is no single request-reply structure, we must include the fullname in
the LOAD_MODULE response and make all of its data fields None to
indicate the same.
* Don't implement the rules for when preloading occurs yet
* Don't attempt to streamily preload modules downstream while this
context hasn't yet received the final module. There is quite
significant latency buried in here, but for now it's a lot of work to
fix.
This works well enough to handle at least the mitogen package, but it's
likely broken for anything bigger.
It seems gevent automatically sets blocking behaviour on fds produced by
the socket module, which causes the Python process we fork to fail
horribly. So in the child, always reset the blocking flag.
Although these are synonyms in Python 2.x, when using MyPy to typecheck
code use of file() causes spurious errors.
This commit also serves as one small step to Python 3.x compatibility,
since 3.x removes the file() builtin.
Without this, it's possible for Waker to be start_received() after the
shutdown signal has already been sent, resulting in 5 second delay
during shutdown.
Additionally mask EBADF during os.write() to waker's write side.
Necessary since nothing synchronizes writer threads from the broker
thread during shutdown. Could be done with a lock instead, but this is
cheaper.
Now there is a separate SHUTDOWN message that relies only on being
received by the broker thread, the main thread can be hung horribly and
the process will still eventually receive a SIGTERM.
If no ADD_ROUTE message has been received from the master associating a
stream with a particular context ID, then it is expected messages
originating from that context ID can only be routed via the parent.
This probably worsens performance in the common case, but it prevents
runaway producers (see e.g. issue #36) from spending all their CPU
copying around huge strings.
It's also a small step towards a solution to issue #6, which will
replace the output buffer with some sort of fancier queue anyway.
This reduces a particular 40 second run of rsync to 1.5 seconds.
The last time I tested set_nonblock() as a fix for the rsync hang, I
used F_SETFD rather than F_SETFL, which resulted in no error, but also
did not set O_NONBLOCK. Turns out missing O_NONBLOCK was the problem.
The rsync hang was due to every context blocking in os.write() waiting
for either a parent or child buffer to empty, which was exacerbated by
rsync's own pipelining, that allows writes from both sides to proceed
even while reads aren't progressing. The hang was due to os.write() on a
blocking fd blocking until buffer space is available to complete the
write. Partial writes are only supported when O_NONBLOCK is enabled.
* Support passing Context() objects in function calls and return values.
Now the fakessh demo from the documentation index would work
correctly.
* Since slaves can communicate with each other now, they should also use
the same approach to unpickling as the master already used. Collapse
away all the unpickle extension crap and hard-wire just the 3 types
that support unpickling.