It knows how to dispatch messages from multiple receivers (associated
with multiple services) to multiple threads, where the service
implementation is invoked on the message.
It wakes a maximum of one thread per received message.
It knows how to shut down gracefully.
Implication: due to the latch use, there are 2 file descriptors burned
for every thread. We don't need interruptibility here, so in future, it
might be nice to allow swapping a different queueing primitive into
Select (maybe a subclass?) just for this case.
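For illustration, the rough shape being described, as a sketch rather
than the real mitogen.service.Pool (names, the Queue.Queue stand-in for
the Select/Latch pair, and the shutdown sentinel are all assumptions):

    import threading
    try:
        import queue            # Python 3
    except ImportError:
        import Queue as queue   # Python 2

    class PoolSketch(object):
        """One shared queue of (service, message) pairs, a fixed set of
        worker threads, and one sentinel per thread for graceful
        shutdown. queue.get() wakes at most one waiting thread per
        item."""
        _SHUTDOWN = object()

        def __init__(self, size=4):
            self._queue = queue.Queue()
            self._threads = [threading.Thread(target=self._worker)
                             for _ in range(size)]
            for t in self._threads:
                t.start()

        def dispatch(self, service, msg):
            self._queue.put((service, msg))

        def _worker(self):
            while True:
                item = self._queue.get()
                if item is self._SHUTDOWN:
                    return
                service, msg = item
                service.on_message(msg)

        def stop(self):
            for _ in self._threads:
                self._queue.put(self._SHUTDOWN)  # wake each worker once
            for t in self._threads:
                t.join()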
Implication: the entire message remains buffered until its last byte is
transmitted. Not spending time on it now, as there are pieces of work
like issue #6 that may eliminate these problems on the transmit path
entirely.
Rather than slowly build up a Python string over time, we just store a
deque of chunks (which, as of a later commit, are around 128KB each),
and track the total buffer size in a separate integer.
The tricky loop is there to ensure the header does not need to be sliced
off the full message (which may be huge, causing yet another spike and
copy), but rather only off the much smaller first 128KB-sized chunk
received.
There is one more problem with this code: the ''.join() causes RAM usage
to temporarily double, but that was true of the old solution too. Shall
wait for bug reports before fixing this, as it gets very ugly very fast.
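A sketch of that receive-side scheme (the header format and names here
are assumptions, not the real wire protocol):

    import struct
    from collections import deque

    HEADER_FMT = '>L'                  # assumed: 4-byte length prefix
    HEADER_LEN = struct.calcsize(HEADER_FMT)

    class RxBufferSketch(object):
        def __init__(self):
            self._chunks = deque()     # byte-strings, ~128KB each
            self._size = 0             # running total; never sum the deque

        def feed(self, data):
            self._chunks.append(data)
            self._size += len(data)

        def take_message(self):
            if self._size < HEADER_LEN:
                return None
            head = self._chunks.popleft()
            while len(head) < HEADER_LEN:       # the "tricky loop": grow
                head += self._chunks.popleft()  # only the small head chunk
            self._chunks.appendleft(head)
            msg_len, = struct.unpack(HEADER_FMT, head[:HEADER_LEN])
            if self._size < HEADER_LEN + msg_len:
                return None                     # wait for more chunks
            buf = b''.join(self._chunks)        # RAM briefly doubles here
            rest = buf[HEADER_LEN + msg_len:]
            self._chunks.clear()
            if rest:
                self._chunks.append(rest)
            self._size = len(rest)
            return buf[HEADER_LEN:HEADER_LEN + msg_len]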
There is no penalty for just passing as much data to the OS as possible:
it is not copied, and for a non-blocking socket, the OS will simply
buffer as much as it can and tell us how much that was.
Also avoids a rather pointless string slice.
Reduces the number of IO loop iterations required to receive large
messages at a small cost to RAM usage.
Note that when calling read() with a large buffer value like this,
Python must zero-allocate that much RAM. In other words, for even a
single byte received, 128KB of RAM might need to be written.
Consequently CHUNK_SIZE is quite a sensitive value and this might need
further tuning.
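Concretely, the receive step amounts to something like this (value and
names assumed):

    import os

    CHUNK_SIZE = 131072   # ~128KB; sensitive, may need tuning

    def on_receive(fd, rx):
        # os.read() allocates the full CHUNK_SIZE buffer up front, even
        # if the kernel delivers only a single byte.
        data = os.read(fd, CHUNK_SIZE)
        if data:
            rx.feed(data)     # e.g. the RxBufferSketch above
        return data           # empty string means disconnect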
accept() (per interface) returns a non-blocking socket because the
listener socket is in non-blocking mode, so it is pure scheduling luck
that a connecting-in child has a chance to write anything for the
top-level process to read during the subsequent .recv().
A higher forks setting in ansible.cfg was enough to cause our luck to
run out, causing the .recv() to crash with EAGAIN, and the multiplexer
to respond to the handler's crash by calling its disconnect method. This
is why some reports mentioned ECONNREFUSED -- the listener really was
gone, because its Stream class had crashed.
Meanwhile since the window where we're waiting for the remote process to
identify itself is tiny, simply flip off O_NONBLOCK for the duration of
the connection handshake. Stream.accept() (via Side.__init__) will
re-enable O_NONBLOCK for the descriptors it duplicates, so we don't even
need to restore the flag ourselves.
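In code, the workaround is roughly (names assumed):

    import fcntl
    import os

    def on_accept(listener):
        sock, _ = listener.accept()   # inherits O_NONBLOCK from listener
        fd = sock.fileno()
        flags = fcntl.fcntl(fd, fcntl.F_GETFL)
        fcntl.fcntl(fd, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)
        ident = sock.recv(64)  # blocks briefly instead of raising EAGAIN
        # Side.__init__ re-enables O_NONBLOCK on the duplicated
        # descriptors afterwards.
        return sock, ident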
A better solution entails splitting Stream up into a state machine and
doing the handshake with non-blocking IO, but that isn't going to be
available until asynchronous connect is implemented. Meanwhile in
reality this solution is probably 100% fine.
There is some insane unidentifiable Mitogen context (the local context?)
that instantly crashes with a higher forks setting. It appears to be
harmless, but meanwhile this naturally shouldn't be happening.
When a Broker() is running with install_watcher=True, arrange for only
one watcher thread to exist for each target thread, and to reset the
mapping of watchers to targets after process fork.
This is probably the last change I want to make to the watcher feature
before deciding to rip it out; it may be more trouble than it is worth.
SSH command size: 439 (+4 bytes)
Preamble size: 8941 (no change)
This _increases_ the size of the first stage, but
- Eliminates one of the two remaining uses of `sys`
- Reads the preamble as a byte-string, so no call to `.encode()`
is needed on Python 3 before calling `_()`
SSH command size: 435 (-4 bytes)
Preamble size: 8962 (no change)
os.execl is the same as os.execv, but it takes a variable number of
arguments instead of a single sequence.
SSH command size: 448 (-5 bytes)
Preamble size: 8941 (no change)
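For reference, the two calls are equivalent apart from how the argument
vector is spelled (each replaces the current process, so only one line
would ever run):

    import os

    os.execv('/bin/echo', ['/bin/echo', 'hello'])   # single sequence
    os.execl('/bin/echo', '/bin/echo', 'hello')     # variadic arguments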
NB: The 'zip' alias was absent from Python 3.x until Python 3.4. This
change should be reverted if Python 3.0, 3.2, or 3.3 support is
required.
This actually addresses multiple problems:
* Single-file programs were broken, since the fix introduced in
6931cc10c4 caused builtin_find_module()
to start indicating __main__ can always be loaded locally. That's
broken, and there might be more cases where the same problem will crop
up.
Since it was indicated __main__ could be loaded locally, the built-in
import machinery was allowed to attempt that (since we remove __main__
from sys.modules during bootstrap), which caused a safety check to
fire in the bowels of Python:
"Cannot re-init internal module %.200s"
* The check for presence of the whitelist was totally broken, since the
whitelist is never an empty list. Therefore 'self' was being returned
for every module, including extension modules like 'termios'.
I have hand-verified this does not break the fix for issue #113. I
looked at writing a test for that, but it requires a Docker container
(or similar) with an ancient version of Ansible installed. Will open a
separate ticket tracking this.
Might want to de-overload the meaning of whitelist in future, but in
the meantime it works fine for Ansible and I can't think of a
whitelisting use case that would break because of it.
Closes #114.
Amazed this one managed to scrape through for so long. Calling
__import__ from within find_module() was causing the target module, in
this case cookielib, to be loaded *then overwritten* by a subsequent
duplicate load higher in the stack.
The result is that cookielib was loaded twice, and, per usual Python
import semantics, a reference to the partially initialized first
cookielib was installed in sys.modules while its code executed.
At the end of cookielib on 2.x, it imports _LWPCookieJar, which in turn
imports the partially built cookielib from sys.modules, then subclasses
the CookieJar from /that/ module.
Everything is wonderful. Then the call returns back up into the import
mechanism which restarts the entire process -- only this time,
_LWPCookieJar is /not/ reinitialized, so the copy in sys.modules is
still left with types pointing at the old module!
So the duplicate import creates a new CookieJar which is not the base
class of LWPCookieJar. Tada! 3 hours debugging.
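A self-contained sketch of the failure mode, using exec to stand in for
the import machinery (module names are illustrative):

    import sys
    import types

    parent_src = "class Base(object): pass"

    parent = types.ModuleType('parent')
    sys.modules['parent'] = parent
    exec(parent_src, parent.__dict__)           # first load

    child = types.ModuleType('child')
    sys.modules['child'] = child
    exec("import parent\n"
         "class Sub(parent.Base): pass", child.__dict__)

    exec(parent_src, parent.__dict__)           # duplicate load rebinds Base

    # child.Sub still inherits from the first Base, so the check against
    # the freshly rebound parent.Base now fails:
    print(issubclass(child.Sub, parent.Base))   # False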
This is probably a performance fix in disguise, didn't realize things
were so broken. It may also be a regression elsewhere. Urgently need to
finish the tests.
Found due to an LGTM warning about an unused loop variable (related). As
far as I can tell the callback was sending fullname multiple times. The
KeyError check was added because I found NestedTest failed:
mitogen.parent had mitogen as one of its related, and mitogen was not in
the cache.
Refs #61
Since the for loops don't contain any break statements, the StreamErrors
will always be raised when the loop completes without the method
returning.
See https://lgtm.com/rules/5980098/
Refs #61
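The pattern in question, sketched with assumed names:

    class StreamError(Exception):
        pass

    def stream_by_id(streams, context_id):
        for stream in streams:
            if stream.remote_id == context_id:
                return stream
        # Reached only when the loop finished without returning; with no
        # `break` in the body, a `for ... else:` clause here is
        # redundant.
        raise StreamError('no stream for context %r' % (context_id,))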
Python 2.4 does not support explicit relative imports. They were added
in Python 2.5, along with `from __future__ import absolute_import`.
On 2.x this will mean the import is first (implicitly) tried relative,
but on 3.x it will always be tried absolute.
Fixes #92
I took the liberty of renaming ModuleFinder.STDLIB_DIRS to
_STDLIB_PATHS, since it felt like an implementation detail that
shouldn't be baked into a public API and stdlib can also be imported
from e.g. a zip file.
I also changed it to a set to handle any duplicates.
Fixes #86
Using the same test as in 7af97c0365,
transmitted wire bytes drops from 135,531 to 133,071 (-1.81%), while
received drops from 21,073 to 14,775 (-30%).
Combined, both changes shave 13,914 bytes (-8.6%) off aggregate
bandwidth usage.
Make it configurable as compression hurts in some scenarios.
For the 52 submodules of ansible.modules.system, this produced a 1602
byte pkg_present list. After stripping it becomes 406 bytes, and the
entire LOAD_MODULE size drops from 1988 bytes to 792 bytes (-60%).
For the 68 submodules of ansible.module_utils, 1902 bytes pkg_present
becomes 474 bytes (-75%), and LOAD_MODULE size drops from 2867 bytes to
1439 bytes (-49%).
In a simple test running Ansible's "setup" module followed by its "apt"
module, wire bytes sent drops from 140,357 to 135,531 (-3.4%).
It looks ugly as sin, but this nets about a 20% drop in user CPU time,
and close to 15% increase in throughput.
The average log call is around 10 opcodes; prefixing it with `_v and`
costs an extra 2, but both are simple operations, and the remaining 10
are skipped entirely when _v or _vv are False.
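The pattern looks like this (flag wiring simplified; the real setup
differs):

    import logging

    LOG = logging.getLogger('mitogen')

    _v = False     # flipped True when verbose logging is enabled
    _vv = False    # flipped True when very-verbose logging is enabled

    def route(msg):
        # Two cheap opcodes (load + jump) when _vv is False; the
        # ~10-opcode LOG.debug() call never executes.
        _vv and LOG.debug('routing %r', msg)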
Turns out it is far too easy to burn through available file descriptors,
so try something else: self-pipes are per thread, and only temporarily
associated with a Latch that wishes to sleep.
Reduce pointless locking by giving Latch its own queue, and removing
Queue.Queue() use in some places.
Temporarily undo merging of Waker and Latch, let's do this one step
at a time.
On Python 2.x, operations on pthread objects with a timeout set actually
cause internal polling. When polling fails to yield a positive result,
it quickly backs off to a 50ms loop, which results in a huge amount of
latency throughout.
Instead, give up using Queue.Queue.get(timeout=...) and replace it with
the UNIX self-pipe trick. Knocks another 45% off my.yml in the Ansible
examples directory against a local VM.
This has the potential to burn a *lot* of file descriptors, but hell,
it's not the 1940s any more, RAM is all but infinite. I can live with
that.
This gets things down to around 75ms per playbook step, still hunting
for additional sources of latency.
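A minimal sketch of the trick (Mitogen's Latch carries a payload queue
and more; this shows only the sleep/wake path):

    import os
    import select

    class LatchSketch(object):
        def __init__(self):
            self._rfd, self._wfd = os.pipe()

        def put(self):
            os.write(self._wfd, b' ')    # wake the sleeping thread

        def wait(self, timeout=None):
            # Sleeps in the kernel until woken; no 50ms userspace
            # polling loop as with Queue.Queue.get(timeout=...) on 2.x.
            rfds, _, _ = select.select([self._rfd], [], [], timeout)
            if rfds:
                os.read(self._rfd, 1)
            return bool(rfds)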
Fix a MyPy warning by only passing lists to select.select(). At least on
Python 2.x, select.select() was internally converting the sets to lists
anyway.
By the time lists become inefficient here, it is likely that
select.select() itself will also be inefficient and need replacing with
.poll() or similar.
No discernible performance difference when transferring django.db.models
to a local VM.
* Children should never generate a request for a module that has already
been sent, however there are a variety of edge cases where, e.g.
asynchronous calls are made into unloaded modules in a set of
children, causing those children to request modules (and deps) in a
different order, which might break deduplication. So add a warning to
catch when this happens, so we can figure out how to handle it.
Meanwhile it's only a warning since in the worst case, this just adds
needless latency.
* Don't bother treating sent packages separately, there doesn't seem to
be any need for this (after docs are updated to match how preloading
actually works now).
Overwriting 'fullname' variable caused basically nonsensical filtering.
Result was including the module being searched in the list of
dependencies, which was causing ModuleResponder to send it early, which
was causing contexts to start importing the module before preloading of
dependencies had completed.
* SIGTERM safety net prevents profiler from writing results, so disable
it when profiling is active.
* Fix warning corrupting stream when profiling=True
Previously we'd send just None in GET_MODULE reply, but now since there
is no single request-reply structure, we must include the fullname in
the LOAD_MODULE response and make all of its data fields None to
indicate the same.
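In other words, the negative reply becomes something like this (field
layout assumed for illustration only):

    def negative_load_module(fullname):
        # (fullname, pkg_present, path, compressed_source, related):
        # keeping fullname lets the waiting importer match the reply;
        # all-None data fields signal the module cannot be sent.
        return (fullname, None, None, None, None)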
Doesn't yet implement the rules in the docs, but I think the doc rules
could maybe change to match this. Needs lots of cleanup work and
thorough testing, but this is a great start.
* Don't implement the rules for when preloading occurs yet
* Don't attempt to streamily preload modules downstream while this
context hasn't yet received the final module. There is quite
significant latency buried in here, but for now it's a lot of work to
fix.
This works well enough to handle at least the mitogen package, but it's
likely broken for anything bigger.
It seems gevent automatically sets blocking behaviour on fds produced by
the socket module, which causes the Python process we fork to fail
horribly. So in the child, always reset the blocking flag.
Although these are synonyms in Python 2.x, when using MyPy to typecheck
code use of file() causes spurious errors.
This commit also serves as one small step to Python 3.x compatibility,
since 3.x removes the file() builtin.
Since the above if block ends in a call to os.execv(), this block will
only ever run when the if condition was false. Hence putting it in an
else clause is unnecessary.
Without this, it's possible for Waker to have start_receive() called on
it after the shutdown signal has already been sent, resulting in a 5
second delay during shutdown.
Additionally mask EBADF during os.write() to waker's write side.
Necessary since nothing synchronizes writer threads with the broker
thread during shutdown. Could be done with a lock instead, but this is
cheaper.
Can't figure out what it's supposed to do any more, and can't find a
version of Ansible before August 2016 (when I wrote that code) that
seems to need it.
Add some more mitigations to avoid sending dylibs.
Now that there is a separate SHUTDOWN message that relies only on being
received by the broker thread, the main thread can be hung horribly and
the process will still eventually receive a SIGTERM.
If no ADD_ROUTE message has been received from the master associating a
stream with a particular context ID, then it is expected that messages
originating from that context ID can only be routed via the parent.
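Roughly (attribute names assumed):

    def stream_for(self, context_id):
        try:
            # Populated only by ADD_ROUTE messages from the master.
            return self._stream_by_id[context_id]
        except KeyError:
            # No route learned: traffic for this context can only
            # legally flow via the parent stream.
            return self._parent_stream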
This version is based on the modulefinder standard library module,
pruned back just to handle modules we know have been loaded already, and
to scan module-level imports only, rather than imports occurring in
class and function scope (crappy heuristic, but assume they are lazy
imports).
The ast and compiler modules were far too slow, whereas this version can
bytecode-compile and scan all the imports for django.db.models (58
modules) in around 200ms, or 3.4ms per dependency; it's probably not
going to get much faster than that.
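The core of the approach, sketched for Python 3.4+ (the real scanner
differs in detail):

    import dis

    def scan_module_imports(source, path='<module>'):
        # Compile, then walk only the top-level code object. Imports
        # inside functions and classes live in nested code objects,
        # which we skip on the assumption they are lazy imports.
        code = compile(source, path, 'exec')
        for inst in dis.get_instructions(code):
            if inst.opname == 'IMPORT_NAME':
                yield inst.argval

    # list(scan_module_imports("import os\nfrom json import dumps"))
    # -> ['os', 'json']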
This probably worsens performance in the common case, but it prevents
runaway producers (see e.g. issue #36) from spending all their CPU
copying around huge strings.
It's also a small step towards a solution to issue #6, which will
replace the output buffer with some sort of fancier queue anyway.
This reduces a particular 40 second run of rsync to 1.5 seconds.
The last time I tested set_nonblock() as a fix for the rsync hang, I
used F_SETFD rather than F_SETFL, which resulted in no error, but also
did not set O_NONBLOCK. Turns out missing O_NONBLOCK was the problem.
The rsync hang was due to every context blocking in os.write() waiting
for either a parent or child buffer to empty, which was exacerbated by
rsync's own pipelining, that allows writes from both sides to proceed
even while reads aren't progressing. The hang was due to os.write() on a
blocking fd blocking until buffer space is available to complete the
write. Partial writes are only supported when O_NONBLOCK is enabled.
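The distinction, in code:

    import fcntl
    import os

    def set_nonblock(fd):
        # O_NONBLOCK is a file *status* flag, so F_SETFL is required.
        # F_SETFD manipulates descriptor flags (e.g. FD_CLOEXEC) and
        # "succeeds" without enabling partial, non-blocking writes.
        flags = fcntl.fcntl(fd, fcntl.F_GETFL)
        fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)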
Now ssh requires a tty allocation. This presents a scalability problem;
a future version could selectively allocate a tty only when typing
passwords is desired.
Sudo's tty handling is now moved into mitogen.master.
* Support passing Context() objects in function calls and return values.
  Now the fakessh demo from the documentation index works correctly.
* Since slaves can communicate with each other now, they should also use
the same approach to unpickling as the master already used. Collapse
away all the unpickle extension crap and hard-wire just the 3 types
that support unpickling.