From 051285437f02a1a376629002ba1368ad7aa18e97 Mon Sep 17 00:00:00 2001
From: David Wilson
Date: Tue, 3 Oct 2017 18:11:35 +0530
Subject: [PATCH] importer: module preloading docs

---
 docs/howitworks.rst | 99 +++++++++++++++++++++++++++++++++++++++++----
 docs/index.rst      | 20 +++++++++
 2 files changed, 112 insertions(+), 7 deletions(-)

diff --git a/docs/howitworks.rst b/docs/howitworks.rst
index e624af37..5fa9e65d 100644
--- a/docs/howitworks.rst
+++ b/docs/howitworks.rst
@@ -270,9 +270,12 @@ Masters listen on the following handles:
 
 .. data:: mitogen.core.GET_MODULE
 
-    Receives `(reply_to, fullname)` 2-tuples, looks up the source code for the
-    module named ``fullname``, and writes the source along with some metadata
-    back to the handle ``reply_to``. If lookup fails, ``None`` is sent instead.
+    Receives the name of a module to load `fullname`, locates the source code
+    for ``fullname``, and routes one or more ``LOAD_MODULE`` messages back
+    towards the sender of the ``GET_MODULE`` request. See below for a longer
+    discussion of ``GET_MODULE``/``LOAD_MODULE``.
+
+    If lookup fails, ``None`` is sent instead.
 
 .. data:: mitogen.core.ALLOCATE_ID
 
@@ -285,6 +288,25 @@ Masters listen on the following handles:
 
 Children listen on the following handles:
 
+.. _LOAD_MODULE:
+.. data:: mitogen.core.LOAD_MODULE
+
+    Receives `(pkg_present, path, compressed, related)` tuples, composed of:
+
+    * **pkg_present**: Either ``None`` for a plain ``.py`` module, or a list of
+      canonical names of submodules existing within this package. For example, a
+      ``LOAD_MODULE`` for the ``mitogen`` package would return a list like:
+      `["mitogen.core", "mitogen.fakessh", ..]`. This list
+      is used by children to avoid generating useless round-trips due to Python
+      2.x's ``import`` statement behavior.
+    * **path**: Original filesystem path where the module was found on the master.
+    * **compressed**: :py:mod:`zlib`-compressed module source code.
+    * **related**: list of canonical module names on which this module appears
+      to depend. Used by children that have ever started any children of their
+      own to preload those children with ``LOAD_MODULE`` messages in response
+      to a ``GET_MODULE`` request.
+
+.. _CALL_FUNCTION:
 .. data:: mitogen.core.CALL_FUNCTION
 
     Receives `(mod_name, class_name, func_name, args, kwargs)`
@@ -312,6 +334,7 @@ Children listen on the following handles:
     to it, and arranging for the connection to its parent to be closed shortly
     thereafter.
 
+.. _ADD_ROUTE:
 .. data:: mitogen.core.ADD_ROUTE
 
     Receives `(target_id, via_id)` integer tuples, describing how messages
@@ -470,9 +493,10 @@ Python if it can satisfy the import by itself, and if not, indicating to Python
 that it is capable of loading the module.
 
 In :py:meth:`load_module() ` an RPC is
-started to the parent context, requesting the module source code. Once the
-source is fetched, the method builds a new module object using the best
-practice documented in PEP-302.
+started to the parent context, requesting the module source code by way of a
+``GET_MODULE``. If the parent context does not have the module available, it
+recursively forwards the request upstream, while avoiding duplicate requests
+for the same module from its own threads and any child contexts.
 
 
 Neutralizing ``__main__``
@@ -510,7 +534,7 @@ In Python 2.x, Python will first try to load ``mypkg.sys`` and ``mypkg.os``,
 which do not exist, before falling back on :py:mod:`sys` and :py:mod:`os`.
 
 These negative imports present a challenge, as they introduce a large number of
-pointless network roundtrips. Therefore in addition to the
+pointless network round-trips. Therefore in addition to the
 :py:mod:`zlib`-compressed source, for packages the master sends along a list of
 child modules known to exist.
 
@@ -521,6 +545,67 @@ module does not appear in the enumeration of child modules belonging to the
 package that was provided by the master.
 
 
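As a rough illustration of the check described above, a child could consult the
submodule list delivered in ``pkg_present`` before issuing a request. This is a
minimal sketch, not Mitogen's implementation: ``is_known_missing`` and the
``present_by_package`` mapping are hypothetical names introduced here.

```python
def is_known_missing(fullname, present_by_package):
    """Return True if `fullname` can be refused without a round-trip.

    `present_by_package` is a hypothetical dict mapping a package name to the
    list of submodule names the master reported for it (the `pkg_present`
    field). A module is known to be missing only when its parent package was
    loaded from the master and the module is absent from that list.
    """
    pkg, _, _ = fullname.rpartition(".")
    present = present_by_package.get(pkg)
    return present is not None and fullname not in present


# Example: "mypkg" was loaded from the master with two known submodules.
known = {"mypkg": ["mypkg.core", "mypkg.utils"]}
```

With this mapping, ``is_known_missing("mypkg.sys", known)`` is true, so the
Python 2.x attempt to import ``mypkg.sys`` never reaches the network, while
top-level names such as ``sys`` and packages the child knows nothing about
still fall through to a normal request.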
+Import Preloading
+#################
+
+To further avoid round-trips, when a module or package is requested by a child,
+its bytecode is scanned in the master to find all the module's ``import``
+statements, and of those, which associated modules appear to have been loaded
+in the master's :py:data:`sys.modules`.
+
+The :py:data:`sys.modules` check is necessary to handle various kinds of
+conditional execution, for example, when a module's code guards an ``import``
+statement based on the active Python runtime version, operating system, or
+optional third party dependencies.
+
+Before replying to a child's request for a module with dependencies:
+
+* If the request is for a package, any dependent modules used by the package
+  that appear within the package itself are known to be missing from the child,
+  since the child requested the top-level package module. They are therefore
+  pre-loaded into the child using ``LOAD_MODULE`` messages before sending the
+  ``LOAD_MODULE`` message for the requested package module itself. In this way,
+  the child will already have dependent modules cached by the time it receives
+  the requested module, avoiding one round-trip for each dependency.
+
+  For example, when a child requests the ``django`` package, and the master
+  determines the ``django`` module code in the master has import statements for
+  ``django.utils``, ``django.utils.lru_cache``, and ``django.utils.version``,
+  and that execution of the module code on the master caused those modules to
+  appear in the master's :py:data:`sys.modules`, there is a high probability
+  execution of the ``django`` module code in the child will cause the same
+  modules to be loaded. Since all those modules exist within the ``django``
+  package, and we already know the child lacks that package, it is safe to
+  assume the child will make follow-up requests for those modules too.
+
+  In the example, this replaces 4 round-trips with 1 round-trip.
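The scan described at the start of this section can be approximated with the
standard library's :py:mod:`dis` module. This is a minimal sketch assuming
Python 3.4 or newer (for ``dis.get_instructions``); ``scan_imports`` and
``find_related`` are hypothetical names, and the real importer may differ, for
instance in how it treats relative imports.

```python
import dis
import sys
import types


def scan_imports(code):
    """Yield names referenced by IMPORT_NAME opcodes in `code`, recursing
    into nested code objects such as function and class bodies."""
    for instr in dis.get_instructions(code):
        if instr.opname == "IMPORT_NAME":
            yield instr.argval
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            for name in scan_imports(const):
                yield name


def find_related(source, filename="<example>", loaded=None):
    """Return imported names that also appear in `loaded` (defaults to
    sys.modules), preserving first-seen order and dropping duplicates."""
    loaded = sys.modules if loaded is None else loaded
    code = compile(source, filename, "exec")
    related = []
    for name in scan_imports(code):
        # Relative imports such as "from . import x" yield an empty name.
        if name and name in loaded and name not in related:
            related.append(name)
    return related


EXAMPLE = (
    "import os\n"
    "import json\n"
    "try:\n"
    "    import not_a_real_module_xyz\n"
    "except ImportError:\n"
    "    pass\n"
)
```

Filtering against ``loaded`` is what discards the guarded import in
``EXAMPLE``: the name appears in the bytecode, but because it never reached
the master's module table it is not treated as a dependency worth preloading.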
+
+For any package module ever requested by a child, the parent keeps a note of
+the name of the package for one final optimization:
+
+* If the request is for a sub-module of a package, and it is known the child
+  loaded the package's implementation from the parent, then any dependent
+  modules of the requested module at any nesting level within the package that
+  are known to be missing are sent using ``LOAD_MODULE`` messages before sending
+  the ``LOAD_MODULE`` message for the requested module, avoiding 1 round-trip
+  for each dependency within the same top-level package.
+
+  For example, when a child has previously requested the ``django`` package,
+  the parent knows it was completely absent on the child. Therefore when the
+  child subsequently requests the ``django.db`` module, it is safe to assume
+  the child will generate subsequent ``GET_MODULE`` requests for the 2
+  ``django.conf``, 3 ``django.core``, 2 ``django.db``, 3 ``django.dispatch``,
+  and 7 ``django.utils`` indirect dependencies for ``django.db``.
+
+  In the example, this replaces 17 round-trips with 1 round-trip.
+
+The method used to detect import statements is similar to the standard library
+:py:mod:`modulefinder` module: rather than analyze module source code,
+``IMPORT_NAME`` opcodes are extracted from the module's bytecode. This is because
+clean source analysis methods (:py:mod:`ast` and :py:mod:`compiler`) are an
+order of magnitude slower, and incompatible across major Python versions.
+
+
 Child Module Enumeration
 ########################
 
diff --git a/docs/index.rst b/docs/index.rst
index 49a56464..36c44f82 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -142,6 +142,26 @@ further effort.
 
 .. _py2exe: http://www.py2exe.org/
 
+Common sources of import latency and bandwidth consumption are mitigated:
+
+* Modules need only be uploaded once per directly connected context. Subsequent
+  requests for modules from children of that context will be served by the
+  child itself.
+
+* Imports by threads within a context triggering a load are deduplicated and
+  joined with any identical requests triggered by other threads in the same
+  context and children in the context's subtree.
+
+* No round-trip is required for negative responses due to Python 2's import
+  statement semantics: children have a list of submodules belonging to a
+  package, and ignore requests for submodules that did not exist on the master.
+
+* Imports are extracted from each module, compared to those found in memory,
+  and recursively preloaded into children requesting that module, minimizing
+  round-trips to one per package nesting level. For example,
+  :py:mod:`django.db.models` only requires 3 round-trips to transfer 456KiB,
+  representing 1.7MiB of uncompressed source split across 148 modules.
+
 SSH Client Emulation
 ####################