[Python] Comment charger paresseusement un module Python ? - analyser LazyLoader de MLflow-Tutoriel Python-php.cn

[Python] How do we lazyload a Python module? - analyzing LazyLoader from MLflow

(source de l'image : https://www.irasutoya.com/2019/03/blog-post_72.html)

Introduction

Un jour, je parcourais quelques bibliothèques ML populaires en Python, dont MLflow. En jetant un coup d'œil à son code source, une classe a attiré mon intérêt, LazyLoader dans __init__.py (enfin, cela reflète en fait le projet wandb, mais le code d'origine a changé par rapport à ce que MLflow utilise actuellement, comme vous pouvez le voir).

Vous avez probablement entendu parler du concept de chargement différé dans de nombreux contextes, tels que le chargement d'images frontales Web, la stratégie de mise en cache, etc. Je pense que l'essence de tous ces concepts de chargement paresseux est que "Je suis trop paresseux pour charger MAINTENANT" - oui, les mots cachés "en ce moment" . À savoir, l’application chargera et utilisera cette ressource uniquement lorsque cela sera nécessaire. Ainsi, ici, dans cette bibliothèque MLflow, les modules ne sont chargés que lorsque les ressources qu'ils contiennent (variables, fonctions et classes) sont accessibles.

Mais COMMENT ? C'était mon principal intérêt. J'ai donc lu le code source, qui semblait très simple à première vue. Cependant, étonnamment, il a fallu un peu de temps pour comprendre comment cela fonctionne, et j'ai beaucoup appris en lisant le code. Cet article concerne l'analyse de ce code source de MLflow afin que nous comprenions comment fonctionne un tel chargement paresseux à l'aide de diverses techniques du langage Python.

Jouer avec LazyLoader

Pour les besoins de notre analyse, j'ai créé un package simple appelé lazyloading sur ma machine locale et placé les modules comme suit :


lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py

Copier après la connexion

__init__.py : ce fichier transforme le répertoire entier en un package.
__main__.py : Ce fichier est le point d'entrée lorsque nous voulons exécuter l'intégralité du package comme suit : python -m lazyloading.
lazy_load.py : LazyLoader est dans ce fichier.
heavy_module.py : Ceci représente un module avec des packages lourds à charger (comme PyTorch) pour une simulation :


import time

for i in range(5):
    time.sleep(1)
    print(5 - i, " seconds left before loading")

print("I am heavier than Pytorch!")

HEAVY_ATTRIBUTE = "heavy”

Copier après la connexion

Ensuite, nous importons ce heavy_module dans __main__.py :


if __name__ == "__main__":
    from lazyloading import heavy_module

Copier après la connexion

Exécutons ce package et voyons le résultat :


python -m lazyloading
5  seconds left before loading
4  seconds left before loading
3  seconds left before loading
2  seconds left before loading
1  seconds left before loading
I am heavier than pytorch!

Copier après la connexion

Ici, nous pouvons clairement voir que si nous importons simplement des packages lourds tels que PyTorch, cela pourrait entraîner une surcharge pour l'ensemble de l'application. C'est pourquoi nous avons besoin d'un chargement paresseux ici. Changeons __main__.py pour qu'il ressemble à ceci :


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")
    print("nothing happens yet")
    print(heavy_module.HEAVY_ATTRIBUTE)

Copier après la connexion

Et le résultat devrait être :


python -m lazyloading
nothing happens yet
5  seconds left before loading
4  seconds left before loading
3  seconds left before loading
2  seconds left before loading
1  seconds left before loading
heavy

Copier après la connexion

Oui, tout module importé par LazyLoader n'a pas besoin d'exécuter de script ni d'importer d'autres packages. Cela se produit uniquement lors de l'accès à un attribut du module. C'est le pouvoir du lazyloading !

Comment fonctionne LazyLoader dans MLflow ? - analyse du code source

Le code lui-même est court et simple. J'ai ajouté des annotations de type et quelques commentaires (lignes entourées de <, >) pour des explications. Tous les autres commentaires sont ceux du code source original.


"""Utility to lazy load modules."""
import importlib
import sys
import types

from typing import Any, TypeVar

T = TypeVar("T") # <this is added by me>

class LazyLoader(types.ModuleType):
    """Class for module lazy loading.

    This class helps lazily load modules at package level, which avoids pulling in large
    dependencies like `tensorflow` or `torch`. This class is mirrored from wandb's LazyLoader:
    https://github.com/wandb/wandb/blob/79b2d4b73e3a9e4488e503c3131ff74d151df689/wandb/sdk/lib/lazyloader.py#L9
    """

    _local_name: str # <the name of the package that is used inside code>
    _parent_module_globals: dict[str, types.ModuleType] # <importing module's namespace, accessible by calling globals()>
    _module: types.ModuleType | None # <actual module>

    def __init__(
        self, 
        local_name: str, 
        parent_module_globals: dict[str, types.ModuleType], 
        name: Any # <to be used in types.ModuleType(name=str(name)), the full package name (such as pkg.subpkg.subsubpkg)>
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None

        super().__init__(str(name)) 

    def _load(self) -> types.ModuleType:
        """Load the module and insert it into the parent's globals."""
        if self._module:
            # If already loaded, return the loaded module.
            return self._module

        # Import the target module and insert it into the parent's namespace

        # <see https://docs.python.org/3/library/importlib.html#importlib.import_module>
        # <absolute import, importing the module itself from a package rather than the top-level package only(like __import__)>
        # <here, self.__name__ is the variable `name` in __init__>
        # <this is why that `name` in __init__ must be the full module path>
        module = importlib.import_module(self.__name__) # this automatically updates sys.modules

        # <add the name of the module to the importing module(=parent module)'s namespace>
        # <so that you can use this module's name as a variable inside the importing module, even if it is called inside a function defined in the importing module>
        self._parent_module_globals[self._local_name] = module

        # <add the module to the list of loaded modules for caching>
        # <see https://docs.python.org/3/reference/import.html#the-module-cache>
        # <this makes possible to import cached module with the variable _local_name
        sys.modules[self._local_name] = module

        # Update this object's dict so that if someone keeps a reference to the `LazyLoader`,
        # lookups are efficient (`__getattr__` is only called on lookups that fail).
        self.__dict__.update(module.__dict__)

        return module

    def __getattr__(self, item: T) -> T:
        module = self._load()
        return getattr(module, item)

    def __dir__(self):
        module = self._load()
        return dir(module)

    def __repr__(self):
        if not self._module:
            return f"<module '{self.__name__} (Not loaded yet)'>"
        return repr(self._module)

Copier après la connexion

Maintenant, étudions le code tout en chargeant paresseux notre heavy_module. Puisque nous n'avons plus besoin de simuler la lourdeur du module, supprimons la partie boucle time.sleep(1).

1. Création d'une instance de LazyLoader, proxy du module d'origine

Regardons __init__() de LazyLoader.


class LazyLoader(types.ModuleType):
    # …
    # code omitted
    # …

    def __init__(
        self, 
        local_name: str, 
        parent_module_globals: dict[str, types.ModuleType], 
        name: Any # <to be used in types.ModuleType(name=str(name)); the full package name(such as pkg.subpkg.subsubpkg)>
    ):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        self._module = None

        super().__init__(str(name))

Copier après la connexion

Nous fournissons local_name, parent_module_globals et le nom du constructeur __init__(). Pour le moment, nous ne sommes pas sûrs de ce que tout cela signifie, mais au moins la dernière ligne indique que nous générons réellement un module - super().__init__(str(name)), puisque LazyLoader hérite de types.ModuleType. En fournissant le nom de la variable, notre module créé par LazyLoader est reconnu comme un module portant le nom name (qui est le même que heavy_module.__name__).

L'impression du module lui-même le prouve :


# __main__.py
# run python -m lazyloading

if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.__name__)

Copier après la connexion

qui donne sur notre borne :


lazyloading.heavy_module

Copier après la connexion

Cependant, dans le constructeur, nous avons uniquement attribué des valeurs aux variables d'instance et donné le nom du module à ce proxy module. Maintenant, que se passe-t-il lorsque l'on essaie d'accéder à un attribut du module ?

2. Accéder à un attribut - getattribute, getattr et getattr

C'est l'une des parties amusantes de ce cours. Que se passe-t-il lorsque l'on accède à un attribut d'un objet Python en général ? Supposons que nous accédions à HEAVY_ATTRIBUTE de heavy_module en appelant heavy_module.HEAVY_ATTRIBUTE. À partir du code ici ou de votre propre expérience dans plusieurs projets Python, vous pourriez deviner que __getattr__() est appelé, et c'est en partie correct. Regardez les documents officiels :

Appelé lorsque l'accès à l'attribut par défaut échoue avec une AttributeError (soit getattribute() lève une AttributeError car name n'est pas un attribut d'instance ou un attribut dans l'arborescence de classes pour soi ; ou get d'une propriété de nom déclenche AttributeError).

(Please ignore __get__ because it is out of scope of this post, and our LazyLoader doesn’t implement __get__ either).

So __getattribute__() the key method here is __getattribute__. According to the docs, when we try to access an attribute, __getattribute__ will be called first, and if the attribute we’re looking for cannot be found by __getattribute__, AttributeError will be raised, which will in turn invoke our __getattr__ in the code. To verify this, let’s override __getattribute__ of the LazyLoader class, and change __getattr__() a little bit as follows:


def __getattribute__(self, name: str) -> Any:
    try:
        print(f"__getattribute__ is called when accessing attribute '{name}'")
        return super().__getattribute__(name)

    except Exception as error:
        print(f"an error has occurred when __getattribute__() is invoked as accessing '{name}': {error}")
        raise

def __getattr__(self, item: T) -> T:
    print(f"__getattr__ is called when accessing attribute '{item}'")
    module = self._load()
    return getattr(module, item)

Copier après la connexion

When we access HEAVY_ATTRIBUTE that exists in heavy_module, the result is:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)

Copier après la connexion


python -m lazyloading
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
an error has occurred when __getattribute__() is invoked as accessing 'HEAVY_ATTRIBUTE': module 'lazyloading.heavy_module' has no attribute 'HEAVY_ATTRIBUTE'
__getattr__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
__getattribute__ is called when accessing attribute '_load'
__getattribute__ is called when accessing attribute '_module'
__getattribute__ is called when accessing attribute '__name__'
I am heavier than Pytorch!
__getattribute__ is called when accessing attribute '_parent_module_globals'
__getattribute__ is called when accessing attribute '_local_name'
__getattribute__ is called when accessing attribute '__dict__'
heavy

Copier après la connexion

So __getattr__ is actually not called directly, but __getattribute__ is called first, and it raises AttributeError because our LazyLoader instance doesn’t have attribute HEAVY_ATTRIBUTE. Now __getattr__() is called as a failover. Then we meet getattr(), but this code line getattr(module, item) is equivalent to code module.item in Python. So eventually, we access the HEAVY_ATTRIBUTE in the actual module heavy_module, if module variable in __getattr__() is correctly imported and returned by self._load().

But before we move on to investigating _load() method, let’s call HEAVY_ATTRIBUTE once again in __main__.py and run the package:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("lazyloading.heavy_module", globals(), "lazyloading.heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)
    print(heavy_module.HEAVY_ATTRIBUTE)

Copier après la connexion

Now we see the additional logs on the terminal:


# … the same log as above
__getattribute__ is called when accessing attribute 'HEAVY_ATTRIBUTE'
heavy

Copier après la connexion

It seems that __getattribute__ can access HEAVY_ATTRIBUTE now inside the proxy module(our LazyLoader instance). This is because(!!!spoiler alert!!!) _load caches the accessed attribute in __dict__ attribute of the LazyLoader instance. We’ll get back to this in the next section.

3. Loading and caching the actual module

This section covers the core part the post - loading the actual module in the function _load().

3-1. Module caching at the level of LazyLoader class

First, it checks whether our LazyLoader instance has already imported the module before (which reminds us of the Singleton pattern).


if self._module:
    # If already loaded, return the loaded module.
    return self._module

Copier après la connexion

3-2. Importing the actual module with importlib.import_module

Otherwise, the method tries to import the module named __name__, which we saw in the __init__ constructor:


# <see https://docs.python.org/3/library/importlib.html#importlib.import_module>
# <absolute import, importing the module itself from a package rather than the top-level package only(like __import__)>
# <here, self.__name__ is the variable `name` in __init__>
# <this is why that `name` in __init__ must be the full module path>
module = importlib.import_module(self.__name__) # this automatically updates sys.modules

Copier après la connexion

According to the docs of importlib.import_module, when we don’t provide the pkg argument and only the path string, the function tries to import the package in the absolute manner. Therefore, when we create a LazyLoader instance, the name argument should be the absolute term. You can run your own experiment to see it raises ModuleNotFoundError:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("heavy_module", globals(), "heavy_module")

    print(heavy_module.HEAVY_ATTRIBUTE)

Copier après la connexion


# logs omitted
ModuleNotFoundError: No module named 'heavy_module'

Copier après la connexion

Notably, invoking importlib.import_module(self.__name__) caches the module with name self.__name__ in the global scope. If you run the following lines in __main__.py


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader
    heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")

    # check whether the module is cached at the global scope
    import sys
    print("lazyloading.heavy_module" in sys.modules)

    # accessing any attribute to load the module
    heavy_module.HEAVY_ATTRIBUTE

    print("lazyloading.heavy_module" in sys.modules)

Copier après la connexion

and run the package, then the logs should be:


python -m lazyloading
False
I am heavier than Pytorch!
True

Copier après la connexion

This way of caching using sys.modules is related to the next two lines that also cache the module in different ways.

3-3. Caching the module with given local_name


# <add the name of the module to the importing module(=parent module)'s namespace>
# <so that you can use this module's name as a variable inside the importing module, even if it is called inside a function defined in the importing module>
self._parent_module_globals[self._local_name] = module

# <add the module to the list of loaded modules for caching>
# <see https://docs.python.org/3/reference/import.html#the-module-cache>
# <this makes possible to import cached module with the variable _local_name
sys.modules[self._local_name] = module

Copier après la connexion

Both lines cache the module in the dictionaries self._parent_module_globals and sys.modules respectively, but with the key self._local_name(not self.__name__). This is the variable we provided as local_name when creating this proxy module instance with __init__(). But what does this caching accomplish?

First, we can use the module with the given _local_name in the "parent module"’s globals(from the parameter’s name and seeing how MLflow uses in its uppermost __init__.py, we can infer that here the word globals means (globals()). This means that importing the module inside a function doesn’t limit the module to be used outside the function’s scope:


if __name__ == "__main__":
    from lazyloading.lazy_load import LazyLoader

    def load_heavy_module() -> None:
        # import the module inside a function
        heavy_module = LazyLoader("heavy_module", globals(), "lazyloading.heavy_module")
        print(heavy_module.HEAVY_ATTRIBUTE)

    # loads the heavy_module inside the function's scope
    load_heavy_module()

    # the module is now in the scope of this module
    print(heavy_module)

Copier après la connexion

Running the package gives:


python -m lazyloading
I am heavier than Pytorch!
heavy
<module 'lazyloading.heavy_module' from ‘…’> # the path of the heavy_module(a Python file)

Copier après la connexion

Of course, if you provide the second argument locals(), then you’ll get NameError(give it a try!).

Second, we can also import the module in any other place inside the whole package with the given local name. Let’s create another module heavy_module_loader.py inside the current package lazyloading :


lazyloading/
├─ __init__.py
├─ __main__.py
├─ lazy_load.py
├─ heavy_module.py
├─ heavy_module_loader.py

Copier après la connexion

Note that I used a custom name heavy_module_local for the local variable name of the proxy module.


# heavy_module_loader.py

from lazyloading.lazy_load import LazyLoader

heavy_module = LazyLoader("heavy_module_local", globals(), "lazyloading.heavy_module")
heavy_module.HEAVY_ATTRIBUTE

Copier après la connexion

Now let __main__.py be simpler:


from lazyloading import heavy_module_loader

if __name__ == "__main__":
    import heavy_module_local
    print(heavy_module_local)

Copier après la connexion

Your IDE will probably alert this line as having a syntax error, but actually running it will give us the expected result:


python -m lazyloading
I am heavier than Pytorch!
<module 'lazyloading.heavy_module' from ‘…’> # the path of the heavy_module(a Python file)

Copier après la connexion

Although MLflow seems to use the same string value for both local_name and name when creating LazyLoader instances, we can use the local_name as an alias for the actual package name, thanks to this caching mechanism.

3-4. Caching the attributes of the actual module in dict


# Update this object's dict so that if someone keeps a reference to the `LazyLoader`,
# lookups are efficient (`__getattr__` is only called on lookups that fail).
self.__dict__.update(module.__dict__)

Copier après la connexion

In Python, the attribute __dict__ gives the dictionary of attributes of the given object. Updating this proxy module’s attributes with the actual module’s ones makes the user easier to access the attributes of the real one. As we discussed in section 2(2. Accessing an attribute - __getattribute__, __getattr__, and getattr) and noted in the comments of the original source code, this allows __getattribute__ and __getattr__ to directly access the target attributes.

In my view, this part is somewhat unnecessary, as we already cache modules and use them whenever their attributes are accessed. However, this could be useful when we need to debug and inspect __dict__.

4. dir and repr

Similar to __dict__, these two dunder functions might not be strictly necessary when using LazyLoader modules. However, they could be useful for debugging. __repr__ is particularly helpful as it indicates whether the module has been loaded.


<p>if not self.<em>module</em>:<br>
    return f"<module '{self.<em>name</em>_} (Not loaded yet)'>"<br>
return repr(self._module)</p>

Copier après la connexion

Conclusion

Although the source code itself is quite short, we covered several advanced topics, including importing modules, module scopes, and accessing object attributes in Python. Also, the concept of lazyloading is very common in computer science, but we rarely get the chance to examine how it is implemented in detail. By investigating how LazyLoader works, we learned more than we expected. Our biggest takeaway is that short code doesn’t necessarily mean easy code to analyze!

Ce qui précède est le contenu détaillé de. pour plus d'informations, suivez d'autres articles connexes sur le site Web de PHP en chinois!