Object serialization and deserialization with Python's built-in pickle library

We recently wanted to archive the download results produced by our crawler. Each result is a Python object (we don't want to save just the HTML or JSON; we want to be able to restore the entire download process), so we turned to Python's built-in pickle library (named after pickled cucumbers) to serialize objects into bytes and deserialize them when needed.


The following session shows the basic usage and purpose of pickle.

In [2]: import pickle 
In [3]: class A:
        pass 
In [4]: a = A()
In [5]: a.foo = 'hello'
In [6]: a.bar = 2
In [7]: pick_ed = pickle.dumps(a)
In [8]: pick_ed
Out[8]: b'\x80\x03c__main__\nA\nq\x00)\x81q\x01}q\x02(X\x03\x00\x00\x00fooq\x03X\x05\x00\x00\x00helloq\x04X\x03\x00\x00\x00barq\x05K\x02ub.'
In [9]: unpick = pickle.loads(pick_ed) 
In [10]: unpick
Out[10]: <__main__.A at 0x10ae67278>
In [11]: a
Out[11]: <__main__.A at 0x10ae67128>
In [12]: dir(unpick)
Out[12]:
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slotnames__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'bar',
 'foo']
In [13]: unpick.foo
Out[13]: 'hello'
In [14]: unpick.bar
Out[14]: 2


You can see that the usage of pickle is somewhat similar to json, but there are several fundamental differences:

json is a language-independent data interchange format, usually expressed as human-readable text. pickle serializes Python objects and is specific to Python; its output is binary data that is not human-readable. Moreover, json can serialize only a subset of the built-in types by default, whereas pickle can serialize a much wider range of objects.
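To make the difference concrete, here is a minimal sketch. The `record` dictionary is a made-up value chosen only because it mixes types that json rejects by default.

```python
import json
import pickle

# A value that mixes types json does not handle by default.
record = {"tags": {"a", "b"}, "raw": b"\x00\x01", "point": complex(1, 2)}

try:
    json.dumps(record)
    json_ok = True
except TypeError:
    # json raises TypeError for set, bytes and complex values.
    json_ok = False

# pickle round-trips the same value unchanged, as bytes.
blob = pickle.dumps(record)
restored = pickle.loads(blob)

print(json_ok)             # False
print(type(blob))          # <class 'bytes'>
print(restored == record)  # True
```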

There is also the older built-in marshal module, but it exists mainly to support .pyc files. It does not support user-defined types and is less complete; for example, it cannot handle cyclic references. If an object refers to itself, the pickle documentation warns that marshalling it can crash the Python interpreter.
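By contrast, pickle keeps track of the objects it has already serialized, so a self-referencing object survives a round trip. A small sketch with a list that contains itself:

```python
import pickle

# A list that contains itself: a minimal cyclic reference.
loop = [1, 2]
loop.append(loop)

# pickle memoizes objects it has already seen, so the cycle is preserved.
copy = pickle.loads(pickle.dumps(loop))

print(copy[:2])         # [1, 2]
print(copy[2] is copy)  # True: the copy refers to itself
print(copy is loop)     # False: it is a new object
```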

Version compatibility issues

Because pickle is specific to Python, and Python comes in different versions (with very large differences between 2 and 3), we have to consider whether a serialized object can be deserialized by a higher (or lower) version of Python.

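At the time of writing there are 5 pickle protocol versions, numbered 0 to 4. Higher protocol versions require newer Python versions: protocols 0-2 can be read by Python 2, while 3 and 4 are Python 3 only. (Python 3.8 later added protocol 5 and made protocol 4 the default.) The documentation describes them as follows: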

Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.
Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2. (Note: performance improves significantly from this version onward.)
Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.
Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.


Most of pickle's entry-point functions (such as dump(), dumps(), and the Pickler constructor) accept a protocol version argument. The module also defines two constants for it:

pickle.HIGHEST_PROTOCOL is currently 4

pickle.DEFAULT_PROTOCOL is currently 3
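The following sketch pickles the same value with every protocol the running interpreter supports. The `data` dictionary is an invented sample, and the exact sizes printed will vary with the Python version.

```python
import pickle

data = {"url": "https://example.com", "status": 200}

sizes = {}
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(data, protocol=proto)
    sizes[proto] = len(blob)
    # loads() takes no protocol argument; it detects the format itself.
    assert pickle.loads(blob) == data

# Protocols 2 and up begin with the PROTO opcode (0x80) and the version byte.
header = pickle.dumps(data, protocol=2)[:2]

print(sizes)
print(header)  # b'\x80\x02'
```

Passing protocol=pickle.HIGHEST_PROTOCOL picks the newest format available, at the cost of not being readable by older interpreters.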

Usage

The interface is similar to that of the built-in json module: dumps() returns the serialized result, while dump() serializes and then writes to a file. Likewise, there are load() and loads(). You can specify the protocol version when serializing with dump() or dumps(); this is not needed when deserializing, because the version is detected automatically, much as with the zip command.
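A minimal sketch of the file-based pair, dump() and load(). The temporary directory and the file name result.pkl are placeholders for wherever an archive would really be stored.

```python
import os
import pickle
import tempfile

result = {"url": "https://example.com", "body": b"<html></html>"}

# A temporary path stands in for the real archive location.
path = os.path.join(tempfile.mkdtemp(), "result.pkl")

# Pickle data is binary, so the file is opened in "wb" and "rb" mode.
with open(path, "wb") as f:
    pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    loaded = pickle.load(f)  # no protocol argument needed

print(loaded == result)  # True
```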

Serialization of built-in types

Most built-in types support serialization and deserialization. Functions need special attention: a function is serialized only by its name and the module it lives in. Neither the function's code nor its attributes (Python functions are first-class objects and can have attributes) are serialized. This means the module that defines the function must be importable in the unpickling environment; otherwise an ImportError or AttributeError is raised.

One interesting consequence: lambda functions cannot be pickled, because they all share the same name, <lambda>.
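Both points can be checked in a few lines. This sketch uses the built-in len as the named function and a throwaway lambda called `square`; the exact exception type for the lambda may differ between Python versions, so two are caught.

```python
import pickle

# A named function is stored as "module + name", not as code.
blob = pickle.dumps(len)
print(b"len" in blob)             # True: the name appears in the pickle
print(pickle.loads(blob) is len)  # True: unpickling looks the name up again

square = lambda x: x * x
print(square.__name__)            # <lambda>

try:
    pickle.dumps(square)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    # Looking up "<lambda>" in the defining module fails.
    lambda_pickled = False

print(lambda_pickled)             # False
```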

Serialization of custom types

As in the experimental code at the beginning of this article, in most cases serialization and deserialization work without any extra effort. Note that deserialization does not call the class's __init__() to initialize the object; instead, a new uninitialized instance is created and its attributes are then restored (very clever). In pseudocode:

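def save(obj):
    # Record the class and the instance attributes.
    return (obj.__class__, obj.__dict__)

def load(cls, attributes):
    # Create the instance without calling __init__().
    obj = cls.__new__(cls)
    # Restore the saved attributes directly.
    obj.__dict__.update(attributes)
    return obj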
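The claim that __init__() is skipped can be observed directly. In this sketch, `Page` is a hypothetical class with a counter added purely to record how often __init__() runs.

```python
import pickle

class Page:
    init_calls = 0  # class-level counter of __init__ invocations

    def __init__(self, url):
        Page.init_calls += 1
        self.url = url

page = Page("https://example.com")
clone = pickle.loads(pickle.dumps(page))

print(Page.init_calls)  # 1: only the original construction ran __init__
print(clone.url)        # https://example.com
print(clone is page)    # False
```

One practical consequence: any setup that __init__() performs beyond assigning attributes (opening connections, registering callbacks) does not happen again on load.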


If you need to do extra work during serialization, such as controlling which parts of the object's state are saved, you can use the magic methods of the pickle protocol. The most common are __getstate__() and __setstate__().
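As an illustration, here is a sketch of a hypothetical `Download` class that holds a threading.Lock, which pickle cannot serialize. The two methods drop the lock when saving and recreate it when loading; the class and its attributes are assumptions made for this example.

```python
import pickle
import threading

class Download:
    def __init__(self, url, body):
        self.url = url
        self.body = body
        self.lock = threading.Lock()  # locks cannot be pickled

    def __getstate__(self):
        # Copy the instance dict and drop the unpicklable entry.
        state = self.__dict__.copy()
        del state["lock"]
        return state

    def __setstate__(self, state):
        # Restore the saved attributes, then recreate what was dropped.
        self.__dict__.update(state)
        self.lock = threading.Lock()

original = Download("https://example.com", b"<html></html>")
restored = pickle.loads(pickle.dumps(original))

print(restored.url, restored.body)
print(restored.lock is not original.lock)  # True: a fresh lock was created
```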

Security issues (!)

The pickle documentation opens with a warning: never unpickle data from an untrusted source. Consider the following code:

>>> import pickle
>>> pickle.loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
hello world
0


When these bytes are unpickled, pickle imports os.system() and calls it with the echo command. That is harmless here, but what if the command were rm -rf /?
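If you want to see what such a payload would do without running it, the standard pickletools module can parse the opcode stream without importing or calling anything. A small sketch using the same bytes:

```python
import pickletools

payload = b"cos\nsystem\n(S'echo hello world'\ntR."

# genops() only parses the opcodes; nothing is imported or executed.
ops = [(op.name, arg) for op, arg, pos in pickletools.genops(payload)]

for name, arg in ops:
    print(name, arg)

# GLOBAL names a callable and REDUCE calls it: the pattern to be wary of.
globals_named = [arg for name, arg in ops if name == "GLOBAL"]
print(globals_named)  # ['os system']
```

Newer protocols name globals through the STACK_GLOBAL opcode instead, so this filter is only a rough inspection aid, not a safety check.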

The documentation suggests implementing the check in Unpickler.find_class(). This method is called whenever the pickle requests a global (a class or function), so it can restrict what is allowed to load.

import builtins
import io
import pickle

safe_builtins = {
    'range',
    'complex',
    'set',
    'frozenset',
    'slice',
}

class RestrictedUnpickler(pickle.Unpickler):

    def find_class(self, module, name):
        # Only allow safe classes from builtins.
        if module == "builtins" and name in safe_builtins:
            return getattr(builtins, name)
        # Forbid everything else.
        raise pickle.UnpicklingError("global '%s.%s' is forbidden" %
                                     (module, name))

def restricted_loads(s):
    """Helper function analogous to pickle.loads()."""
    return RestrictedUnpickler(io.BytesIO(s)).load()
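To try the idea out, the sketch below restates the allow-list approach compactly so that it runs on its own, then feeds it one harmless pickle and the os.system payload. The names `AllowListUnpickler` and `safe_loads` are chosen for this example only.

```python
import builtins
import io
import pickle

ALLOWED = {"range", "complex", "set", "frozenset", "slice"}

class AllowListUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Permit only the listed builtins; refuse every other global.
        if module == "builtins" and name in ALLOWED:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def safe_loads(data):
    return AllowListUnpickler(io.BytesIO(data)).load()

# Plain containers plus an allow-listed builtin load normally.
ok = safe_loads(pickle.dumps([1, 2, range(15)]))
print(ok)  # [1, 2, range(0, 15)]

# The os.system payload is rejected before anything is called.
try:
    safe_loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print(exc)  # global 'os.system' is forbidden
```

Note that an allow-list this strict also rejects your own classes; each type you need to restore has to be added explicitly.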


Compression

pickle does not compress its output automatically. I think this is a good, decoupled design: pickle only pickles, and compression is left to other libraries. You can also see for yourself that although a pickled file is not human-readable, the strings in it are still stored as plain ASCII rather than scrambled, so there is room to shrink it. To do so, call the compress function of a compression library. In our case the compressed size was about 1/3 of the original, which is impressive.
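A minimal sketch of that two-step approach, using zlib from the standard library as one possible compression library. The `page` value is synthetic and highly repetitive, so the ratio printed here says nothing about real crawl data.

```python
import pickle
import zlib

# Repetitive text, loosely resembling an HTML page.
page = {"url": "https://example.com",
        "html": "<div class='item'>hello</div>\n" * 500}

raw = pickle.dumps(page)
packed = zlib.compress(raw)

print(len(raw), len(packed))

# To read it back: decompress first, then unpickle.
restored = pickle.loads(zlib.decompress(packed))
print(restored == page)  # True
```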

Summary

Keeping globals importable is a bit tricky. The question I have to face is: if I need to open what I pickled today at some point in the future, will I still be able to?

Several versions are involved here: the project version, the Python version, the pickle protocol version, and the versions of the packages the project depends on. Of these, I think the Python version and the pickle protocol version can safely rely on backward compatibility and are easy to handle. The main concerns are the project version and the dependency versions. If the pickled object is very complex, an old archive may well be incompatible with the new version. One possible solution is to pin all dependencies completely, for example by recording their hash values. To restore a given pickle, you then restore the specific dependencies and the specific project commit from that time.
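One way to record that information is to store it next to the pickle itself. The sketch below is only an illustration of the idea: the `archive` and `inspect_archive` helpers, the envelope layout, and the `commit` string are all assumptions, not part of pickle or of our project.

```python
import pickle
import sys

def archive(obj, commit):
    """Wrap a pickle with metadata describing the environment that made it.

    `commit` is a hypothetical project revision string supplied by the caller.
    """
    protocol = pickle.DEFAULT_PROTOCOL
    envelope = {
        "python": tuple(sys.version_info[:3]),
        "protocol": protocol,
        "commit": commit,
        "payload": pickle.dumps(obj, protocol=protocol),
    }
    # The envelope holds only built-in types, so loading it needs no project code.
    return pickle.dumps(envelope, protocol=protocol)

def inspect_archive(blob):
    """Return the metadata without unpickling the payload."""
    envelope = pickle.loads(blob)
    return {key: value for key, value in envelope.items() if key != "payload"}

blob = archive({"status": 200}, commit="abc1234")
meta = inspect_archive(blob)
print(meta)
```

Because the payload stays as opaque bytes inside the envelope, the metadata can be read first and the matching environment restored before the payload is unpickled.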

But for now, our requirement is basically to pickle a requests.Response object, and I think we can rely on the backward compatibility of requests. If requests ever makes a breaking change, then even if our pickles stay compatible, our code will not be, and we can consider other strategies at that point.
