A performant and extensible Web Server with Zig and Python

Linda Hamilton
Published: 2024-10-07 06:12:01

Preface

I am passionate about software development, and in particular the puzzle of ergonomically creating software systems that solve the widest range of problems while making as few compromises as possible. I also like to think of myself as a systems developer, which by Andrew Kelley's definition means a developer interested in fully understanding the systems they work with. In this blog I share my ideas on solving the following problem: building a reliable and performant full-stack enterprise application. Quite a challenge, isn't it? In the blog I focus on the "performant web server" part - I feel I can offer a fresh perspective there, as the rest is either well-trodden ground or I have nothing to add.

An important caveat: the code sketches below are untested - I have not actually tried any of this out. Yes, that is a major flaw, but actually implementing it would take a lot of time, which I don't have, and between publishing a flawed blog and not publishing it at all, I stick with the former. You have been warned.

Which parts will we use to assemble our application?

  • A frontend you are familiar with, but if you want minimal dependencies - there is HTMX, with Zig in the form of WASM.
  • A Zig web server, tightly integrated with the Linux kernel. This is the performance part, and the focus of this blog.
  • A Python backend, integrated with Zig. This is the complex part.
  • Integration with durable execution systems such as Temporal and Flowable. This helps with reliability, and will not be discussed in the blog.

With our tools decided, let's get started!

Are coroutines overrated?

Zig has no language-level support for coroutines :( and coroutines are what every high-performance web server is built on. So, is there no point in trying?

Hold on, let's put on our systems programmer hats first. Coroutines are not a silver bullet - nothing is. What are the actual benefits and drawbacks involved?

It is common knowledge that coroutines (userspace threads) are more lightweight and faster. But in what ways exactly? (The answers here are largely speculation - take them with a grain of salt and test things yourself.)

  • They start out with less stack space by default (2KB instead of 4MB). But this can be adjusted manually.
  • They cooperate better with a userspace scheduler. Because the kernel scheduler is preemptive, the tasks a thread executes are allotted time slices. If the actual task does not fit the slice, some CPU time is wasted. Contrast this with goroutines, which pack as many micro-tasks from different goroutines as possible into the same time slice of an OS thread.

For example, the Go runtime multiplexes goroutines onto OS threads. The threads share the page table, as well as the other resources owned by the process. If we introduce CPU isolation and affinity into the mix - the threads run continuously on their respective CPU cores, all of the OS data structures stay in memory without needing to be swapped out, and the userspace scheduler hands CPU time to goroutines with precision, because it uses the cooperative multitasking model. Can contention even occur?

The performance wins are achieved by setting aside the OS-level abstraction of a thread and replacing it with that of a goroutine. But is nothing lost in translation?

Can we cooperate with the kernel?

I would argue that the "true" OS-level abstraction for an independent unit of execution is not even a thread - it is actually the OS process. In fact, the distinction here is not that clear-cut - all that separates a thread from a process is the different PID and TID values. As for file descriptors, virtual memory, signal handlers, and traced resources - whether these are separate for the child is specified in the arguments of the "clone" syscall. Therefore, I will use the term "process" to mean a thread of execution that owns its own system resources - primarily CPU time, memory, and open file descriptors.
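
Since the post itself ships no code, here and below are hedged C sketches - C being the ABI both Zig and Python speak - not anything from the original. This first one illustrates how thin the thread/process distinction is: with clone(2), the flag set alone decides what the child shares with the parent.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int worker(void *arg) {
    (void)arg;
    printf("child: pid=%d\n", getpid());
    return 0;
}

int main(void) {
    const size_t STACK_SIZE = 1024 * 1024;
    char *stack = malloc(STACK_SIZE);
    if (!stack) return 1;

    /* "Thread": share the address space, open files, FS info and signal
       handlers with the parent. (CLONE_THREAD children cannot be reaped
       with waitpid, so this sketch only shows the flag set.) */
    int thread_flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD;
    (void)thread_flags;

    /* "Process": share nothing; SIGCHLD in the flags makes the child
       reapable via waitpid below. The stack pointer is the high end,
       since stacks grow downward on x86. */
    pid_t pid = clone(worker, stack + STACK_SIZE, SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); return 1; }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}
```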

Now why is this important? Each unit of execution has its own demands for system resources. Each complex task can be broken down into units, where each one can make its own, predictable, request for resources - memory and CPU time. And the further up the tree of subtasks you go, towards a more general task - the system resources graph forms a bell curve with long tails. And it is your responsibility to make sure that the tails do not overrun the system resources limit. But how is that done, and what happens if that limit is in fact overrun?

If we use the model of a single process and many coroutines for independent tasks, when one coroutine overruns the memory limit - because memory usage is tracked at the process level, the whole process is killed. That's in the best case - if you make use of cgroups (which is automatically the case for pods in Kubernetes, which have a cgroup per pod) - the whole cgroup is killed. Building a reliable system requires taking this into account. And what about CPU time? If our service gets hit with many compute-intensive requests at the same time, it will become unresponsive. Then deadlines, cancellations, retries, and restarts follow.
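
To make the group-level kill concrete: under cgroup v2, the memory ceiling for a whole group of processes is a single file. A minimal sketch, assuming a cgroup v2 hierarchy at /sys/fs/cgroup and a hypothetical pre-created group named "demo":

```c
#include <stdio.h>

int main(void) {
    /* Assumes: mkdir /sys/fs/cgroup/demo has been done (as root), and
       the processes of interest were added to its cgroup.procs file. */
    FILE *f = fopen("/sys/fs/cgroup/demo/memory.max", "w");
    if (!f) { perror("fopen"); return 1; }

    /* Every process (and every coroutine within them) shares this single
       256 MiB budget; one outlier overrunning it can take the whole
       group down via the OOM killer. */
    fprintf(f, "%lu\n", 256UL * 1024 * 1024);
    return fclose(f) == 0 ? 0 : 1;
}
```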

The only realistic way to deal with these scenarios for most mainstream software stacks is leaving "fat" in the system - some unused resources for the tail of the bell curve - and limiting the number of concurrent requests - which, again, leads to unused resources. And even with that, we will get OOM killed or go unresponsive every once in a while - including for "innocent" requests that happen to be in the same process as the outlier. This compromise is acceptable to many, and serves software systems in practice well enough. But can we do better?

A concurrency model

Since resource usage is tracked per-process, ideally we would spawn a new process for each small, predictable unit of execution. Then we set the ulimit for CPU time and memory - and we're good to go! ulimit has soft and hard limits, which allow the process to terminate gracefully upon hitting the soft limit and, if that does not happen (possibly due to a bug), to be terminated forcefully upon hitting the hard limit. Unfortunately, spawning new processes on Linux is slow, and a process per request is not supported by many web frameworks, nor by other systems such as Temporal. Additionally, process switching is more expensive - mitigated by CoW and CPU pinning, but still not ideal. Long-running processes are an inevitable reality, unfortunately.
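
A minimal sketch of the soft/hard mechanics just described (the budget numbers are arbitrary): the kernel delivers SIGXCPU at the soft CPU limit, giving the process a chance to wind down, and SIGKILL at the hard one.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

static volatile sig_atomic_t over_budget = 0;

/* SIGXCPU arrives once the soft CPU limit is crossed - our cue to wrap up. */
static void on_sigxcpu(int sig) { (void)sig; over_budget = 1; }

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigxcpu;
    sigaction(SIGXCPU, &sa, NULL);

    /* Soft limit: 2 s of CPU time -> SIGXCPU, graceful shutdown.
       Hard limit: 3 s -> SIGKILL, for when the graceful path is buggy. */
    struct rlimit lim = { .rlim_cur = 2, .rlim_max = 3 };
    if (setrlimit(RLIMIT_CPU, &lim) != 0) { perror("setrlimit"); return 1; }

    /* A busy loop standing in for a runaway request handler. */
    unsigned long spins = 0;
    while (!over_budget) spins++;
    fprintf(stderr, "soft limit hit after %lu spins; exiting gracefully\n", spins);
    return 0;
}
```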

A performant and extensible Web Server with Zig and Python

The further we go from the clean abstraction of short-lived processes, the more OS-level work we would need to take care of ourselves. But there are also benefits to be gained - such as making use of io_uring for batching IO between many threads of execution. In fact, if a large task is made up of sub-tasks - do we really care about their individual resource utilization? Only for profiling. But if for the large task we could manage (cut off) the tails of the resource bell curve, that would be good enough. So, we could spawn as many processes as the requests we wish to handle simultaneously, have them be long-lived, and simply readjust the ulimit for each new request. Thus, when a request overruns its resource constraints, it gets an OS signal and is able to terminate gracefully, without affecting other requests. Or, if the high resource usage is intentional, we could tell the client to pay for a higher resource quota. Sounds pretty good to me.
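
A hedged sketch of the "readjust the ulimit per request" idea using prlimit(2), which lets a supervisor retarget a live worker's limits; set_request_budget is a hypothetical helper name. One wrinkle worth a comment: RLIMIT_CPU counts cumulative CPU time, so each new budget extends the allowance granted so far rather than resetting a counter. Adjusting another process requires CAP_SYS_RESOURCE; the demo below lowers its own limits, which needs no privilege.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Hypothetical helper: give a long-lived worker a fresh budget for the
   request it is about to pick up. pid 0 means "the calling process". */
static int set_request_budget(pid_t worker, rlim_t cpu_secs, rlim_t mem_bytes) {
    struct rlimit old;
    if (prlimit(worker, RLIMIT_CPU, NULL, &old) != 0) return -1;

    /* RLIMIT_CPU is cumulative, so extend the previously granted
       allowance by this request's budget (fresh workers start at
       RLIM_INFINITY, which we treat as "no allowance granted yet"). */
    rlim_t base = (old.rlim_cur == RLIM_INFINITY) ? 0 : old.rlim_cur;
    struct rlimit cpu = {
        .rlim_cur = base + cpu_secs,     /* soft: SIGXCPU */
        .rlim_max = base + cpu_secs + 1, /* hard: SIGKILL */
    };
    struct rlimit mem = { .rlim_cur = mem_bytes, .rlim_max = mem_bytes };

    if (prlimit(worker, RLIMIT_CPU, &cpu, NULL) != 0) return -1;
    return prlimit(worker, RLIMIT_AS, &mem, NULL);
}

int main(void) {
    /* Demo on ourselves: 2 s of CPU and 256 MiB of address space. */
    if (set_request_budget(0, 2, 256UL * 1024 * 1024) != 0) {
        perror("prlimit");
        return 1;
    }
    puts("per-request budget set");
    return 0;
}
```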

But the performance will still suffer, compared to a coroutine-per-request approach. First, copying around the process memory table is expensive. Because the table contains references to memory pages, we could make use of hugepages, thus limiting the size of data to be copied. This is only directly possible with low-level languages, such as Zig. Additionally, the OS level multitasking is preemptive and not cooperative, which will always be less efficient. Or is it?
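
The hugepage point above is easy to sketch: one 2 MiB huge page replaces 512 ordinary 4 KiB page-table entries, so there is simply less table state for fork/clone to copy. A hedged C sketch, assuming hugepages have been reserved beforehand:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    /* Requires reserved hugepages, e.g. (as root):
       echo 64 > /proc/sys/vm/nr_hugepages
       The length must be a multiple of the huge page size (2 MiB here). */
    size_t len = 2 * 1024 * 1024;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

    /* ... per-request arena memory would live here ... */

    munmap(p, len);
    return 0;
}
```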

Cooperative multitasking with Linux

There is the syscall sched_yield, which allows the thread to relinquish the CPU when it has completed its portion of work. Seems quite cooperative. Could there be a way to request a time slice of a given size as well? Actually, there is - with the scheduling policy SCHED_DEADLINE. This is a realtime policy, which means that for the requested CPU time slice, the thread runs uninterrupted. But if the slice is overrun - preemption kicks in, and your thread is swapped out and deprioritized. And if the slice is underrun - the thread can call sched_yield to signal an early finish, allowing other threads to run. That looks like the best of both worlds - a cooperative and preemptive model.
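
A hedged sketch of opting into SCHED_DEADLINE. glibc has historically shipped no sched_setattr wrapper, so the attribute block is declared locally (named dl_attr here to avoid colliding with any libc definition; layout per the sched_setattr(2) man page) and passed through syscall(2). The 2 ms / 10 ms numbers are arbitrary.

```c
#define _GNU_SOURCE
#include <sched.h>        /* sched_yield */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif
#ifndef SYS_sched_setattr
#define SYS_sched_setattr 314 /* x86-64; other arches differ */
#endif

struct dl_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* ns of CPU guaranteed per period */
    uint64_t sched_deadline;  /* ns by which that runtime must be delivered */
    uint64_t sched_period;    /* ns between activations */
};

int main(void) {
    struct dl_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size           = sizeof attr;
    attr.sched_policy   = SCHED_DEADLINE;
    attr.sched_runtime  =  2 * 1000 * 1000;  /*  2 ms slice  */
    attr.sched_deadline = 10 * 1000 * 1000;  /* within 10 ms */
    attr.sched_period   = 10 * 1000 * 1000;  /* every 10 ms  */

    /* Needs CAP_SYS_NICE; try it as root. pid 0 = this thread. */
    if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
        perror("sched_setattr");
        return 1;
    }

    /* ... up to 2 ms of real work would go here ... */

    /* Finished early: hand the rest of the slice back until the next period. */
    sched_yield();
    return 0;
}
```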

One limitation is that a SCHED_DEADLINE thread cannot fork. This leaves us with two models of concurrency - either a process per request, which sets a deadline for itself and runs an event loop for efficient IO, or a process that from the start spawns a thread per micro-task, each of which sets its own deadline, with the threads communicating via queues. The former is more straightforward, but requires an event loop in userspace; the latter makes more use of the kernel.
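
A skeletal sketch of the first model, with hypothetical stubs, just to pin down the ordering constraint: since a SCHED_DEADLINE task may not fork, the per-request process is forked first and opts into the policy itself.

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stubs for the pieces sketched earlier. */
static void set_deadline_policy(void) { /* sched_setattr(SCHED_DEADLINE, ...) */ }
static void handle_request(void)      { /* event loop: epoll or io_uring */ }

int main(void) {
    /* SCHED_DEADLINE tasks cannot fork, so the order is fixed:
       fork the per-request process first, set the policy in the child. */
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }
    if (pid == 0) {
        set_deadline_policy();
        handle_request();
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}
```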

Both strategies achieve the same goal as the coroutine model - by cooperating with the kernel, application tasks can run with minimal interruption.

Python as an embedded scripting language

All of the above concerns the high-performance, low-latency, low-level side of things, which is where Zig shines. But when it comes to the actual business of the application, flexibility is more valuable than latency. If a process involves real people signing documents, the latency of the computer is negligible. Also, despite the performance penalty, object-oriented languages give the developer better primitives for modeling the business domain. At the furthest end of this, systems like Flowable and Camunda let managerial and operations staff program the business logic with greater flexibility and a lower barrier to entry. Languages like Zig will not help here - they only stand in your way.

Python, on the other hand, is one of the most dynamic languages there is. Classes, objects - under the hood they are all dictionaries, and can be manipulated at runtime however you like. This costs performance, but makes it practical to model the business with classes, objects, and many clever tricks. Zig is the opposite - it deliberately keeps clever tricks to a minimum, offering you maximal control. Can we combine their powers by making them interoperate?

Indeed we can, as both support the C ABI. We can have the Python interpreter run inside the Zig process, rather than as a separate process, cutting down the overhead in runtime cost and glue code. This further allows us to use Zig's custom allocators from Python - setting up an arena that handles a single request, thereby reducing, if not eliminating, the overhead of the garbage collector, and putting a cap on memory. One major limitation would be the CPython runtime spawning threads for garbage collection and IO, but I have not found any evidence that it does so. By using the "context" field in AbstractMemoryLoop, we could hook Python into a custom event loop in Zig, with per-coroutine memory tracking. The possibilities are endless.
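
The allocator-hook half of this is easy to sketch against the documented CPython embedding API: PyMem_SetAllocator lets the host route every interpreter allocation through its own allocator - a real arena in Zig, but here just a counting passthrough in C to keep the sketch self-contained. Build with something like cc embed.c $(python3-config --embed --cflags --ldflags).

```c
#include <Python.h>

/* Stand-in for a per-request arena: a passthrough to malloc that counts
   requested bytes. (Not thread-safe; a real arena would live behind ctx.) */
static size_t requested = 0;

/* CPython requires its allocators to return a distinct non-NULL pointer
   even for zero-byte requests, hence the "n ? n : 1". */
static void *hook_malloc(void *ctx, size_t n) {
    (void)ctx; requested += n; return malloc(n ? n : 1);
}
static void *hook_calloc(void *ctx, size_t m, size_t n) {
    (void)ctx; requested += m * n; return calloc(m ? m : 1, n ? n : 1);
}
static void *hook_realloc(void *ctx, void *p, size_t n) {
    (void)ctx; requested += n; return realloc(p, n ? n : 1);
}
static void hook_free(void *ctx, void *p) { (void)ctx; free(p); }

int main(void) {
    PyMemAllocatorEx alloc = {
        .ctx = NULL,
        .malloc  = hook_malloc,  .calloc = hook_calloc,
        .realloc = hook_realloc, .free   = hook_free,
    };

    /* Route all three allocator domains through the hooks - this must
       happen before the interpreter starts. */
    PyMem_SetAllocator(PYMEM_DOMAIN_RAW, &alloc);
    PyMem_SetAllocator(PYMEM_DOMAIN_MEM, &alloc);
    PyMem_SetAllocator(PYMEM_DOMAIN_OBJ, &alloc);

    Py_Initialize();
    PyRun_SimpleString("print('hello from embedded CPython')");
    Py_FinalizeEx();

    printf("interpreter requested ~%zu bytes\n", requested);
    return 0;
}
```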

Conclusion

We discussed concurrency, parallelism, and the merits of various forms of integration with the OS kernel. The exploration lacks benchmarks, and the code sketches are untested - I hope it makes up for that with the quality of the ideas offered. Have you tried anything similar? What are your thoughts? Feedback is welcome :)

Further reading

  • https://linux.die.net/man/2/clone
  • https://man7.org/linux/man-pages/man7/sched.7.html
  • https://man7.org/linux/man-pages/man2/sched_yield.2.html
  • https://rigtorp.se/low-latency-guide/
  • https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/
  • https://hadar.gr/2017/lightweight-goroutines

Source: dev.to