What is Linux swap
Linux swap refers to the Linux swap space, an area on disk that can be a partition, a file, or a combination of the two. Swap is similar to virtual memory on Windows: when physical memory runs short, part of the hard disk is used as if it were memory to make up for the shortfall in memory capacity.
The operating environment of this tutorial: Linux 5.9.8, Dell G3 computer.
Linux swap
A Linux swap partition (swap), also called swap space, is an area on disk; it can be a partition, a file, or a combination of the two.
Swap serves the same role as "virtual memory" on Windows: when physical memory is insufficient, part of the hard disk is used as a swap area (treated as if it were memory) to compensate for the lack of memory capacity.
SWAP means exchange. As the name suggests, when a process requests memory from the OS and the OS finds that memory is insufficient, the OS swaps temporarily unused data out of memory and places it in the swap area; this is called swap-out. When a process later needs that data and the OS finds there is free physical memory, it swaps the data from the swap area back into physical memory; this is called swap-in.
Of course, swap size has an upper limit. Once swap is exhausted, the operating system triggers the OOM killer, which kills the process consuming the most memory in order to free memory.
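As a quick way to see whether swap is configured and whether the system is actively swapping, commands like the following can be used (a minimal sketch; output and device names will vary from system to system):
swapon --show    # list configured swap areas and their usage
free -h          # overall memory and swap usage
vmstat 1 5       # the si/so columns show swap-in/swap-out per second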
Why do database systems dislike swap?
Obviously, the original intention of the swap mechanism is to soften the blow of having processes killed outright by the OOM killer the moment physical memory is exhausted. But frankly, almost no database likes swap, whether it is MySQL, Oracle, MongoDB or HBase. Why? This is mainly related to the following two aspects:
1. Database systems are generally sensitive to response latency. If swap is used in place of memory, database performance is bound to be unacceptable. For a system that is extremely latency-sensitive, excessive delay is no different from the service being unavailable. What is worse than the service being unavailable is that under swap the process does not die, which means the system stays unavailable indefinitely... Think about it again: isn't going straight to OOM, without swap, actually the better choice? In that case most high-availability setups simply fail over from master to slave, and users are barely aware of it.
2. In addition, for a distributed system such as HBase, what we really worry about is not a node going down but a node getting stuck. If a node goes down, at most a small number of requests fail temporarily and can be recovered by retrying. But if a node is stuck, all distributed requests routed to it hang, server-side thread resources are tied up, and requests across the whole cluster pile up, which can even bring the cluster down.
Looking at it from these two angles, it makes sense that no database likes swap!
The working mechanism of swap
Since databases have no interest in swap, should we simply use the swapoff command to turn the swap feature off altogether? No. Think about what turning swap off completely would mean: no system in a real production environment would be that radical. The world is never just 0 or 1; everyone ends up somewhere in between, merely leaning toward one side or the other. Clearly, when it comes to swap, a database should choose to use it as little as possible. Several requirements in the HBase official documentation essentially implement exactly this policy: reduce the impact of swap as much as possible. Know yourself and know your enemy, and you can win every battle; to reduce the impact of swap you must understand how Linux memory reclaim works, so that no suspicious detail is overlooked.
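For reference, turning swap off and on entirely looks like the following (a sketch only; running swapoff on a production machine requires enough free RAM to absorb everything currently swapped out):
swapoff -a       # disable all swap areas listed in /etc/fstab
swapon -a        # re-enable them
swapon --show    # confirm the current state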
First, let's look at how swap is triggered.
Simply put, Linux triggers memory reclaim in two scenarios. One is during memory allocation: if not enough free memory is found, reclaim is triggered immediately. The other is a daemon process (the kswapd process) that periodically checks system memory and actively triggers reclaim once available memory falls below a specific threshold. There is not much to say about the first scenario; let's focus on the second one, as shown in the figure below:
This brings up the first parameter we care about: vm.min_free_kbytes. It represents the minimum reserve of free memory, watermark[min], kept by the system, and it also determines watermark[low] and watermark[high]. Roughly:
watermark[min] = min_free_kbytes
watermark[low] = watermark[min] * 5 / 4 = min_free_kbytes * 5 / 4
watermark[high] = watermark[min] * 3 / 2 = min_free_kbytes * 3 / 2
watermark[high] - watermark[low] = watermark[low] - watermark[min] = min_free_kbytes / 4
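These values can be checked directly on a running system (a sketch; the zoneinfo figures are per zone and expressed in 4KB pages, so they will not match min_free_kbytes one-to-one):
cat /proc/sys/vm/min_free_kbytes                                        # the reserve, in KB
awk '/^Node/ || $1=="min" || $1=="low" || $1=="high"' /proc/zoneinfo    # per-zone watermarks, in pages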
As you can see, these Linux watermarks are all tied to the parameter min_free_kbytes. Its importance to the system is self-evident: it can be neither too large nor too small.
If min_free_kbytes is too small, the buffer between watermark[min] and watermark[low] is very narrow. While kswapd is reclaiming, if the upper layer requests memory too fast (a database is the typical case), free memory can easily drop below watermark[min]. At that point the kernel performs direct reclaim: it reclaims memory directly in the context of the requesting process and only then uses the freed pages to satisfy the request, so the application is blocked and some response latency is introduced. Of course, min_free_kbytes should not be too large either: too large a value both shrinks the memory available to application processes, wasting system memory, and makes the kswapd process spend a great deal of time on reclaim. Look at this process again: isn't it similar to the old-generation collection trigger in Java's CMS garbage collector? Think of the parameter -XX:CMSInitiatingOccupancyFraction. The official documentation requires min_free_kbytes to be no less than 1G (8G on large-memory systems), that is, do not trigger direct reclaim lightly.
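A sketch of how that recommendation might be applied with sysctl (1G expressed in KB; the exact value should follow your own platform's documentation):
sysctl vm.min_free_kbytes                                  # current value, in KB
sysctl -w vm.min_free_kbytes=1048576                       # raise the reserve to 1GB, effective immediately
echo 'vm.min_free_kbytes = 1048576' >> /etc/sysctl.conf    # persist across reboots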
So far we have basically covered the Linux memory reclaim trigger mechanism and the first parameter we care about, vm.min_free_kbytes. Next, let's take a brief look at what Linux memory reclaim actually reclaims. The reclaim targets fall into two main types:
1. File cache. This is easy to understand: to avoid reading file data from disk every time, the system keeps hot data in memory to improve performance. If a file has only been read, reclaiming the memory simply means releasing it; the next time the data is needed it is read from disk again (similar to HBase's file cache). If the cached file data has also been modified (dirty data), it must be written back to disk before the memory can be released (similar to MySQL's file cache).
2. Anonymous memory. This part of memory has no backing store, unlike the file cache which is backed by files on disk; typical examples are heap and stack data. Such memory can neither be released directly nor written back to a file-like medium during reclaim. This is exactly why the swap mechanism was invented: swap this kind of memory out to disk and load it back in when it is needed.
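The two categories can be observed in /proc/meminfo (a sketch; the field meanings below are summarised, not exhaustive):
# Cached/Dirty cover the file cache, AnonPages covers anonymous memory, SwapTotal/SwapFree show swap usage
grep -E '^(Cached|Dirty|AnonPages|SwapTotal|SwapFree):' /proc/meminfo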
The specific algorithm Linux uses to decide which file cache pages or anonymous pages to reclaim is not our concern here; refer elsewhere if you are interested. But one question is worth thinking about: given that both types of memory can be reclaimed, how does Linux decide which one to reclaim, or does it reclaim both? This brings us to the second parameter we care about: swappiness. This value defines how aggressively the kernel uses swap: the higher the value, the more actively the kernel swaps; the lower the value, the less actively it does. The range is 0 to 100 and the default is 60. How does swappiness achieve this? The underlying details are complicated, but simply put, swappiness works by controlling whether memory reclaim takes more from anonymous pages or more from the file cache. swappiness = 100 means anonymous memory and file cache are reclaimed with equal priority; the default of 60 means the file cache is reclaimed first. As for why the file cache should be reclaimed first, think about it: reclaiming the file cache usually causes no IO and has little impact on system performance. For databases, swap should be avoided as much as possible, so swappiness needs to be set to 0. Note that setting it to 0 does not mean swap will never happen!
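A sketch of applying that setting (again, 0 only biases reclaim toward the file cache; it does not disable swap):
sysctl vm.swappiness                           # default is usually 60
sysctl -w vm.swappiness=0                      # prefer reclaiming file cache over swapping anonymous pages
echo 'vm.swappiness = 0' >> /etc/sysctl.conf   # persist across reboots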
So far we have talked about the Linux memory reclaim trigger mechanism, the reclaim targets, and swap, and explained the parameters min_free_kbytes and swappiness. Next, let's look at another swap-related parameter: zone_reclaim_mode. The documentation says that setting this parameter to 0 turns off NUMA zone reclaim. What is that about? Whenever NUMA comes up, databases are unhappy again, and many DBAs have been burned by it. So here are three small questions: What is NUMA? What does NUMA have to do with swap? What exactly does zone_reclaim_mode mean?
NUMA (Non-Uniform Memory Access) is defined in contrast to UMA; both are CPU architectures. Early CPUs were designed with a UMA structure, as shown in the following figure (picture from the Internet):
To alleviate the channel bottleneck that multi-core CPUs hit when accessing the same memory, chip engineers designed the NUMA structure, as shown in the following figure (picture from the Internet):
This architecture solves the UMA problem nicely: each CPU has its own exclusive memory region. To achieve this "memory isolation" between CPUs, two points of support are also needed at the software level:
1. Memory allocation should happen in the memory region local to the CPU on which the requesting thread is currently running. If memory is allocated from another CPU's local region, isolation is inevitably affected to some extent, and memory access across the interconnect is necessarily somewhat slower.
2. In addition, once local memory (the exclusive region) runs short, pages in local memory are evicted first, rather than checking whether a remote memory region has free memory to borrow.
Implemented this way, isolation is indeed better, but a problem follows: NUMA can make memory usage unbalanced across CPUs. The local memory of some CPUs may run short and need frequent reclaim, which can lead to heavy swapping and severe jitter in response latency, while at the same time the local memory of other CPUs may sit largely idle. This produces a strange phenomenon: the free command shows the system still has free physical memory, yet the system keeps swapping and the performance of some applications drops sharply. See Ye Jinrong's MySQL case study: "Finding the culprit behind SWAP on a MySQL server".
So for applications with a small memory footprint, this NUMA problem is not prominent; on the contrary, the performance gain from local memory is considerable. But for memory-hungry applications such as databases, the stability risk introduced by the default NUMA policy is unacceptable. Databases therefore strongly demand improvements to the default NUMA policy, and there are two areas where improvements can be made:
1. Change the memory allocation policy from the default affinity (local) mode to interleave mode, which spreads memory pages across the different CPU zones. This addresses the potentially uneven memory distribution and, to a degree, alleviates the odd behaviour in the case above. MongoDB, for example, prompts at startup that the interleave allocation policy should be used:
WARNING: You are running on a NUMA machine. We suggest launching mongod like this to avoid performance problems: numactl --interleave=all mongod [other options]
2. Improve the memory reclaim policy. This brings in today's third key parameter: zone_reclaim_mode. This parameter defines the memory reclaim policy under the NUMA architecture and can take the values 0/1/3/4: 0 means that when local memory is insufficient, memory may be allocated from other memory regions; 1 means local memory is reclaimed first before allocating; 3 means local reclaim should release file cache objects first whenever possible; 4 means local reclaim prefers to use swap to reclaim anonymous memory. Clearly, the HBase-recommended setting zone_reclaim_mode=0 reduces the probability of swapping to some extent.
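A minimal sketch of inspecting the NUMA topology, disabling zone reclaim, and launching a process with interleaved allocation (the mongod invocation simply mirrors the warning quoted above):
numactl --hardware                                  # show NUMA nodes and per-node free memory
cat /proc/sys/vm/zone_reclaim_mode                  # current reclaim policy
sysctl -w vm.zone_reclaim_mode=0                    # allow allocation from remote nodes instead of local reclaim
numactl --interleave=all mongod [other options]     # interleave allocations across all nodes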
It's not all about swap
So far we have examined three swap-related system parameters and interpreted them in depth around Linux memory allocation, swap, and NUMA. Beyond these, two more parameters are particularly important for database systems:
1. IO scheduling policy: there are many explanations of this topic online, so I will not go into detail here and only give the conclusion. For an OLTP database on SATA disks, the deadline scheduler is usually the best choice (a sketch of switching the scheduler is shown right below).
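A minimal sketch of checking and switching the scheduler (sda is only an example device; on newer multi-queue kernels the equivalent choice is mq-deadline):
cat /sys/block/sda/queue/scheduler                  # the active scheduler is shown in brackets
echo deadline > /sys/block/sda/queue/scheduler      # switch at runtime (root required)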
2. Disable the THP (transparent huge pages) feature. THP puzzled me for a long time, mainly on two points: first, whether THP and HugePage are the same thing; second, why HBase requires THP to be turned off. After going back and forth through the related documents, I finally found some clues. THP is explained here in four small points:
(1) What is HugePage?
There are many explanations of HugePage online that you can look up. Simply put, the computer addresses memory through a mapping table (a memory index table, i.e. the page table); currently the system treats 4KB as one page, the smallest unit of memory addressing. As memory keeps getting larger, the page table keeps growing with it. On a machine with 256G of memory, using 4KB small pages, the page table alone takes about 4G. Bear in mind that this table must be kept in memory, and moreover cached close to the CPU; if it is too large, a great number of misses occur and memory addressing performance degrades.
HugePage was created to solve this problem. HugePage manages memory with 2MB large pages instead of the traditional small pages, so the page table can be kept very small, held entirely in the CPU cache, and misses are prevented.
(2) What is THP (Transparent Huge Pages)?
HugePage is the concept of large pages; how is the HugePage feature actually used? The system currently offers two ways: one is called Static Huge Pages, the other is Transparent Huge Pages. As the name implies, the former is a static management policy: the user manually configures the number of huge pages according to the system memory size, the corresponding number of huge pages is created at system startup, and it does not change afterwards. Transparent Huge Pages, by contrast, is a dynamic management policy: it allocates huge pages to applications at run time and manages them, completely transparently to the user, with no configuration required. In addition, THP currently applies only to anonymous memory regions.
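For reference, reserving static huge pages might look like the following (a sketch; the count of 512 pages, i.e. 1GB with 2MB pages, is purely illustrative):
sysctl -w vm.nr_hugepages=512        # reserve 512 static 2MB huge pages
grep HugePages /proc/meminfo         # verify HugePages_Total / HugePages_Free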
(3) Why do HBase (and databases in general) require THP to be turned off?
THP is a dynamic management policy that allocates and manages large pages at run time, so it introduces a certain allocation delay, which is unacceptable for database systems that care about response latency. Besides, THP has many other drawbacks; see the article "why-tokudb-hates-transparent-hugepages".
(4) How much impact does turning off/on THP have on HBase's read and write performance?
To verify how much impact turning THP on or off has on HBase performance, I did a simple test in the test environment: the test cluster has only one RegionServer, and the test load has a read-write ratio of 1:1. THP has the options always and never on some systems, with an additional option called madvise on others. THP can be turned on or off with the command echo never/always > /sys/kernel/mm/transparent_hugepage/enabled. The test results are shown in the figure below:
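Expanded slightly, checking and disabling THP on a running system might look like this (a sketch; many distributions also expose a defrag knob alongside enabled, and a kernel boot parameter is needed to make the change permanent):
cat /sys/kernel/mm/transparent_hugepage/enabled     # active policy shown in brackets, e.g. [always]
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag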
As the figure above shows, HBase performs best and most stably with THP turned off (never). With THP turned on (always), performance drops by about 30% compared with the off scenario, and the curve jitters heavily. So remember to turn THP off on HBase machines in production.