Detailed explanation of the Linux load average

This article explains the Linux load average in detail: what it measures, how it relates to CPU utilization, and how to track down the cause when it is high.

In an interview, I was once asked: CPU usage is not high, but the load average is very high. How would you track down the problem?

At the time I did not understand what load meant. The interviewer explained that it reflects a larger number of processes in an uninterruptible state. Drawing on my back-end development experience, I answered that the system probably had a lot of blocking IO, most likely network IO, and that I would use netstat -tnp to check whether many TCP connections were in the TIME_WAIT state...

I knew my answer was one-sided, so afterwards I reviewed the topic and took these notes.

What is load average

Anyone familiar with Linux knows that the top and uptime commands show the load average.

man uptime explains the load average as follows:

System load averages is the average number of processes that are either in a runnable or uninterruptable state. A process in a runnable state is either using the CPU or waiting to use the CPU. A process in uninterruptable state is waiting for some I/O access, eg waiting for disk. The averages are taken over the three time intervals. Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system is loaded all the time while on a 4 CPU system it means it was idle 75% of the time.
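The last sentence of the man page can be made concrete with a small sketch. It parses the content of /proc/loadavg (the Linux file that holds the three averages) and divides each value by the CPU count. The function names parse_loadavg, per_cpu_load and read_loadavg are made up for this illustration and are not part of any tool.

```python
import os


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg content."""
    fields = text.split()
    return tuple(float(value) for value in fields[:3])


def per_cpu_load(load, cpus):
    """Divide a load average by the CPU count (the normalization uptime does not do)."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return load / cpus


def read_loadavg(path="/proc/loadavg"):
    """Read the current load averages (Linux only, assumes procfs is mounted)."""
    with open(path) as handle:
        return parse_loadavg(handle.read())
```

On a Linux machine, per_cpu_load(read_loadavg()[0], os.cpu_count() or 1) gives the 1 minute load per CPU. A load of 1 on 4 CPUs yields 0.25, which matches the man page's "idle 75% of the time".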

The key point: the load average is the average number of processes that are in the runnable or uninterruptible state over a period of time, in other words the average number of active processes. Note that it has no direct relationship with CPU usage.

Use ps aux to view each process's state in the STAT column. The two states that count toward the load average are:

R state (Running / Runnable): the process is using the CPU or waiting for the CPU.

D state (Uninterruptible Sleep, also known as Disk Sleep): the process is in a critical section of kernel code and cannot be interrupted.

Why can't the D state be interrupted? Take a system call that waits for an I/O response from a hardware device: to keep the data consistent, the process must not be interrupted by other processes or by interrupts until the disk returns the data. If it were interrupted, the data on disk and the data held by the process could easily become inconsistent. The uninterruptible (D) state is therefore a mechanism by which the system protects processes and hardware devices.
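To see how many processes are in these two states at a given moment, the STAT values can be tallied. The sketch below assumes its input is the text printed by ps -eo stat (one STAT value per line under a STAT header); count_states and active_processes are hypothetical helper names. The result is a single snapshot, not the averaged value that uptime reports.

```python
from collections import Counter


def count_states(stat_column):
    """Count processes by the first letter of their ps STAT value."""
    counts = Counter()
    for line in stat_column.splitlines():
        stat = line.strip()
        if not stat or stat == "STAT":  # skip blank lines and the header
            continue
        counts[stat[0]] += 1  # "R+" and "R" both count as R
    return counts


def active_processes(counts):
    """Runnable (R) plus uninterruptible (D) processes in the snapshot."""
    return counts["R"] + counts["D"]
```

Only the first letter is used because ps appends modifier characters such as + or < to the state letter.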

Strictly speaking, the average number of active processes is an exponentially decaying average of the number of active processes (exponential decay means that a quantity declines at a rate proportional to its current value). In practice, it can be understood as the average number of active processes per unit time.
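The exponentially decaying average can be sketched in a few lines. This is a floating-point illustration only: it assumes the active-process count is sampled every 5 seconds and uses the decay factor exp(-interval / window). The names update_load and simulate are invented for the example, and the real kernel code uses fixed-point arithmetic.

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between samples (an assumption of this sketch)


def update_load(load, active, window):
    """Blend the previous load with the current active-process count."""
    decay = math.exp(-SAMPLE_INTERVAL / window)
    return load * decay + active * (1.0 - decay)


def simulate(active_counts, window=60.0, start=0.0):
    """Feed a sequence of sampled active-process counts through the average."""
    load = start
    for active in active_counts:
        load = update_load(load, active, window)
    return load
```

With one process constantly active, simulate([1] * 12) covers one minute of samples and returns 1 - 1/e, about 0.63. This is why the 1 minute load average does not jump to 1 immediately, and why the 15 minute value (window=900.0) reacts even more slowly.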

CPU utilization and load average

From the CPU's point of view, the load average only reflects how many processes are using or waiting for the CPU per unit time; CPU utilization is not directly related to the number of processes. Use top or vmstat to check CPU utilization, which is broken down into the following indicators:

%us: CPU time used by user-space programs (not scheduled through nice)

%sy: CPU time used in system (kernel) space, mainly by kernel code

%ni: CPU time used by user-space programs scheduled through nice

%id: CPU idle time

%wa: time the CPU spends idle while waiting for IO

%hi: CPU time spent handling hardware interrupts

%si: CPU time spent handling software interrupts

%st: CPU time stolen from this virtual machine by the hypervisor
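These percentages can be derived from two samples of the aggregate cpu line in /proc/stat, whose first eight counters are user, nice, system, idle, iowait, irq, softirq and steal ticks. The sketch below is a simplified illustration, not how top is implemented; parse_cpu_line and cpu_percentages are hypothetical names, and the short keys simply mirror the labels listed above.

```python
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # /proc/stat order: user nice system idle iowait irq softirq steal.
    # Guest counters that may follow are ignored (already included in user/nice).
    return dict(zip(FIELDS, (int(value) for value in parts[1:9])))


def cpu_percentages(before, after):
    """Turn the tick difference between two samples into percentages."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("the second sample must be taken later than the first")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}
```

The counters are cumulative since boot, so a single sample only gives a long-term average. The difference between two samples a few seconds apart is what reflects current utilization.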

How to measure a reasonable average load

Generally speaking, if the load average is lower than the number of CPUs, the machine has enough capacity for the service. A load average above the number of CPUs is not necessarily a problem either: the load average does not directly represent CPU utilization, and a high value may simply come from more IO blocking. As a rule of thumb, once the load average exceeds 70% of the number of CPUs, processes may start to respond slowly, which affects the normal operation of the service.
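The rule of thumb above can be written as a small check. The function name assess_load and its return labels are invented for this sketch, and the 70% threshold is the article's guideline rather than a hard limit.

```python
def assess_load(load, cpus, warn_ratio=0.7):
    """Classify a load average relative to the CPU count."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    ratio = load / cpus
    if ratio > 1.0:
        return "exceeds-cpus"  # worth a look, though possibly just IO blocking
    if ratio > warn_ratio:
        return "elevated"      # above the 70% guideline
    return "ok"
```

For example, on a 4 CPU machine a load of 3.0 is "elevated" (75% of the CPU count), while a load of 2.0 is "ok".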

From the perspective of historical changes

top and uptime report the load average over three time windows: the last 1 minute, 5 minutes, and 15 minutes. Together they show the recent trend of the system. In a real production environment, we also need long-term monitoring records. If the values change abnormally, for example the load average reaches twice the number of CPUs, the problem needs to be analyzed and investigated.

Combined analysis of the two indicators

Combining the load average with CPU utilization gives the following possible cases:
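Comparing the 1 minute and 15 minute values is a quick way to read the trend. The sketch below is one possible heuristic; load_trend and its 10% tolerance are assumptions made for the example.

```python
def load_trend(one, fifteen, tolerance=0.1):
    """Compare the 1 minute and 15 minute load averages."""
    margin = tolerance * max(one, fifteen)
    if one - fifteen > margin:
        return "rising"   # recent load is clearly above the longer-term average
    if fifteen - one > margin:
        return "falling"  # the load is decreasing
    return "steady"
```

A result of "rising" suggests the problem is recent or still developing, while "falling" suggests a spike that is already subsiding.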

High load average, high CPU use: either CPU-intensive processes (threads) are running, or a large number of processes (threads) are waiting to be scheduled on the CPU.

High load average, low CPU use: IO-intensive processes are running.

Low load average, low CPU use: the normal case.

Low load average, high CPU use: this combination does not occur, because processes that keep the CPUs busy are themselves counted in the load average.
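The four cases can be expressed as a simple decision function. The name diagnose and both thresholds (a per-CPU load above 1.0 and CPU busy time above 70%) are assumptions chosen for illustration; what counts as "high" depends on the system.

```python
def diagnose(load_per_cpu, cpu_busy_pct, load_high=1.0, cpu_high=70.0):
    """Map a (load average, CPU utilization) pair to the likely cause."""
    load_is_high = load_per_cpu > load_high
    cpu_is_high = cpu_busy_pct > cpu_high
    if load_is_high and cpu_is_high:
        return "CPU-bound work, or many processes waiting for the CPU"
    if load_is_high:
        return "IO-bound work (processes blocked in uninterruptible sleep)"
    if cpu_is_high:
        return "unexpected: busy CPUs should also raise the load average"
    return "normal"
```

Here cpu_busy_pct is meant as 100 minus the idle and IO-wait percentages, and load_per_cpu is the load average divided by the CPU count.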

Simulation cases and tools

How do we analyze the different combinations of these two indicators, load average and CPU utilization, and find the source of the change?

The environment used below is Arch Linux with kernel 4.19, 4 CPUs, and 8 GB of memory.

Tool list

stress: a system stress-testing tool

sysstat: a package of performance analysis tools, including:

mpstat: a performance analysis tool for multi-core CPUs; mp stands for multi-processor

pidstat: a per-process performance analysis tool; pid stands for process ID. It shows the CPU, memory, I/O, and context-switch indicators of a process

Simulation scenarios

stress can simulate the following scenarios:

CPU-intensive processes

# Simulate one process at 100% CPU usage, for 600 s
stress --cpu 1 --timeout 600


IO-intensive processes

The -i option of stress spawns N workers spinning on sync().

# Simulate one process that calls sync() continuously, for 600 s
stress -i 1 --timeout 600


Large number of processes

# Simulate 16 processes, each at 100% CPU usage, for 600 s
stress --cpu 16 --timeout 600


Tool indicators

mpstat -P ALL 5 monitors all CPUs and prints a set of data every 5 seconds. Watch %usr (user CPU usage) and %iowait (time spent waiting for IO) to determine whether the workload is CPU-intensive or IO-intensive.

pidstat -u 5 1 prints one report, covering a 5-second interval, for processes that used the CPU. Watch %usr (user CPU usage) and %wait (time spent waiting for the CPU) to determine whether there are too many processes (threads).
