The key to operational stability issues – availability-Linux-php.cn

WBOY

Release： 2024-03-27 18:11:20

A post-incident review is mostly about summarizing and improving after the fact. So how do we find and measure stability problems in the first place? That brings us to today's topic: availability.

What is availability?

Availability is an important indicator for evaluating business stability. By quantifying service health as data and establishing baselines, it exposes recurring problems in the business, so that service quality can be improved in a more targeted way.

So, what is availability? Availability is the proportion of a specified time interval during which a functional unit is available. In other words, it is the probability, or the proportion of time, that the system operates normally within a specified period. Most Internet businesses today are "real-time" and "online", that is, real-time online systems. For most of these businesses, the specified period mentioned above should be 7×24 hours.

Availability is usually expressed as a decimal or a percentage. A common shorthand is the "number of nines", which is the count of consecutive nines immediately after the decimal point. For example, "five nines" means that the system has 0.99999 (or 99.999%) availability within the specified period.
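To make the number of nines more tangible, here is a minimal Python sketch that converts it into a downtime budget. The function name `allowed_downtime_minutes` and the 365-day year are assumptions made for illustration, not part of the original article.

```python
def allowed_downtime_minutes(nines: int, period_minutes: float) -> float:
    """Downtime budget for an availability made of `nines` consecutive nines.

    Hypothetical helper: 3 nines means availability 0.999, so the
    unavailable share of the period is 10**-3 = 0.001.
    """
    unavailability = 10 ** (-nines)
    return period_minutes * unavailability


# Assumption: a 365-day year, ignoring leap years.
MINUTES_PER_YEAR = 365 * 24 * 60

for n in range(1, 6):
    budget = allowed_downtime_minutes(n, MINUTES_PER_YEAR)
    print(f"{n} nine(s): {budget:,.1f} minutes of downtime per year")
```

Under these assumptions, five nines leaves roughly 5.26 minutes of downtime per year, while two nines leaves more than 3.6 days.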

How to understand the corresponding magnitude?

For example, take a specified period of 1 day, that is, 24 hours. Our monitoring granularity is one minute, so the period contains 1440 minutes. If the system ran normally for 1430 of the 1440 minutes we monitored, then its availability for this period is 1430/1440 ≈ 0.99306 (99.306%). That is what we usually call two nines.

The value 99.306% is the proportion of time the system spent in the normal, available state. The value 0.694%, obtained from 1 − 99.306%, is the proportion of time it spent in the abnormal, unavailable state, in which it could not provide service. Written as a formula:

The total time the business is online = the normal availability time of the business + the abnormal unavailability time of the business

Going one step further, availability means:

Availability = normal availability time of the business / total online time of the business
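
The formula and the one-day example can be checked with a short Python sketch. The helper names `availability` and `count_nines` are hypothetical, and counting nines through a base-10 logarithm is one possible approach, chosen here for brevity.

```python
import math


def availability(up_minutes: float, total_minutes: float) -> float:
    """Availability = normal (available) time / total online time."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return up_minutes / total_minutes


def count_nines(avail: float) -> int:
    """Count the consecutive nines after the decimal point (0.99306 -> 2)."""
    if avail >= 1.0:
        raise ValueError("100% availability has no finite number of nines")
    if avail <= 0.0:
        return 0
    # Each additional nine shrinks the unavailable share by a factor of 10.
    # The small epsilon guards against floating-point rounding at exact
    # boundaries such as 0.999.
    return math.floor(-math.log10(1.0 - avail) + 1e-9)


a = availability(1430, 1440)
print(f"{a:.5f} ({a:.3%}), nines: {count_nines(a)}")
```

For the example above (1430 normal minutes out of 1440), this gives an availability of about 0.99306, which is two nines.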

How to establish availability

Now that we understand what availability is, let's talk about how to measure it. There are many ways to do so; the following are several common methods:

Dial test method

The dial test method (also known as active probing or synthetic monitoring) periodically tests whether each business is running normally, at the level of its applications, functions, and modules.

For example: our business has a module named A. We periodically (for example, once every 5 minutes) spot-check the running status of this module by simulating user behavior. If the module is running normally, the check is recorded as Availability; if it is abnormal, the check is recorded as Unavailability. The proportion of Availability results accumulated over a period (for example, 1 day) is the availability of this module.

So how do we judge whether the business or module is normal? Take a web business as an example. We can check the key content of the homepage, a category page, or a content page of the service. Generally speaking, we match specified fields or keywords in the header, body, and footer of the chosen page. If the specified field, group of fields, or keywords can all be matched, the page is normal; otherwise it is abnormal. We can use scripts, Nagios, Zabbix, and other tools to run these periodic tests.

The advantages and disadvantages of this method are obvious. The advantage is that it is relatively easy to implement, and because it measures by simulating user behavior, it stays close to the actual business situation. The disadvantage is that periodic sampling can be insufficient or biased. For example, if a dial test runs every 5 minutes and a fault both occurs and is repaired between two tests, the dial test method will hardly ever catch it.
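
The keyword check described above could be sketched in Python roughly as follows. This is an illustrative sketch, not how Nagios or Zabbix implement their checks: the names `fetch_page`, `probe`, and `availability_from_probes` are hypothetical, and treating any fetch error as an abnormal result is an assumption.

```python
from typing import Callable, Iterable, Sequence
from urllib.request import urlopen


def fetch_page(url: str, timeout: float = 5.0) -> str:
    """Fetch a page body as text; network and HTTP errors propagate."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def probe(url: str, keywords: Sequence[str],
          fetch: Callable[[str], str] = fetch_page) -> bool:
    """Return True if the page loads and contains every expected keyword."""
    try:
        body = fetch(url)
    except Exception:
        # Assumption: timeouts, connection errors, and HTTP error statuses
        # all count as an abnormal (unavailable) probe result.
        return False
    return all(keyword in body for keyword in keywords)


def availability_from_probes(results: Iterable[bool]) -> float:
    """Share of probes that succeeded within the measured period."""
    results = list(results)
    if not results:
        raise ValueError("no probe results to evaluate")
    return sum(results) / len(results)


# One day of probes at 5-minute intervals is 288 samples; one failed here.
print(availability_from_probes([True] * 287 + [False]))
```

Passing `fetch` as a parameter keeps the check testable without network access; a scheduler such as cron would call `probe` at the chosen interval and store each result.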

Log analysis method

The log analysis method derives availability by analyzing the logs of each business's applications, functions, and modules.

For example: our business has a module named A. We periodically (for example, once an hour) analyze the past hour of logs from this module. The proportion of normal requests, identified from the logs, is the availability of this module over the past hour. Taking a web business as an example, we can count the 2XX and 5XX status codes in the logs separately, treating 2XX as Availability and 5XX as Unavailability. (Whether 3XX and 4XX should be included in the analysis depends on the actual business situation.)

This method clearly solves the problem of insufficient or biased sampling in the dial test method, but its result can differ significantly from the actual business impact. For example, suppose all of the errors in the past hour occurred within a single minute, and the business was normal for the remaining 59 minutes. The availability computed from request counts then deviates from what the business actually experienced. So how do we correct this deviation? This is where the log analysis threshold method comes in.
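
A minimal Python sketch of this status-code counting is shown below. It assumes access logs in the Common Log Format, where the status code follows the quoted request line; the function name `availability_from_log` and its parameters are hypothetical. By default 3XX and 4XX responses are left out of the calculation, which is only one of the possible choices mentioned above.

```python
import re
from typing import Iterable, Sequence

# Assumption: Common Log Format, e.g.
# 10.0.0.1 - - [27/Mar/2024:18:11:20 +0800] "GET / HTTP/1.1" 200 512
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def availability_from_log(lines: Iterable[str],
                          ok_classes: Sequence[int] = (2,),
                          bad_classes: Sequence[int] = (5,)) -> float:
    """Share of counted requests whose status class is in `ok_classes`.

    Status classes in neither tuple (3XX and 4XX by default) are ignored.
    """
    ok = bad = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        status_class = int(match.group(1)) // 100
        if status_class in ok_classes:
            ok += 1
        elif status_class in bad_classes:
            bad += 1
    if ok + bad == 0:
        raise ValueError("no countable requests in the log")
    return ok / (ok + bad)


sample = [
    '10.0.0.1 - - [27/Mar/2024:18:11:20 +0800] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [27/Mar/2024:18:11:21 +0800] "GET /a HTTP/1.1" 200 128',
    '10.0.0.3 - - [27/Mar/2024:18:11:22 +0800] "GET /b HTTP/1.1" 500 0',
    '10.0.0.4 - - [27/Mar/2024:18:11:23 +0800] "GET /c HTTP/1.1" 404 0',
]
print(availability_from_log(sample))
```

To treat 3XX and 4XX as normal responses instead, pass `ok_classes=(2, 3, 4)`.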

Log analysis threshold method

The log analysis threshold method measures availability by adding a per-interval status threshold on top of the log analysis method.

For example: our business has a module named A. Through log analysis we found that under normal circumstances this module receives about 100,000 requests per minute. We can then set a threshold of 10 errors, which means that we tolerate an error rate of less than one in ten thousand within any one minute. If fewer than 10 errors occur within a minute, we consider that minute normal and mark it as Availability. If 10 or more errors occur within a minute, we consider that minute abnormal and mark it as Unavailability. Finally, the proportion of minutes marked Availability is the availability of this module. Of course, the threshold needs to be adjusted to the actual situation of the business.

This method addresses both the sampling bias of the dial test method and the gap between the log analysis method and the actual business impact, and strikes a good balance between them.

One more question remains. If a business consists of three modules A, B, and C, how do we calculate the availability of the business from the availability of its modules? The simplest method is to take the plain average of the three modules' availability. However, this treats every module as equally important, which rarely matches the business objectives. Instead, we can use a weighted average aligned with the business goals. For example, if module A is more critical to the business, we give module A more weight when calculating availability; if module C sits off the main path of the business, we can reduce its weight. In this way, the availability we derive stays as close as possible to the business and its goals.
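
The per-minute threshold rule can be sketched in Python as follows. The function name `threshold_availability` is hypothetical, and treating a minute with exactly 10 errors as abnormal is an assumption, since the text only describes the cases below and above the threshold.

```python
from typing import Sequence


def threshold_availability(errors_per_minute: Sequence[int],
                           threshold: int = 10) -> float:
    """Share of minutes whose error count stays below `threshold`.

    `errors_per_minute` holds one error count per monitored minute,
    including minutes with zero errors.
    """
    if not errors_per_minute:
        raise ValueError("no minutes to evaluate")
    ok_minutes = sum(1 for errors in errors_per_minute if errors < threshold)
    return ok_minutes / len(errors_per_minute)


# One hour at about 100,000 requests per minute, with all 600 errors
# concentrated in a single minute.
errors = [0] * 59 + [600]
request_ratio = 1 - sum(errors) / (100_000 * len(errors))

print(f"request-based availability: {request_ratio:.4%}")
print(f"threshold-based availability: {threshold_availability(errors):.4%}")
```

In this scenario the request-based figure is 99.99%, while the threshold-based figure is about 98.33% (59 of 60 minutes), which reflects the one bad minute far more visibly.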
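
The weighted average can be sketched in a few lines of Python. The function name `weighted_availability`, the module figures, and the weights below are all made up for illustration; real weights would come from the business goals.

```python
from typing import Mapping, Tuple


def weighted_availability(modules: Mapping[str, Tuple[float, float]]) -> float:
    """Weighted average availability.

    `modules` maps a module name to an (availability, weight) pair.
    Weights do not need to sum to 1; they are normalized here.
    """
    total_weight = sum(weight for _, weight in modules.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(avail * weight for avail, weight in modules.values())
    return weighted_sum / total_weight


# Hypothetical figures: A is critical, C is off the main path.
modules = {
    "A": (0.999, 0.6),
    "B": (0.990, 0.3),
    "C": (0.900, 0.1),
}
simple_average = sum(avail for avail, _ in modules.values()) / len(modules)

print(f"simple average:   {simple_average:.4f}")
print(f"weighted average: {weighted_availability(modules):.4f}")
```

With these made-up numbers the simple average is 0.9630, while the weighted average is 0.9864, because the poor availability of the low-weight module C counts for less.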

Other methods

We can also test the business more broadly by using the probe nodes of third-party testing platforms such as Keynote and Borui, which improves the accuracy of the samples and reduces their bias. Of course, the results are then also affected by the stability of the third-party platform and of the network links.
For businesses with client applications, we can instrument the critical paths of the client and upload the resulting tracking logs to the server for centralized analysis. Although this method reflects the real user experience most faithfully, it also has drawbacks such as relatively high implementation cost and delayed log uploads.

Final thoughts

There are far more ways to calculate availability than those described above, and no single method solves every problem and pain point. Choose the one or more methods that best suit your business or team from the perspectives of cost, benefit, time, and so on, and use them to continuously improve the service quality of your business.
