Business is growing exponentially, can usability construction be so stable?-Safety-php.cn

1. Problems and Challenges

Since 2017, vivo's machine scale and number of services have grown significantly, as can be seen in the chart. Between 2017 and 2022, the number of machines grew about fivefold and the number of services more than tenfold.

As scale grows, challenges and complexity inevitably increase. At vivo, the typical challenges fall into two groups: change challenges and failure challenges.

1. Change Challenge

Some changes are still performed manually;

A single release takes a relatively long time;

There are many scenarios of large-scale business migration;

Google SRE has a well-known observation: about 70% of failures are caused by changes. The same holds at vivo, where changes have a large impact on online stability.

2. Failure challenges

Rapid business growth has significantly increased capacity requirements.

To meet these challenges, we organized the work along two dimensions, availability capability and availability stage, to keep the business stable.

2. Availability Capability Building

1. Full life cycle fault management

Our availability capability building is based on full life cycle fault management, covering fault occurrence, discovery, response, recovery, review, and prevention. Three indicators describe this cycle: MTTR is the time from the occurrence of a fault to its recovery; MTTF is the time from recovery to the next fault, that is, the period during which the service is stable; MTBF is the time between successive fault occurrences.
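The relationship between the three indicators can be made concrete with a small calculation. The sketch below is illustrative only: the `Incident` record and the `fault_metrics` function are hypothetical names, times are in hours, and it assumes the usual steady-state relations MTBF = MTTF + MTTR and availability = MTTF / MTBF rather than anything specific to vivo's tooling.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    start: float      # hour at which the fault occurred
    recovered: float  # hour at which service was restored


def fault_metrics(incidents):
    """Return (MTTR, MTTF, MTBF, availability) for incidents sorted by start.

    Needs at least two incidents so that one stable interval exists.
    """
    # Repair time of each fault: occurrence -> recovery.
    repair = [i.recovered - i.start for i in incidents]
    # Stable time between one recovery and the next fault.
    uptime = [nxt.start - cur.recovered
              for cur, nxt in zip(incidents, incidents[1:])]
    mttr = sum(repair) / len(repair)
    mttf = sum(uptime) / len(uptime)
    mtbf = mttf + mttr           # assumed steady-state relation
    availability = mttf / mtbf   # fraction of time the service is up
    return mttr, mttf, mtbf, availability
```

Read this way, availability improves either by making faults rarer (raising MTTF) or by recovering faster (lowering MTTR), which matches the two levers named in this section.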

Fault management comes down to these 4 points:

How to prevent faults from occurring?
How to find the fault as soon as possible?
How to quickly cure the fault?
After the fault is restored, how to follow up?

Business availability depends on how often faults occur and how long they affect the business. Reducing the frequency of faults, locating them quickly, shortening their duration, and curing them rapidly is therefore the general approach behind our high-availability capability building. The measures we have put in place are as follows:

2. Fault occurrence analysis

To prevent faults, we must first understand why they occur. This can be viewed from a service perspective and a full-link perspective.

1) Service perspective

A service simply takes a request as input and, under normal conditions, returns the corresponding output. In practice, many factors affect whether the service responds correctly. The factors seen in some classic scenarios are summarized as follows:

Capacity: exponential growth in business requests can cause a single service to produce abnormal output;

Service: a bug in the software itself causes the service to crash;
Hardware: abnormalities caused by the host hardware, the computer room, or the network.

2) Full-link perspective

Capacity layer: A sudden increase in requests and insufficient capacity of the entire link lead to service anomalies;

Service layer: Collaborative configuration is required between services. Incorrect configuration settings can also cause full-link abnormalities;
Upstream and downstream dependencies: Abnormalities in some key services can cause abnormalities across the entire link.

From the perspective of the stability of the entire link: upstream and downstream dependencies, insufficient capacity, and abnormal service configurations are all important factors affecting stability.

3. Fault prevention construction

After analyzing the fault factors from the service and full-link perspectives, we arrived at the corresponding ideas for fault prevention:

Full-link exceptions: analyze the strong and weak dependencies between upstream and downstream services, and provide special protection for key services to ensure the stability of the entire link;
Change exceptions: establish change process specifications and a change management platform;
Infrastructure exceptions: rely on a high-availability architecture, remove single points of failure, and provide good redundancy and disaster recovery.

4. Fault prevention

The sections above covered the overall analysis and construction ideas. How does vivo actually put them into practice?

We have built guarantees across the entire link, covering the access layer, business logic layer, middleware layer, storage layer, and infrastructure layer:

1) Unitization: reduce service calls across computer rooms so that the failure of a single computer room does not affect services in all computer rooms;

2) Multiple entrances: in the past, many businesses had only a single access-layer entrance. After building multi-entrance capability across IDC and public cloud, an exception at a single entrance has a smaller impact on overall service access;

3) Overload protection: when business traffic suddenly increases, the access-layer service can actively reject some burst requests according to its settings, preventing excessive request traffic from overwhelming downstream services;

4) Circuit breaking and degradation: circuit-breaking and degrading calls to dependent services shields the impact of abnormal services and avoids the avalanche effect.
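To make overload protection and circuit breaking more concrete, here is a minimal sketch of both. It is an assumption for illustration and not vivo's implementation: `LoadShedder` and `CircuitBreaker` are hypothetical names, the shedder uses a simple in-flight request limit, and the breaker opens after a number of consecutive failures and allows one trial call once a timeout has passed.

```python
import time


class LoadShedder:
    """Overload protection: reject requests beyond a fixed in-flight limit."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self):
        if self.in_flight >= self.max_in_flight:
            return False  # shed the request instead of passing it downstream
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1


class CircuitBreaker:
    """Open after consecutive failures; allow a trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock       # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None    # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Circuit open: degrade without touching the dependency.
                return fallback()
            # Timeout elapsed: half-open, let this one call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            half_open = self.opened_at is not None
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The fallback is where degradation happens: while the circuit is open, callers get a cheaper or cached answer immediately, so a failing dependency cannot tie up threads upstream and trigger an avalanche.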

5. Fault discovery

We have built fault detection capabilities across the entire link, and the proactive fault detection rate currently reaches 90%. This includes client monitoring, server monitoring, and basic monitoring:

1) Client monitoring: a self-built dial test (synthetic probing) system monitors the availability of each service by simulating user access from outside the request path;

2) Server monitoring: includes domain name monitoring, log monitoring, and monitoring of calls between services. By implementation method, it mainly takes the form of metrics, logs, and traces;

3) Basic monitoring: monitors the hardware resource usage of hosts, mainly in the form of metrics.
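As an illustration of the client-side dial test idea, the sketch below probes a list of endpoints several times and reports the success ratio of each. It is a simplified assumption, not vivo's system: `probe_availability`, `unhealthy`, and the 0.9 threshold are hypothetical, and the HTTP client is injected as a `fetch` callable so the sketch does not depend on any particular library.

```python
def probe_availability(targets, fetch, rounds=3):
    """Probe each target `rounds` times and return its success ratio.

    `fetch(url)` is an injected callable that returns an HTTP status code
    or raises an exception on a network failure.
    """
    results = {}
    for url in targets:
        ok = 0
        for _ in range(rounds):
            try:
                if 200 <= fetch(url) < 400:
                    ok += 1
            except Exception:
                pass  # a timeout or connection error counts as a failed probe
        results[url] = ok / rounds
    return results


def unhealthy(results, threshold=0.9):
    """Return the targets whose success ratio is below the alert threshold."""
    return sorted(url for url, ratio in results.items() if ratio < threshold)
```

Because the probe runs outside the serving path, it can detect failures that server-side metrics miss, such as a domain that no longer resolves or an entrance that is unreachable from the user's network.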

6. Troubleshooting

Mainly includes fault analysis and fault handling.

Fault analysis: linked with the monitoring system to support basic service fault analysis, domain name availability analysis, and so on;

Fault handling: building fault contingency plans, including plan formulation and drills.

7. Fault review

Fault review is a very important part of the entire high availability construction cycle.

We use business-based SLA grading to guarantee business stability in a targeted manner, and we record every business fault to improve and verify our capability building:

1) Business classification: operation and maintenance resources are very limited, so it is not feasible to guarantee the same SLA for every business, and tiered guarantees are necessary. Based on the reputation and revenue impact of each business, we divide businesses into four levels: core, important, general, and other. This guides how much operation and maintenance manpower and guarantee effort is invested in each business;

2) Fault records: improve review efficiency and track online business faults for subsequent analysis, which guides business optimization;

3) Fault improvement: use chaos engineering to verify in reverse whether the improvement measures have taken effect.

This is our practice in fault review. We have also built these capabilities and practices into a platform and manage fault review work through it.

8. Capacity management

Many online failures are caused by capacity issues. Once capacity resources are in place, availability can be guaranteed to a certain extent. Here we have mainly improved two capabilities: elastic scaling of resources, and resource delivery and operation management.

Resource elastic scalability: Build hybrid cloud-based resource guarantee capabilities to greatly improve resource elasticity;

Resource delivery and operation management: establish a management mechanism for the entire life cycle of resources, including budget management, demand management, procurement management, and inventory operation management, to maximize the supply and usage efficiency of resources.
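One way to reason about hybrid-cloud elasticity is as a simple capacity plan: size for the forecast peak plus some headroom, fill from the self-owned IDC first, and burst the remainder to public cloud. The sketch below is an assumption for illustration only; `plan_capacity`, its parameters, and the 30% default headroom are hypothetical and not taken from vivo's system.

```python
def plan_capacity(peak_qps, qps_per_instance, idc_instances, headroom_pct=30):
    """Return (required, from_idc, from_cloud) instance counts.

    All inputs are integers; headroom is a percentage added on top of the
    forecast peak so that bursts do not immediately exhaust capacity.
    """
    demand = peak_qps * (100 + headroom_pct)
    per_instance = 100 * qps_per_instance
    # Integer ceiling division avoids floating-point rounding surprises.
    required = -(-demand // per_instance)
    from_idc = min(required, idc_instances)   # use owned capacity first
    from_cloud = required - from_idc          # burst the rest to public cloud
    return required, from_idc, from_cloud
```

For instance, with a forecast peak of 10,000 QPS, 500 QPS per instance, and 20 IDC instances, the plan needs 26 instances, 6 of which would come from public cloud.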

3. Availability Stage Building

Beyond availability capability building, we build availability in three stages: the standardization stage, the process stage, and the platform stage.

1. Standardization stage

Why should we pursue standardization?

Standardization can greatly reduce the complexity of business operation and maintenance, thereby reducing operation and maintenance costs. We have done a lot of standardization work at both the hardware and software levels.

Software level: OS standardization, host environment standardization, service catalog standardization, Agent standardization, access-layer nginx cluster standardization, and service capability standardization (middleware services).

2. Process and standardized construction

First, we condense the best practices and methods from operation and maintenance work into process mechanisms and specifications, so that business stability guarantees are orderly and controllable. These include operation and maintenance military regulations, the fault response mechanism, public affairs specifications, and large-scale event guarantee specifications.

For example, before the guarantee specifications for large-scale events were established, online failures occurred easily during large operational campaigns or Spring Festival red envelope activities. Since the specifications were established in 2018, key guarantee periods such as the Spring Festival have run smoothly.

3. Platform and system construction

In terms of platform and system construction, we use the CMDB as the foundation and turn well-established process mechanisms into platforms, such as the change platform, monitoring platform, and service tool platform, to support business stability.

4. Availability results and prospects

By 2022, overall business stability operation and maintenance had become orderly and efficient. Business availability rose from three nines to four nines, and the number of businesses meeting the standard rose from 8 to 24.
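To put the move from three nines to four nines in perspective, an availability target can be converted into a downtime budget. The helper below is a small illustrative sketch (`allowed_downtime_minutes` is a hypothetical name), assuming a 365-day year and continuous service.

```python
def allowed_downtime_minutes(availability, period_days=365):
    """Maximum downtime in minutes permitted by an availability target."""
    return (1 - availability) * period_days * 24 * 60


three_nines = allowed_downtime_minutes(0.999)   # target of 99.9%
four_nines = allowed_downtime_minutes(0.9999)   # target of 99.99%
```

Three nines allows roughly 8.76 hours of downtime per year, while four nines allows roughly 52.6 minutes, so each additional nine cuts the budget by a factor of ten.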

These availability results were achieved mainly through availability capability building and availability stage building:

Availability capability building: fault prevention, fault discovery, fault cure, fault review

Availability stage building: standardization, process, platform

In the future, we will focus on multi-site active-active deployment and on availability guarantees for containers and cloud native.

Taking availability guarantees for containers and cloud native as an example: we used to run mostly on pure physical machines. Virtual machines were added later, and then public cloud, which further reduced our direct dependence on the underlying infrastructure. We are also adopting containers and cloud native to unitize resources and schedule them flexibly, reducing direct dependence on physical hardware resources. We therefore need to build high-availability capabilities for each of these different infrastructures.

What else can be done in availability building?

Personally, I think availability is not the only thing to consider; business quality and operating costs also matter. The operation and maintenance guarantee of the business will then enter a stage of refined operation guarantee.

Q&A

Q1: What were the biggest difficulties encountered while implementing availability building?

A1: The first is the construction specifications for the underlying technical capabilities. Failing to follow these specifications makes business availability results highly uncertain, so certain rules and standards must be set for the team, backed by a safety-net mechanism;

The second is recognition from upper management. Each business has different demands at different stages, and poor stability affects the business, its reputation, and its revenue. Once upper management recognizes this, availability building is easier to promote.

Q2: When implementing the CMDB, in addition to information such as the development owner and hosts, what other information did your company associate with it in practice? For example, is it linked to middleware information?

A2: Many of our systems are currently built on the CMDB, not only the operation and maintenance systems. Middleware services are also associated with the CMDB; for example, dubbo in our microservices relies on the CMDB for service discovery and governance.

Lecturer Introduction

Zhou Jiali is the operation and maintenance director at vivo, responsible for the operation and maintenance of vivo's Internet business. Zhou previously worked at Baidu and Tencent, with operation and maintenance experience in client, internationalization, and big data algorithm offline businesses. After joining vivo, Zhou led the construction of business high availability and raised business availability to the 99.99% level.
