Introduction
SRE stands for Site Reliability Engineering. It is an operations model that grew out of Google's internal production support work, and it defines the responsibilities of a new role. Unlike the traditional operations model, SRE emphasizes automated systems and advocates using software engineering methods to build scenario-specific automation tools that replace repetitive manual operations. In this Chat, we trace the evolution of SRE automation through SRE practice cases from foreign companies.
Contents:
The value of automation systems to SRE;
The evolution of automation systems;
SRE automation application cases of foreign Internet companies;
Automation practice in the domestic operations field.
SRE stands for Site Reliability Engineering. The term, and the newly defined position it names, originated at foreign Internet companies. In the era of the traditional system administrator model, we called this role operation and maintenance; abroad it was called Operations.
Ben Treynor is the VP in charge of SRE at Google. When he joined the company in 2003, his first task was to build a seven-person "production operations team." He soon found that, given how fast the number of Google machines was growing, the traditional operations model could not keep up with the reliability requirements. Being a senior software developer himself, he built the operations team the way one would build an R&D team. He recruited many R&D engineers who could develop software and also knew something about system administration. Most importantly, they despised repetitive work. They codified best practices, methods, and processes, and used that code to cope with growing scale and complexity.
2. The value of automation systems to SRE
Typical SRE activities fall into four categories: software engineering, systems engineering, toil, and overhead. One of these categories is directly tied to day-to-day operations work, yet SRE defines it as inefficient work, and Google uses a special word for it: toil.
Toil is the manual, repetitive, tactical work of running a service. It grows linearly as the service grows, and it can be automated. Google has publicly stated that SREs should spend at least 50% of their time on software engineering projects, because if left unchecked, toil keeps growing and quickly takes up most of an SRE's time. The work of reducing toil and scaling the service is the E (Engineering) in SRE.
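To make the 50% guideline concrete, here is a minimal sketch of how a team might check it against a time log. The `WeekLog` type, the function names, and the sample hours are illustrative assumptions, not a Google tool. Counting both software and systems engineering as engineering time is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class WeekLog:
    """Hypothetical weekly time log for one SRE, in hours."""
    software_engineering: float
    systems_engineering: float
    toil: float
    overhead: float

    def total(self) -> float:
        return (self.software_engineering + self.systems_engineering
                + self.toil + self.overhead)


def engineering_fraction(log: WeekLog) -> float:
    """Share of logged time spent on engineering work (assumed definition)."""
    total = log.total()
    if total == 0:
        return 0.0
    return (log.software_engineering + log.systems_engineering) / total


def meets_guideline(log: WeekLog, minimum: float = 0.5) -> bool:
    """True when engineering time reaches the minimum share."""
    return engineering_fraction(log) >= minimum


if __name__ == "__main__":
    balanced = WeekLog(software_engineering=16, systems_engineering=6,
                       toil=14, overhead=4)
    overloaded = WeekLog(software_engineering=8, systems_engineering=4,
                         toil=24, overhead=4)
    print(meets_guideline(balanced), meets_guideline(overloaded))
```

A check like this is only useful if the time log is honest; the point is to notice when toil starts crowding out engineering work, not to enforce the number mechanically.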
From this video we can see that the value of automation to SRE comes mainly from two aspects: performance and efficiency. When automation is mentioned, most people think first of improved efficiency. In fact, rather than simply improving efficiency, SREs emphasize balancing system performance against speed and flexibility. Automation safeguards system performance by eliminating the inconsistencies that manual execution introduces, ensuring "consistent execution of procedures with a clear scope and known steps." This is the primary value of automation.
An automation system can also provide a scalable, widely applicable platform. The platform can centralize problems, handle system errors at scale, and run tasks more continuously and more frequently than humans can. Because the platform can expose its own performance metrics, it also helps us discover details that were hard to notice in the earlier process. Of course, platformization depends on correct design and implementation.
The efficiency gain from automation is easier to understand. We often weigh the effort and time spent writing an automation program against the manual work it saves. But once an operation is automated, it is decoupled from any specific operator, so when we measure the benefit, the time and effort saved should be counted across all users.
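The argument about counting savings across all users can be illustrated with a little arithmetic. This is a minimal sketch; the function name and every number in it are made up for illustration.

```python
def hours_saved(build_hours: float, manual_minutes_per_run: float,
                runs_per_user_per_month: float, users: int,
                months: int) -> float:
    """Net hours saved by automating a task, summed over all users.

    A negative result means the automation has not yet paid for itself.
    """
    manual_hours = (manual_minutes_per_run / 60
                    * runs_per_user_per_month * users * months)
    return manual_hours - build_hours


if __name__ == "__main__":
    # Hypothetical task: 40 hours to automate, 30 minutes per manual run,
    # 4 runs per user per month, measured over 12 months.
    print(hours_saved(40, 30, 4, users=1, months=12))   # -16.0
    print(hours_saved(40, 30, 4, users=10, months=12))  # 200.0
```

Measured against a single operator, the automation looks like a loss. Measured against everyone who no longer performs the task by hand, it pays off several times over.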
Joseph Bironas, an SRE who was responsible for bringing Google's data center clusters online, once said: "If we continue to produce processes and solutions that cannot be automated, we will continue to need people to perform system maintenance. If we have to hire people to do this work, we are feeding the machines with human blood, sweat, and tears. It's like a Matrix world without special effects but full of angry sysadmins."
3. The evolution of automation
Automation at Google SRE went through several stages. The first stage had no automation and relied entirely on manual operations. Next came externally maintained, system-specific automation scripts. System-specific automation gradually evolved into generic automation, and externally maintained automation was then replaced by automation maintained inside the system itself. Eventually automation became part of the operations platform and no longer needed to be triggered manually.
4. SRE automation application cases of foreign Internet companies
Google's resource management system Borg is a typical automated application release system that Google SRE has used for a long time. Why is resource management so important? Because at a very large scale, operating cost becomes the only obstacle to evolution. Technically, a unified resource management system is very difficult to implement, and the quality of the infrastructure determines what the system can do. In a distributed environment in particular, different business scenarios place different requirements on physical servers. Borg can manage resources in a unified way only because it is supported by core technologies such as GoogleFS, BigTable, Chubby, and GSLB. SRE is the user of this system and constantly feeds back reliability requirements that improve how Borg is used. To this day, Borg remains the application release system used inside Google.
First, Borg is a fully layered architecture. From the file system at the bottom to load balancing at the top, every layer of the technology stack is built in-house at Google. The advantage is that experience accumulates and can be reused. The architectures of domestic enterprises also become layered as they develop. Human factors aside, many of these layers are assembled by combining multiple systems. On the surface this reduces cost, but in practice it raises the cost of maintenance staff. Setting aside how advanced the foreign systems are, what should we do when choosing technology? In my experience, building a layer out of several open source systems recommended by peers is a shortcut whose benefits are obvious only in the short term. Once the business grows rapidly, every refactoring means tearing things down and starting over. Having worked with enterprise systems large and small, I know this dynamic well. In SRE, changing tools does not mean replacing old open source tools with new ones. Reliability requirements should push us to reduce the number of tools we select and, on that basis, to consider what we really need. In the end, we must take the road of building our own automation systems.
Second, if the infrastructure behind Borg is advanced enough, is SRE redundant? Clearly, advanced technology cannot replace the SRE methodology. DevOps, currently the most popular concept in the industry, says little about cost and reliability. It focuses on practices such as automation and productivity improvement. These practices do not address the core issue for sustainable business growth, which is the balance between business reliability and cost control. The SRE approach is to obtain the maximum business benefit at the lowest cost. That is why the SRE position recruits systems operations engineers who can write code: a role that involves only operations work will not retain a pure developer. We should therefore broaden our perspective and approach the enterprise's internal business systems from a software engineering point of view. In the author's experience, a technology company that develops products needs a testing system, a project management system, a process control system, a release system, and so on, whatever the size of the company. Without SRE as a driver, we pick one tool to fill each gap. But the systems are not connected to each other, and no one internally really drives their iteration. In the end, operations or development is asked to simply solve the problem, and in practice it never gets fully solved.
Third, SRE can be seen everywhere in foreign Internet companies, but in China such positions are rare and the ideas are not widely shared. Is this a cultural difference? The author believes that the domestic operations field, while continuously evolving, has lagged behind the current level of understanding abroad. Yet with the rise of Taobao, Alibaba's technical support department is in fact the best proof of SRE in China. The benefits of SRE are obvious, but promoting it in small and medium-sized enterprises is very difficult. The core problem is that abroad the ecosystem of technical service providers is mature. When a small or medium-sized enterprise wants to make an SRE transformation, it can draw on solutions from many providers, and companies are willing to spend part of their budget on evaluating such technologies in advance. Domestic companies expect to buy mature technology and are unwilling to invest in infrastructure. Cost control is framed in terms of labor cost, which leaves technology providers little room to operate. In this predicament, the growth of cloud computing can act as a lubricant. In other words, a shared technology economy may be the way for SRE to take hold in China. For example, Shuren Cloud, where the author works, is a lightweight application management platform that implements the SRE concept. It works with enterprises to build the platform they need. In the course of this cooperation, the system that evolves becomes added value, and Shuren Cloud promotes it to other enterprises, so both sides win. Judging from the results, enterprises obtain a successful SRE practice, and the technical service provider gains the opportunity to practice SRE.
Fourth, SREs make good use of tools. The way problems are solved changes: from fixing the problem at hand to analyzing it in depth and producing a model and a checklist for solving it. For example, Netflix's SREs provide fellow SREs with a checklist for diagnosing Linux system performance.
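The checklist itself is not reproduced in this text. Assuming it refers to the 60-second Linux performance checklist published on the Netflix technology blog, the sketch below keeps that checklist as data so it can be versioned and shared. The one-line notes and the `render` helper are the editor's own wording, and nothing here executes the commands.

```python
from typing import List, Tuple

# (command, what to look at). Assumed to be the checklist the text refers to.
CHECKLIST: List[Tuple[str, str]] = [
    ("uptime", "load averages"),
    ("dmesg | tail", "recent kernel messages and errors"),
    ("vmstat 1", "run queue, memory, swap, and CPU per second"),
    ("mpstat -P ALL 1", "CPU balance across cores"),
    ("pidstat 1", "per-process CPU usage"),
    ("iostat -xz 1", "disk I/O load and latency"),
    ("free -m", "memory usage"),
    ("sar -n DEV 1", "network interface throughput"),
    ("sar -n TCP,ETCP 1", "TCP connection rates and retransmits"),
    ("top", "overall check"),
]


def render(checklist: List[Tuple[str, str]]) -> str:
    """Format the checklist as numbered lines for a runbook or terminal."""
    return "\n".join(
        f"{i:2d}. {cmd:<20} # {note}"
        for i, (cmd, note) in enumerate(checklist, start=1)
    )


if __name__ == "__main__":
    print(render(CHECKLIST))
```

Keeping a checklist as data rather than as tribal knowledge is the point of the example: the same list can be printed in a runbook, reviewed like code, and extended as the team learns.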
5. Automation practice in the domestic operations field
Because the domestic operations field is iterating rapidly, the author breaks down the current state of automation practice starting from the three areas of greatest concern.
1. Monitoring and alerting
There are many domestic monitoring and alerting tools, but very few popular solutions that can actually be put into practice. The one most commonly used by traditional enterprises is Zabbix. Open-Falcon, the open source enterprise-level monitoring tool developed by Xiaomi, is another option. With either tool, one very direct question cannot be avoided: how do you take the shortest path from analyzing a problem to solving the actual problem in the business scenario? Monitoring has several dimensions: system level, business level, and service level. From an SRE perspective, capacity planning is the first step, rather than reacting to lagging system indicators on the basis of prior experience. So from the start, the tools are not the hardest issue. Take Zabbix as an example. It can monitor the health of the system, the QPS of the database, and the memory of Redis. But if the website slows down, monitoring alone offers no answer. A full-link analysis is needed to locate the problem and solve it. If we follow DevOps experience, we are unlikely to ask this kind of question; instead, when a problem occurs, we ask how to switch servers or expand capacity automatically to make it go away. Costs are hard to control in a DevOps scenario. Managers can only impose a budget, and neither upstream nor downstream teams fully understand how much running the business costs.
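As a rough illustration of what a full-link analysis looks for, the sketch below takes per-hop timings for one slow request and ranks the hops by their share of the total time. The hop names and timings are hypothetical; in practice they would come from a tracing system rather than a hand-written dictionary.

```python
from typing import Dict, List, Tuple


def slowest_hops(spans: Dict[str, float],
                 top: int = 1) -> List[Tuple[str, float]]:
    """Return the `top` hops with the largest share of total request time.

    `spans` maps a hop name to the milliseconds spent in that hop.
    """
    total = sum(spans.values())
    if total <= 0:
        return []
    ranked = sorted(spans.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, ms / total) for name, ms in ranked[:top]]


if __name__ == "__main__":
    # Hypothetical timings for a single slow page load, in milliseconds.
    request_ms = {"nginx": 5.0, "app": 40.0, "redis": 5.0, "mysql": 450.0}
    for name, share in slowest_hops(request_ms, top=2):
        print(f"{name}: {share:.0%} of request time")
```

Host-level checks on each of these components could all be green while the request is still slow, which is why per-request timing across the whole path is needed to find the cause.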
2. Log monitoring
Domestic log monitoring relies heavily on the ELK (Elasticsearch, Logstash, and Kibana) technology stack. The stack is very popular and has solved a large number of internal log problems for enterprises. In real scenarios, however, business log management is still painful. One difficulty is querying real-time logs; the other is aggregating historical logs. How can log queries be provided effectively? The Qunar operations team has shared one method. By providing on-demand ELK services for internal departments on Apache Mesos, it lets developers query and analyze their own business logs at any time. When the query is finished, the ELK service instance is destroyed automatically. The author believes this innovative approach is SRE thinking put into practice.
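The "create on demand, destroy after the query" lifecycle can be sketched with a context manager. This is not the Qunar implementation and not the Mesos API: `FakeScheduler`, `on_demand_log_service`, and `query` are hypothetical names, and the scheduler is an in-memory stand-in so the example runs anywhere.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List


class FakeScheduler:
    """In-memory stand-in for a cluster scheduler (hypothetical)."""

    def __init__(self) -> None:
        self.running: Dict[str, List[str]] = {}

    def launch(self, name: str, logs: List[str]) -> None:
        self.running[name] = logs

    def destroy(self, name: str) -> None:
        self.running.pop(name, None)


@contextmanager
def on_demand_log_service(scheduler: FakeScheduler, team: str,
                          logs: List[str]) -> Iterator[str]:
    """Start a per-team log service and guarantee it is torn down."""
    name = f"elk-{team}"
    scheduler.launch(name, logs)
    try:
        yield name
    finally:
        # Runs even if the query raises, so no instance is left behind.
        scheduler.destroy(name)


def query(scheduler: FakeScheduler, name: str, keyword: str) -> List[str]:
    """Return the log lines of a running service that contain `keyword`."""
    return [line for line in scheduler.running[name] if keyword in line]


scheduler = FakeScheduler()
sample_logs = ["GET /pay 200", "POST /pay 500 error", "GET /pay 200"]
with on_demand_log_service(scheduler, "payments", sample_logs) as service:
    errors = query(scheduler, service, "error")
```

Tying teardown to the end of the `with` block means the cost of the service is paid only while someone is actually querying, which matches the cost-awareness the text attributes to SRE thinking.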
3. Continuous integration and release system
Jenkins is the tool most commonly used in domestic operations for continuous integration and release. But we often stop exploring once Jenkins is up and running. From an SRE perspective, we first ask what the business pain points of continuous integration are, because the test system is connected during the continuous integration process. The author therefore believes that the purpose of continuous integration is to continuously improve product quality. With that core goal in mind, what we manage is not just how Jenkins jobs are organized, but whether testing can be made more efficient and integration time shortened. Establishing a list of goals and folding it into the SRE improvement process will certainly produce different results. Continuous release is another topic. The real problem is that users do not fully understand what release involves. Release includes grayscale release, test release, rolling release, rollback, and other scenarios, and every scenario should be reversible. How can this be achieved? Jenkins alone cannot do it. It needs to be extended with other tools, for example a lightweight application management platform.
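To show what "every scenario should be reversible" can mean in code, here is a minimal sketch of a rolling release that restores every host it touched when a health check fails. The host names, the version strings, and the `healthy` callback are hypothetical; a real platform would call its deployment and health-check APIs instead of mutating a dictionary.

```python
from typing import Callable, Dict


def rolling_release(versions: Dict[str, str], new_version: str,
                    healthy: Callable[[str, str], bool]) -> bool:
    """Upgrade hosts one at a time and roll back on the first failure.

    `versions` maps host name to deployed version and is updated in place.
    Returns True if every host ends on `new_version`, False if the release
    was rolled back.
    """
    previous: Dict[str, str] = {}
    for host in list(versions):
        previous[host] = versions[host]
        versions[host] = new_version
        if not healthy(host, new_version):
            # Restore every host touched so far, most recent first.
            for done in reversed(list(previous)):
                versions[done] = previous[done]
            return False
    return True


if __name__ == "__main__":
    fleet = {"web-1": "v1", "web-2": "v1", "web-3": "v1"}
    ok = rolling_release(fleet, "v2", lambda host, version: host != "web-2")
    print(ok, fleet)
```

The design choice worth noting is that the previous version is recorded before each change, so the rollback path is built at the same time as the release path rather than added afterwards.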
6. Summary
Judging from the industry's history, the standardization of technology is an inevitable evolutionary process, and operations automation is in fact one expression of standardization. As a first step into SRE, sort out the job responsibilities and document the problems to be solved as a checklist, so that they can be put into practice quickly. The next step is to visualize the business indicators and scenarios, which helps the company reduce operating costs and quantify the goals of the service system. Capable enterprises will unify the interfaces to their various resources as they develop. This process takes a long time. In the author's experience, it should be done in small iterations and implemented carefully according to actual results, because platformization that ignores cost is merely a showcase project and does not actually solve the cost problem. The highest form of automation is surely an intelligent system. But from the author's point of view, perhaps everyone has watched too many science fiction movies and lost sight of the purpose of software engineering, which is to use scientific methods to maximize the benefit of software, not to build a highly intelligent self-healing system. In the author's opinion, such an artificial intelligence system is a different dimension of software engineering, much like the difference between a Nokia phone and an Apple phone. The SRE model is about using the current tool chain to get the most value out of the Nokia phone, rather than replacing it with an artificial intelligence system. Perhaps one day SRE will retire and robots will take over operations entirely, but SRE will still leave a deep mark on the history of technology.