Editor's note: Boss Jing was my team lead when I joined Baidu in 2011, and he is a hard-core veteran. This opportunity was not easy to come by, so for the benefit of readers all the common industry questions were put to him. Boss Jing is free and easy by nature; his jokes and curses are all written down as spoken, and his reasoning is easy to follow. Here is the first issue of the down-to-earth yet high-level "Operation and Maintenance Forum". Let's start!
Guest introduction
Jingyuan (first from left): former operation and maintenance architect at Baidu, former head of operation and maintenance at Xiaomi, former CIO of Meicai
Some operation and maintenance personnel reported that the company knew very little about the value of operation and maintenance. How did you clearly explain the value of operation and maintenance to the company back then?
First of all, you need to explain clearly to the company the job responsibilities of operation and maintenance (what it does and what it produces) and its key indicators (how its output is measured). Then expand on this: centered on stability, security, efficiency, and so on, which operation and maintenance projects have been carried out, and how they proactively drive the key indicators.
Key indicators include not only service availability, but also the server resource compliance rate, service failure data (fault classification, fault response time, mean fault recovery time, fault alarm coverage), service security indicators, how long service resources take to become available, and so on.
For example, build a complete monitoring system:
Monitor server resource usage, find servers whose usage is below standard, and recycle or reallocate those resources; improve resource utilization through virtualization, containerization, and similar means. Sort out alarm thresholds and standardize the P0, P1, P2, and P3 alarm levels. Have the monitoring system provide alarm merging, intelligent fault-location suggestions, proactive alarm aggregation, and alarm analysis along the time dimension, so that alarm response and fault location become easier and faster. Sort out alarms and contingency plans for each service against indicators such as fault response time and fault recovery time, in order to shorten the mean fault recovery time and improve fault alarm coverage.
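As a rough illustration of how the failure indicators mentioned above could be computed, here is a minimal Python sketch. The Fault record, its field names, and the fault_indicators function are hypothetical and are not taken from any system described in the interview; it assumes each fault has a start time, an acknowledgement time, a recovery time, and a flag saying whether an alarm caught it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Fault:
    """Hypothetical fault record; field names are illustrative only."""
    level: str                # alarm level, e.g. "P0" .. "P3"
    started: datetime         # when the fault began
    acknowledged: datetime    # when someone responded
    recovered: datetime       # when service was restored
    detected_by_alarm: bool   # True if monitoring raised an alarm for it


def mean_minutes(deltas):
    """Average a sequence of timedeltas, in minutes (0.0 if empty)."""
    deltas = list(deltas)
    if not deltas:
        return 0.0
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


def fault_indicators(faults):
    """Compute mean response time, mean recovery time, and alarm coverage."""
    covered = sum(1 for f in faults if f.detected_by_alarm)
    return {
        "mean_response_min": mean_minutes(f.acknowledged - f.started for f in faults),
        "mean_recovery_min": mean_minutes(f.recovered - f.started for f in faults),
        "alarm_coverage": covered / len(faults) if faults else 0.0,
    }


t0 = datetime(2023, 1, 1, 12, 0)
sample_faults = [
    Fault("P0", t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=30), True),
    Fault("P2", t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=60), False),
]
indicators = fault_indicators(sample_faults)
print(indicators)
```

Tracking these numbers per alarm level over time is one way to show the company whether monitoring work is actually shortening recovery and catching more faults.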
Some in the industry believe that the rise of infrastructure such as cloud and Kubernetes will gradually eliminate operation and maintenance positions. What do you think of this view?
Many years ago, the slogan of our operation and maintenance team was NO Ops, and the blog was noops.me.
People have long said that operation and maintenance positions will gradually disappear, or that some of their responsibilities will. Take system operation and maintenance as an example. It used to require a team of 20 people, including server engineers, kernel engineers, network engineers, CDN engineers, and computer room operation and maintenance engineers. Later, with the introduction of public cloud, the team shrank to 4 people: 1 cloud resource administrator, 1 CDN scheduling engineer, 1 network engineer, and 1 kernel engineer. They only needed to manage and schedule the resources and services provided by third-party companies.
With the popularization of K8s and cloud, and the continuing maturity of R&D code engineering, operation and maintenance will be less and less involved in this process. Once our deployment framework matured, we handed the deployment of second- and third-level services over to R&D as self-service, in order to save operation and maintenance manpower and improve deployment efficiency.
With the development of technology and the changes of the times, it is normal for a position to disappear. What deserves your attention is how to adjust and plan in time.
In the current environment where enterprises are moving to the cloud on a large scale, what adjustments do you think operation and maintenance personnel should make to better meet the current talent needs?
In the cloud environment, operation and maintenance engineers should be more business-oriented and architecture-oriented, expand the scope of their work, and become key talents for ensuring business stability. If you carry on as before, focusing only on monitoring alarms and handling only service deployment changes, you will definitely be eliminated.
On the other hand, you can go in the direction of specialization, become an expert in a certain field (monitoring, big data, K8s, database, etc.), and become an operation and maintenance R&D expert.
Life advice: look for more side jobs. Operation and maintenance work is only a small part of life.
AIOps has been hyped for several years, but the hype has clearly died down recently. Do you think companies should implement AIOps at this stage? What issues should they pay attention to?
Take smart monitoring as an example. I have seen a lot of marketing copy saying that AI should be used to predict faults and locate them intelligently, but I haven't seen any reliable cases so far. Internet business systems have fast-changing services, complex dependencies, and many factors affecting faults. If fault prediction from historical data were really achievable in such systems, it would be better to apply it to earthquake prediction: with thousands of years of accumulated earthquake data, it could produce great social value.
The prerequisite for doing AIOps is to really understand AI and the principles of machine learning and neural networks. You get only as much intelligence as the human effort you put in, and AIOps capability is not a slogan.
Do you think AI capabilities like ChatGPT will be able to solve problems in the operation and maintenance industry in the future?
For example, in fault management, it could give auxiliary suggestions about a fault (a suggestbot), based on the faulty equipment, data, and description, and drawing on the knowledge base, the historical fault database, and so on.
BTW, if you are already good at working with ChatGPT, invest this skill in other areas that can generate more value. Don't always waste it in the field of operation and maintenance...
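To make the suggestbot idea concrete, here is a deliberately simple Python sketch. It is not an LLM and not something described by the interviewee: it only matches a new fault description against a small, made-up historical fault list by word overlap (Jaccard similarity) and returns the fixes recorded for the closest past faults. The function names and sample data are all hypothetical.

```python
def tokens(text):
    """Lowercase a description and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest(description, history, top_n=2):
    """Return fixes from the most similar past faults, best match first.

    history is a list of (past_description, fix) pairs; past faults that
    share no words with the new description are ignored.
    """
    query = tokens(description)
    scored = [(jaccard(query, tokens(past)), fix) for past, fix in history]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [fix for _, fix in scored[:top_n]]


# Made-up historical fault database, for illustration only.
history = [
    ("disk full on mysql primary", "purge old binlogs and extend the volume"),
    ("nginx 502 after deploy", "roll back the release and check upstream health"),
    ("mysql replication lag high", "check long-running queries on the primary"),
]

suggestions = suggest("mysql primary disk full alert", history)
for fix in suggestions:
    print("suggestion:", fix)
```

A real assistant would replace the word-overlap scoring with something far stronger, but the shape stays the same: a fault description goes in, the knowledge base and fault history are searched, and ranked suggestions come out for a human to judge.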
There is endless debate in many companies about whether the deployment of business programs should be left to R&D or operation and maintenance. What do you think of this issue?
As mentioned before, our second- and third-level services are deployed entirely by R&D, while first-level services are deployed by operation and maintenance and R&D in turn. The main purpose of taking turns is to keep operation and maintenance aware of current changes to the services. When operation and maintenance personnel handled deployment in the company's early days, the focus was on standardizing the online environment and service deployment methods, so that they could better build the deployment system and keep control of the service architecture they were responsible for.
Security issues and process issues can be solved entirely by the deployment system. As operation and maintenance, don't cling to this work, which has no value and builds up no expertise.
What is the thing you most want to say to the (operation and maintenance) industry? Why?
"Physics does not exist, but the physics we think may not exist." The operation and maintenance industry may not exist anymore. How many operation and maintenance people dream of AIOps or NoOps: either kill this industry yourself, or be killed in it.
When it comes to tool selection, how do you decide whether to develop it yourself, use open source, or use commercial products?
If you have the ability and the time, use open source; if ability and time are limited, use commercial products. If you have money and spare time and are very sure of yourself, you can try developing it yourself.
Does your company also have a multi-cloud architecture? In multi-cloud scenarios, which capabilities do you think should be left to cloud vendors and which should be built in-house?
We have a multi-cloud architecture. Dedicated lines and data transmission capabilities need to be built in-house. Common capabilities that span the clouds can also be built in-house, such as monitoring systems, data backup systems, deployment systems, and core microservice components. The rest can be left to cloud vendors.
What is your most memorable failure? What inspiration does it have for you?
After so many years in operation and maintenance, I have run into too many weird failures, with root causes beyond anything you would imagine. All I can say is that failures are hard to avoid; we can only try to reduce their frequency, the area they affect, and how long they last.
So your performance should be measured not by the number of failures and their levels, but by the impact of failures, the response to them, the recovery time, and so on.
Faced with the rapid development of basic technologies, do you have any career planning suggestions for operation and maintenance personnel who have just entered the industry and those who have been in the industry for a long time?
My advice is quite extreme~ For those who have just entered the industry, I recommend changing careers as soon as possible! For those who have been in the industry for a long time, switching to another technical track is relatively difficult, because operation and maintenance has left a deep imprint on them. I have seen many operation and maintenance personnel try to switch to other technical roles, and most of them ended up in operation and maintenance R&D or operation and maintenance product manager positions. It is better to find a side job.
What do you think is the difference between traditional operation and maintenance and SRE? What was the thinking behind your team's transformation?
It's already 2023. Talking about this topic is like setting up an NOC monitoring duty rotation for Internet operation and maintenance: it is going backwards.
If you are still considering whether to transform into SRE, how to do it, and what changes SRE brings, it is like still debating whether to use 2G or 3G in the 5G era... you will be left behind by the times.
Does it feel like it came to an abrupt end? Haha, this is the first issue of "Operation and Maintenance Forum". We will continue to invite industry leaders to share. The more different opinions there are, the more interesting it gets and the more thinking it triggers. Let's keep an open mind and listen to opinions from all schools of thought. See you next time!