Last Friday, Ma Chi and Lai Wei held an online discussion on the question: are operation and maintenance jobs really no longer viable? As the host, I both kicked off the conversation and kept it moving :) I learned a lot from listening to the two veterans share their views, so I am writing it down today before I forget. Consider this a recap of the live broadcast.
Tool platforms will replace part of the workforce. This is obvious and needs no further explanation.
But who will build the tool platforms? This is worth examining. Monitoring systems, CI/CD platforms, chaos engineering platforms, middleware services, and so on are all Platforms, and they are built by Platform Engineers, referred to as PE. PE naturally splits into many groups, and each PE group is responsible for a limited number of platforms. These scattered PE teams can be organized into one large team, such as an infrastructure team, or they can be split across several departments. For example, the PE teams related to engineering performance can be placed in one department (such as the performance engineering department), the PE teams related to databases and big data in another (such as the data department), and the PE teams related to stability assurance in a third (such as the operation and maintenance department).
How these teams are divided varies from company to company, and the exact arrangement does not matter much. The key question is how a PE team should go about its work. At its core, a PE team must do the following:
About external suppliers
Extending this to career choices: even if a given segment has no good supplier today, what about three years from now? Five years from now? Have vendors overseas already taken the lead? Are there suppliers with strong potential in China? If there are, do you still dare to keep devoting yourself to that niche? Shouldn't you make some plans in advance?
Of course, our estimates of the future are usually too optimistic or too pessimistic, and our sense of timing is usually too early or too late. In the end, it depends on your own judgment.
Should OnCall fault response be handled by R&D or by operation and maintenance? This is an interesting question. Ma Chi believes that 80% of online faults are related to changes. Changes are made by R&D, who are obviously more familiar with them, so having R&D take OnCall means that 80% of problems get a faster response.
This applies to business R&D, and equally to database changes, basic network changes, and access layer changes. It seems more reasonable for whoever makes a change to respond to the fault alarms of their own service.
Actually, this depends on two premises:
In fact, we can distinguish two situations. Monitoring service stability after a change is the responsibility of the person who made the change. Daily OnCall is a different scenario and should be treated separately. So who should take daily OnCall? Those who can directly take part in locating the fault and stopping the loss. The reason is obvious: if the OnCall person receives an alarm and then has to contact someone else, the loss is stopped far too slowly.
So, first of all, alarms should be handled by category, with different people on call for different alarms. Handing every alarm to R&D, or every alarm to operation and maintenance, is too absolute an approach and is unreasonable.
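The idea of routing alarms by category can be sketched in a few lines of Python. This is only a minimal illustration, not something shown in the live broadcast: the category names, the on-call group names, and the `route` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from alarm category to on-call group (illustrative only).
ROUTES = {
    "business": "business-rd-oncall",
    "database": "dba-oncall",
    "network": "network-oncall",
    "access-layer": "gateway-oncall",
}
# Hypothetical fallback group for alarms that match no category.
DEFAULT_GROUP = "ops-oncall"


@dataclass(frozen=True)
class Alarm:
    name: str
    category: str


def route(alarm: Alarm, changed_by: Optional[str] = None) -> str:
    """Pick who should receive the alarm.

    If the alarm fires while a change is still being watched, the person
    who made the change gets it; otherwise it goes to the group that can
    locate the fault and stop the loss for that category.
    """
    if changed_by is not None:
        return changed_by
    return ROUTES.get(alarm.category, DEFAULT_GROUP)
```

For example, `route(Alarm("slow queries", "database"))` goes to the database group, while the same alarm raised right after a change by a given engineer would go to that engineer.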
There is consensus on the ultimate goal: let business R&D release versions freely. At the same time, we want releases to be controlled, to be safe, and to protect business continuity while they happen. This places extremely high demands on the CI/CD system.
If you ignore these requirements, changing the underlying layer of a system is just a matter of running a script in batches across a set of machines. Once the requirements above are added, it becomes much harder and turns into a systematic engineering effort.
On the business R&D side, services need instrumentation for observability and monitoring that detects problems in time, ideally blocking the release process automatically once an alarm fires. Blue-green and canary release mechanisms are needed, along with automatic code scanning and security scanning. If the tooling is incomplete, it is inappropriate to blindly require R&D to guarantee that changes can be rolled back and that changes are safe. The level of a company's CI/CD capabilities is a fairly good indicator of its technical strength.
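As a rough sketch of what "automatically block the release after an alarm" could look like, the Python below rolls a release out batch by batch, canary first, and stops and rolls back as soon as monitoring reports an alarm. The function name `staged_release` and the `deploy`, `has_alarm`, and `rollback` callbacks are hypothetical stand-ins for whatever the real CI/CD and monitoring systems provide.

```python
from typing import Callable, List, Sequence


def staged_release(
    batches: Sequence[str],
    deploy: Callable[[str], None],
    has_alarm: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Release batch by batch; block and roll back on the first alarm.

    Returns True if every batch was released without an alarm.
    """
    released: List[str] = []
    for batch in batches:
        deploy(batch)
        released.append(batch)
        # Hypothetical monitoring hook: did this batch trigger an alarm?
        if has_alarm(batch):
            # Block the remaining batches and undo the ones already
            # released, most recent first.
            for done in reversed(released):
                rollback(done)
            return False
    return True
```

Putting a small canary batch first in `batches` limits how much traffic is exposed before the first alarm check. A real system would also wait for an observation window after each deploy, which this sketch leaves out.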
If your company still has R&D submit tickets to operation and maintenance, with operation and maintenance then carrying out the online changes, you should consider whether this is reasonable. Of course, the approach above is geared toward Internet companies and may not suit every company. This live broadcast only offers one line of thinking, and you have to weigh it for yourself.
Of course, how do we reach this ideal state? And how should we proceed step by step before we reach it? Because of time constraints, the live broadcast did not get to these questions. If the company's business is suitable for running on Kubernetes, it is relatively easy to build such a system on Kubernetes, and you can start as soon as possible. If the company's business must run on physical machines or virtual machines, first build a unified change release platform, then fill in the gaps and improve it gradually.
The two guests did not say much on this, but both were very cautious about it. A reminder for everyone:
At this stage, while the platform system is not yet complete, building the operation and maintenance system on an architecture of self-service Platform + COE + BP (Business Partner) seems reliable and practical. In the future, when the Platform is good enough, BP headcount can be reduced (as BPs gradually gain the ability to do COE work). If the Platform keeps maturing, COE headcount can be reduced as well. After that, well, perhaps neither operation and maintenance nor R&D will be needed.