As AI technology develops, the range of AI techniques involved in different businesses keeps growing, and the number of parameters in AI models is exploding year by year. How to overcome the problems of putting AI algorithms into production, such as high development costs, heavy reliance on manual work, unstable algorithms, and long delivery cycles, has become a persistent headache for AI practitioners. An automatic machine learning platform is the key to relieving this pressure. Today I will share Du Xiaoman's hands-on experience in building its automatic machine learning platform, ATLAS.
Let me first introduce the background, development history, and current status of Du Xiaoman's machine learning platform.
Du Xiaoman is a financial technology company whose internal business scenarios fall mainly into three areas.
Because the AI technologies involved in these businesses are so diverse, putting AI algorithms into production is highly challenging.
AI algorithm implementation faces an impossible triangle: it is difficult to achieve high efficiency, low cost, and high quality of algorithm development at the same time.
Faced with these problems, I believe the only viable solution is a machine learning platform.
3. AI algorithm production process
Let's walk through the AI algorithm production process to understand the specific difficulties encountered when putting algorithms into production.
AI algorithm implementation is mainly divided into four parts: data management, model training, algorithm optimization, and deployment and release. Model training and algorithm optimization are an iterative process.
The technical skills required at each step of algorithm development differ greatly from step to step.
From the technology stack required at each step, it is clear that one, two, or even three engineers can hardly master all of the technologies involved, and every step that relies on manual work becomes a production bottleneck and a source of instability. A machine learning platform can solve both of these problems.
Our machine learning platform, ATLAS, runs through the entire AI production process. It aims to replace human involvement in AI algorithm implementation wherever possible, achieving efficient delivery and improving the efficiency of AI algorithm R&D.
ATLAS comprises four platforms.
These four platforms also form an iterative loop with one another. Their design details and operating processes are introduced below.
The data and training part covers the annotation platform, the data platform, and the training platform.
(1) Annotation platform
The annotation platform mainly produces labeled data for training AI algorithms. Since the advent of deep learning, models have become so complex that the bottleneck for algorithm performance has shifted from model design to the quality and quantity of data, so producing data efficiently is a crucial link in putting AI algorithms into production.
ATLAS's annotation platform has two main capabilities: multi-scenario coverage and intelligent annotation.

(2) Data platform

The data platform implements large-scale data governance while retaining flexibility during governance, dynamically matching samples to requests. On top of more than 5,000 feature dimensions stored for hundreds of millions of users, it supports online real-time queries, and dynamic sample matching satisfies the sample-selection and data-selection needs of different scenarios.

(3) Training platform

The training platform is a core facility and is organized into five layers.

6. ATLAS: Deployment and online

Deployment adopts a serverless-like architecture. We call it "serverless-like" because it is not a fully serverless service: it serves only online model inference rather than a broad range of general applications, so it does not need to be compatible with every scenario. The API layer exposes the three interfaces a model comes into contact with; users only need to care about the orange part of the original diagram. These APIs keep development cost low and are compatible with almost every mainstream algorithm, so a model can go from development to production within a day or even half a day. On top of this, cluster management provides stability guarantees, traffic management, and capacity management for the platform.
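To make the "serverless-like" idea a bit more concrete, here is a minimal, self-contained sketch of the pattern it describes: the algorithm developer writes only a predict function, and all of the serving machinery is platform plumbing. This is an illustration of the concept, not ATLAS's actual API or code.

```python
# Minimal illustration of the "serverless-like" serving idea: the algorithm
# developer supplies only a predict() function; routing, scaling and capacity
# management are the platform's concern. This is a toy sketch, not ATLAS code.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(payload: dict) -> dict:
    """User-supplied model logic; here just an echo placeholder."""
    return {"label": "demo", "echo": payload}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        result = predict(json.loads(body or b"{}"))
        data = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)


if __name__ == "__main__":
    # On a real platform this process would be created, scaled and monitored
    # automatically; the developer never writes this server part.
    HTTPServer(("0.0.0.0", 8080), PredictHandler).serve_forever()
```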
The following demonstrates two optimization-iteration scenarios on ATLAS.

For example, during the rollout of an OCR model, the deployed old model produces some bad cases; these bad cases are merged with the existing annotated data to form a new dataset, the old model is optimized through the AutoML optimization pipeline to produce a new model, and after deployment the cycle repeats. Each round of this loop gains the model an extra 1% of accuracy; since OCR models are already very accurate, generally above 95%, 1% is a significant improvement.

For simple, repetitive optimization processes we substitute full-pipeline AutoML; for scenarios that require expert experience, AutoML serves as an auxiliary optimization, with the result of full-pipeline AutoML used as the baseline and the best model selected for deployment. In our company, more than 60% of scenarios have obtained performance gains from this optimization approach, with improvements ranging from 1% to 5%.

2. Automatic machine learning

The following introduces which AutoML technologies we use and what improvements we have made.

1. Expert modeling and AutoML

First, the advantages of AutoML over traditional expert modeling, which fall into three aspects.

2. Introduction to AutoML

Next, the technologies commonly used in AutoML. They cover three areas; in practice, different techniques are chosen for different task scenarios, and they can also be used in combination. The following sections introduce these three technologies in turn.

3. Automatic machine learning platform: automatic optimization pipeline

The first is hyperparameter optimization. In our automatic optimization pipeline, the optimization target is actually the entire machine learning pipeline, not just the hyperparameters: it covers automated feature engineering, model selection, model training, automatic ensembling, and so on, which reduces the risk of overfitting compared with optimizing hyperparameters alone. In addition, we implemented an AutoML framework, Genesis, which is compatible with mainstream AI algorithms and AutoML tools and is easy to extend. It makes the platform's capability modules orthogonal to one another, so that they can be freely combined into more flexible automatic optimization pipelines.

4. Automatic machine learning platform: meta-learning system

Meta-learning is also used in our system. Below we explain why it is needed and where it is applied.

(1) The necessity of meta-learning

After accumulating a large amount of experimental data, we found that datasets cluster clearly in the meta-feature space, so we hypothesized that datasets whose distributions are close in the meta-feature space also have close optimal solutions. Based on this assumption, we used the hyperparameters of historical tasks to guide the hyperparameter optimization of new tasks, and found that the search converges faster and, under a limited budget, the algorithm performance improves by an extra 1%. A sketch of this warm-start idea is given below.
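The warm-start idea can be sketched with an off-the-shelf optimizer. The snippet below uses Optuna (mentioned later in the Q&A) and seeds the study with the best hyperparameters of historical, similar tasks before running the normal TPE search; the objective, parameter names, and the list of historical configurations are illustrative assumptions, not Du Xiaoman's actual setup.

```python
# Sketch: warm-starting hyperparameter optimization for a new task with the
# best hyperparameters found on historical, similar tasks (meta-learning idea).
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Best configurations found on historical tasks that are close to the new task
# in meta-feature space (assumed to be looked up from some meta-data store).
historical_best = [
    {"n_estimators": 200, "learning_rate": 0.05, "max_depth": 3},
    {"n_estimators": 400, "learning_rate": 0.03, "max_depth": 4},
]


def objective(trial: optuna.Trial) -> float:
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
    }
    model = GradientBoostingClassifier(**params, random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()


study = optuna.create_study(direction="maximize")
for params in historical_best:
    study.enqueue_trial(params)      # cold start: evaluate historical optima first
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```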
(2) Application scenarios

In big-data scenarios it is sometimes necessary to merge existing datasets, for example merging dataset A and dataset B into a new dataset C. If the hyperparameters of dataset A and dataset B are used as a cold start to guide the hyperparameter optimization of dataset C, the search space can be narrowed on the one hand, and better optimization results can be reached on the other. In day-to-day development it is also common to sample a dataset and run hyperparameter optimization on the sample; because the sampled data is close to the original data in the meta-feature space, the hyperparameters of the original dataset can guide the optimization on the sampled data and improve optimization efficiency.

5. Automatic machine learning platform: deep learning optimization

Finally, our automatic optimization for deep learning scenarios, which covers two aspects: hyperparameter optimization and exploration of NAS.

The development bottleneck of deep learning is training time: one iteration takes hours to days, and traditional Bayesian optimization needs twenty to thirty iterations, so the whole search would take one to several months. Therefore, in deep learning hyperparameter optimization we use the Hyperband method to provide seeds for Bayesian optimization and speed up the search. On top of this we also use historical information for cold starts and ensemble historical surrogate models, which reaches a global optimum faster than random initialization. A sketch of combining Hyperband-style early stopping with a Bayesian sampler is given at the end of this subsection.

In real deployment, different scenarios place different requirements on model size and inference latency. Moreover, optimizing the neural network architecture is an important part of model optimization, and we want to remove manual intervention from this step. We therefore proposed a one-shot NAS method based on weight entanglement: its search efficiency is more than 3 times that of the classic DARTS method, and the parameter count and computation cost of the searched sub-network are controllable, so we can select an appropriate model within the target budget. We also support search spaces such as MobileNet and ResNet to meet the requirements of different CV tasks.
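The snippet below illustrates the multi-fidelity idea mentioned above: a Bayesian (TPE) sampler proposes configurations while a Hyperband-style pruner stops unpromising deep-learning trials after a few epochs, so far fewer full training runs are needed. It is written with Optuna and a fake training loop, and it illustrates the general technique rather than Du Xiaoman's exact seeding scheme.

```python
# Sketch: Bayesian search (TPE) combined with Hyperband-style early stopping so
# that unpromising trials are cut off after a few epochs instead of running to
# the full budget. Toy objective; not the production optimization pipeline.
import optuna
import numpy as np


def train_one_epoch(lr: float, width: int, epoch: int) -> float:
    """Stand-in for a real training loop; returns a fake validation accuracy."""
    rng = np.random.default_rng(epoch)
    return 0.5 + width / 8192 - abs(np.log10(lr) + 2.5) * 0.1 + rng.normal(0, 0.01)


def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    width = trial.suggest_categorical("width", [128, 256, 512, 1024])
    acc = 0.0
    for epoch in range(30):                      # full budget: 30 epochs
        acc = train_one_epoch(lr, width, epoch)
        trial.report(acc, step=epoch)            # report intermediate result
        if trial.should_prune():                 # Hyperband decides to stop early
            raise optuna.TrialPruned()
    return acc


study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42),
    pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource=30),
)
study.optimize(objective, n_trials=20)
print(study.best_params)
```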
Finally, let's discuss the problems of scale and efficiency that we encountered while building the machine learning platform. We pay attention to scale and efficiency because deep learning faces a conflict between rapidly growing model sizes and the computational resources available.
3. Scale and efficiency
1. Dilemma of Deep Learning
It is an industry consensus that more parameters generally mean better model performance, and deep learning has been following its own Moore's Law of parameter growth.
So the gap between rapidly growing computational demand and hardware performance has to be bridged through optimization.
The most common optimization is parallelism, which includes data parallelism, model parallelism, and so on; of these, data parallelism is used most often.
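As a reference point, the snippet below shows the standard data-parallel pattern in PyTorch (DistributedDataParallel): each worker holds a full replica of the model, consumes a different shard of the data, and gradients are all-reduced across workers. It is a generic illustration, not ATLAS's internal implementation.

```python
# Generic data-parallel training pattern (PyTorch DDP); each process holds a
# full replica of the model and sees a different shard of the data.
# Launch with: torchrun --nproc_per_node=4 ddp_example.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def main():
    dist.init_process_group("nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda()
    model = DDP(model, device_ids=[local_rank])   # gradients are all-reduced automatically
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    data = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
    sampler = DistributedSampler(data)            # each rank gets a distinct shard
    loader = DataLoader(data, batch_size=256, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)                  # reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x.cuda()), y.cuda())
            loss.backward()
            opt.step()


if __name__ == "__main__":
    main()
```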
The data parallel technology on the ATLAS platform has several distinguishing characteristics.
For some models, data parallelism alone cannot solve the training-efficiency problem, and model parallel technology also needs to be introduced.
ATLAS model parallelism is mainly divided into two aspects:
The fully connected layer of some networks has a very large parameter scale; for example, the number of classes in an ArcFace-style head can reach the millions or even tens of millions, and such a fully connected layer cannot fit on a single GPU card. In this case intra-layer parallel technology is introduced, with different nodes computing different parts of the same tensor.
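The idea of splitting one oversized fully connected layer across devices can be illustrated with a simple column split: each GPU stores a slice of the weight matrix and computes the logits for its own subset of classes, and the partial results are combined. This is a single-process toy sketch (falling back to CPU if two GPUs are not available); a production ArcFace-style head would also compute the softmax/loss in a distributed fashion.

```python
# Sketch of intra-layer (tensor) parallelism for a huge classification head:
# the weight matrix is split by output columns across devices, each device
# computing logits for its own subset of classes. Toy single-process version.
import torch

num_classes = 100_000          # real face-ID heads can reach tens of millions
emb_dim = 512
devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]
shard = num_classes // len(devices)

# Each device owns one column shard of the (emb_dim x num_classes) weight matrix.
weights = [
    torch.randn(emb_dim, shard, device=d, requires_grad=True) for d in devices
]


def parallel_logits(features: torch.Tensor) -> torch.Tensor:
    """Compute logits for all classes by combining per-device partial results."""
    parts = [features.to(w.device) @ w for w in weights]   # local matmul per shard
    return torch.cat([p.to(devices[0]) for p in parts], dim=1)


features = torch.randn(8, emb_dim)
logits = parallel_logits(features)
print(logits.shape)            # (8, num_classes)
```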
Inter-layer parallel technology is also used: different layers of the network are computed on different nodes, and computations without dependencies are scheduled ahead of time to reduce IDLE time (GPU waiting) during computation.
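In its simplest form, inter-layer parallelism just places consecutive stages of the network on different devices and moves activations between them; pipelining micro-batches is what then reduces the idle time mentioned above. A minimal two-stage sketch, assuming two GPUs are visible and without micro-batch pipelining:

```python
# Minimal inter-layer (pipeline-style) model parallelism: the first half of the
# network lives on cuda:0, the second half on cuda:1, and activations are handed
# from one device to the next. Real pipelines additionally split each batch into
# micro-batches so that both GPUs work concurrently instead of waiting.
import torch
import torch.nn as nn


class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(256, 1024), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        return self.stage2(x.to("cuda:1"))   # hand activations to the next device


model = TwoStageNet()
out = model(torch.randn(32, 256))
print(out.shape)   # (32, 10), produced on cuda:1
```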
In addition to regular data that can be described by tensors, we have also explored parallel training on graph data.
For graph data, operations such as sampling need to cross nodes dynamically, and graph data is generally very large; our internal graphs have reached the tens-of-billions scale, which is difficult to process on a single machine.
The bottleneck of distributed graph computation lies in the mapping table, whose space complexity is O(n) in the traditional approach: a graph with 1 billion nodes and 1 billion edges needs about 160 GB of memory for the mapping table alone, which puts a ceiling on the scale of distributed training. We proposed a method with O(1) space complexity: by rearranging node and edge IDs, only the mapping boundaries need to be kept, so graph-parallel training can scale out arbitrarily.
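The space saving comes from renumbering: if IDs are reassigned so that each partition owns one contiguous range, a global ID can be resolved to a (partition, local offset) pair from just the list of range boundaries, instead of a hash table with one entry per node. A small sketch of the lookup, assuming the IDs have already been rearranged in this way:

```python
# After renumbering, each partition owns a contiguous ID range, so the whole
# "mapping table" reduces to the list of range boundaries: O(#partitions) memory
# instead of O(#nodes). Sketch of the lookup under that assumption.
from bisect import bisect_right

# Boundaries of the contiguous ID ranges owned by 4 partitions:
# partition 0 owns [0, 2.5e9), partition 1 owns [2.5e9, 5e9), and so on.
boundaries = [0, 2_500_000_000, 5_000_000_000, 7_500_000_000, 10_000_000_000]


def locate(global_id: int) -> tuple[int, int]:
    """Map a global node ID to (partition index, local ID) using only boundaries."""
    part = bisect_right(boundaries, global_id) - 1
    return part, global_id - boundaries[part]


print(locate(3_141_592_653))   # -> (1, 641592653)
```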
At the same time, we have also made some optimizations in terms of training efficiency.
A lot of GPU time is spent waiting for data to be read, leaving the GPU idle. Through pre-launch checks, in-process monitoring with early warnings, and post-hoc analysis, average GPU utilization can be doubled.
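In-process monitoring can be as simple as sampling GPU utilization in a background thread and flagging sustained idleness, which usually points at the input pipeline. The sketch below uses the NVML bindings (the `pynvml` package is assumed to be installed); the threshold and sampling interval are arbitrary illustrative values, not the platform's actual settings.

```python
# Sketch: background monitoring of GPU utilization during training, warning when
# the GPU sits idle (usually a sign the data pipeline is the bottleneck).
import threading
import time
import pynvml


def monitor_gpu(device_index: int = 0, interval_s: float = 5.0,
                warn_below: int = 30, stop: threading.Event = None):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    while stop is None or not stop.is_set():
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu   # percent
        if util < warn_below:
            print(f"[warn] GPU{device_index} utilization {util}% - "
                  f"check DataLoader workers / IO")
        time.sleep(interval_s)


stop_event = threading.Event()
threading.Thread(target=monitor_gpu, kwargs={"stop": stop_event}, daemon=True).start()
# ... run the training loop here ...
# stop_event.set()  # stop monitoring when training finishes
```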
We also adopt recomputation during backpropagation: for models with very many parameters, we do not save the outputs of every layer during the forward pass, but only keep checkpoints at selected nodes; during the backward pass, the missing activations are recomputed from the nearest checkpoint. This reduces memory usage by more than 50% and improves training efficiency by more than 35%.
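This recomputation trick corresponds to what PyTorch calls activation (gradient) checkpointing: the forward pass stores activations only at segment boundaries and recomputes the rest during backward. A minimal example with `torch.utils.checkpoint.checkpoint_sequential`; the memory and speed numbers above come from the talk, not from this toy model.

```python
# Activation (gradient) checkpointing: keep activations only at segment
# boundaries during forward, recompute the rest during backward, trading a
# little extra compute for a large reduction in memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(*[
    nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(24)
])

x = torch.randn(64, 1024, requires_grad=True)

# Split the 24 blocks into 4 segments; only the boundary activations are kept,
# intermediate ones are recomputed on the backward pass.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
print(x.grad.shape)
```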
Finally, let's talk about our experience and reflections from building the machine learning platform.
We have summarized our experience as follows. Because AI algorithm implementation involves technologies from every part of the stack, we cannot expect the people working on any single step to understand the whole picture, so there must be a platform that provides these foundational capabilities and helps everyone solve these problems. Only when automation and AutoML are applied well can the productivity of algorithm experts be truly freed up, letting them work on deeper algorithms and on capability building that raises the upper limit of machine learning.

Looking to the future:

5. Q&A session

Q1: Which open source AutoML frameworks have we tried, and which ones do we recommend?

A1: The open source AutoML framework we use most is Optuna; we have also tried Auto-Sklearn and AutoWeka. I also recommend the website automl.org: relatively few people work in this field, and the site was built by several experts and professors in AutoML, with a lot of open source learning material worth referring to. The framework we recommend for parameter tuning is Optuna, because its algorithm is not just basic Bayesian optimization but TPE, which suits scenarios with very many parameters, while Bayesian optimization remains a better fit when there are few parameters. My suggestion, though, is to try different methods on different scenarios; the more you try, the better your sense of which method suits which scenario.

Q2: How long is the development cycle of the machine learning platform?

A2: We have been building our machine learning platform for three to four years. We first solved the problem of deploying applications and then started to build production capabilities such as computation and training. If you are building from scratch, I suggest starting from some open source frameworks, then seeing what problems come up in your own business scenarios during use, so as to clarify the direction of future development.

Q3: How do we avoid overfitting during cross-validation?

A3: This is a fairly specific algorithm-optimization question. In our optimization pipeline we train with a sampling approach, which lets each task see the dataset from more angles, and we then ensemble the top models trained on these samples, which gives the final model stronger generalization. This is also a very important method in our scenarios.

Q4: What were the development cycle and staffing when building the whole machine learning platform?

A4: The development cycle, as just mentioned, is about three to four years. In terms of staffing, there are currently six or seven people, and in the early days even fewer.

Q5: Would virtualizing GPUs improve the machine learning platform?

A5: The "virtualized GPU" here should refer to partitioning and isolating resources. For a machine learning platform, virtualizing GPUs is a necessary capability: resources must be virtualized to achieve better scheduling and allocation. On top of that, GPU memory and compute can be partitioned and resource blocks of different sizes allocated to different tasks. We do not actually use this for training, because training tasks usually demand more compute and are not small-resource workloads; we use it in inference scenarios. In practice we found no good free open source solution for this kind of virtualization, and some cloud vendors offer paid solutions, so we deploy with a time-sharing multiplexing scheme, mixing tasks with high compute requirements and tasks with low compute requirements, which increases capacity to a certain extent.

Q6: What speedup ratio does multi-node distributed parallel training achieve? Is it close to linear?

A6: We can get close to a linear speedup; in our own measurements it reaches roughly 80 to 90 percent in good cases. Of course, with very large node counts further optimization may be needed. Published papers reporting 80 to 90 percent at 32 or 64 nodes usually rely on more specialized optimizations, whereas a machine learning platform has to target a wider range of scenarios; in practice most training jobs need 4 or 8 GPU cards, and at most 16 GPU cards are sufficient.

Q7: What parameters do users need to configure when using AutoML? How much compute and time does the whole run require?

A7: Ideally, our AutoML requires users to configure no parameters at all, although we do allow users to adjust or fix some parameters according to their needs. In terms of time, our goal for all AutoML scenarios is to finish optimization within one day. In terms of compute, for typical big-data modeling with tree models such as XGB or LGBM, even a single machine can handle it; for GPU tasks it depends on the scale of the task itself, and AutoML training can basically be completed with 2 to 3 times the compute of the original training.

Q8: What open source machine learning frameworks can we refer to?

A8: As mentioned above, you can look at Optuna, Auto-Sklearn, and AutoWeka, and the website automl.org has a lot of material worth studying.

Q9: What is the relationship with EasyDL?

A9: EasyDL is Baidu's product; our framework is completely self-developed.

That's it for today's sharing. Thank you all.