Paper link: https://arxiv.org/pdf/2105.10375.pdf
Application & Code:
Image classification is one of the most successful practical applications of AI today and has become part of daily life. It underlies many computer vision tasks, such as image search, OCR, content review, and identity authentication. There is a general consensus: "When the data set is larger and there are more IDs, as long as it is properly trained, the effect of the corresponding classification task will be better." However, with tens of millions or even hundreds of millions of IDs, currently popular DL frameworks struggle to run such ultra-large-scale classification training directly at low cost.
The most intuitive solution is to scale out to a cluster and use more GPUs, but even then, classification over a massive number of IDs still faces the following problems:
1) Cost: With massive data in a distributed training framework, memory overhead, multi-machine communication, and data storage and loading all consume more resources.
2) Long tail: In real scenarios, when a data set reaches hundreds of millions of IDs, most IDs have very few image samples, so the data follows a pronounced long-tailed distribution and direct training struggles to achieve good results.
The remaining chapters of this article will focus on the existing solutions for ultra-large-scale classification frameworks, as well as the corresponding principles and tricks of the low-cost classification framework FFC.
Before introducing the method, this article first reviews the main challenges of current ultra-large-scale classification:
Challenge point 1: The cost remains high
The larger the number of IDs, the greater the memory requirements of the classifier, as shown in the following diagram:
The more GPU memory the classifier needs, the more GPUs and machines are required and the higher the cost. The hardware infrastructure for multi-machine collaboration also costs more. At the same time, when the number of classification IDs becomes extremely large, most of the computation is spent in the final classifier layer, and the time consumed by the backbone network becomes negligible by comparison.
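To make the memory growth concrete, here is a rough back-of-the-envelope sketch. The 512-dimensional feature and float32 weights are assumptions chosen for illustration, not figures from the article, and the estimate counts only the classifier weights (gradients and optimizer state would add more).

```python
def classifier_memory_gib(num_ids, feature_dim=512, bytes_per_param=4):
    """Approximate size of an FC classifier weight matrix in GiB.

    feature_dim=512 and float32 (4 bytes) are illustrative assumptions.
    """
    return num_ids * feature_dim * bytes_per_param / 2**30


for num_ids in (1_000_000, 10_000_000, 100_000_000):
    size = classifier_memory_gib(num_ids)
    print(f"{num_ids:>11,} IDs -> {size:8.2f} GiB (weights only)")
```

Under these assumptions the weight matrix grows linearly with the number of IDs, which is why a full FC layer quickly stops fitting on a single GPU.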
Challenge point 2: Difficulty in long-tail learning
In real scenarios, the vast majority of the hundreds of millions of IDs have very few image samples. The long-tailed distribution is pronounced, which makes it difficult for direct training to converge. If all samples are trained with equal weight, long-tail samples are overwhelmed and insufficiently learned. In this situation, methods for imbalanced samples are generally used. There are many methods on this research topic to draw on; which of them is best suited for integration into a simple ultra-large-scale classification framework?
With these two challenges in mind, let's first look at which feasible solutions already exist and whether they address both challenges well.
Feasible method 1: metric learning
Feasible method 2: PFC framework
Feasible method 3: VFC framework
Method of this paper: FFC framework
The loss function when training with FC for large-scale classification is as follows:
During each backpropagation step, all class centers are updated:
But the FC layer is too big. The intuitive idea is to reasonably select a certain proportion of class centers, that is, only the part for which Vj is 1, as follows:
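The idea of computing the loss over only a selected share of class centers can be sketched as follows. This is a minimal illustration, not the paper's formulation: it assumes a plain softmax cross-entropy without scale or margin, and it picks the negative class centers at random. The helper names `select_centers` and `softmax_loss` are hypothetical.

```python
import math
import random


def select_centers(num_classes, label, ratio, rng):
    """Return the indices j with V_j = 1: the true class plus random negatives.

    Random selection is an assumption made for this sketch.
    """
    k = max(1, int(num_classes * ratio))
    negatives = [j for j in range(num_classes) if j != label]
    return {label} | set(rng.sample(negatives, k - 1))


def softmax_loss(feature, centers, label, selected):
    """Softmax cross-entropy computed over the selected class centers only."""
    logits = {j: sum(f * w for f, w in zip(feature, centers[j])) for j in selected}
    top = max(logits.values())  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits.values()))
    return log_sum - logits[label]


rng = random.Random(0)
num_classes, dim, label = 1000, 8, 3
centers = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_classes)]
feature = [rng.gauss(0, 1) for _ in range(dim)]

selected = select_centers(num_classes, label, ratio=0.1, rng=rng)
full_loss = softmax_loss(feature, centers, label, range(num_classes))
partial_loss = softmax_loss(feature, centers, label, selected)
print(len(selected), full_loss, partial_loss)
```

With a ratio of 0.1, only 100 of the 1,000 class centers take part in the forward and backward pass. The partial loss never exceeds the full loss, because the normalizer sums over fewer classes.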
This motivation leads to the following preliminary plan:
First of all, to reduce the impact of the long tail, this article introduces two loaders: id_loader, which samples by ID, and instance_loader, which samples by instance. With both loaders, classes with many samples and classes with few samples (few-shot) all have the opportunity to be trained in each epoch.
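The difference between the two sampling strategies can be illustrated with a small sketch. The toy data set and the function bodies below are assumptions for illustration; only the names id_loader and instance_loader come from the article.

```python
import random


def make_dataset():
    """Hypothetical long-tailed toy data: 1 head ID with 1000 images, 100 tail IDs with 2 each."""
    data = [("head", i) for i in range(1000)]
    for t in range(100):
        data += [(f"tail_{t}", i) for i in range(2)]
    return data


def instance_loader(data, batch_size, rng):
    """Sample images uniformly, so frequent IDs dominate the batch."""
    return [rng.choice(data) for _ in range(batch_size)]


def id_loader(data, batch_size, rng):
    """Sample an ID uniformly, then one of its images, so tail IDs are seen as often as head IDs."""
    by_id = {}
    for id_, img in data:
        by_id.setdefault(id_, []).append((id_, img))
    ids = sorted(by_id)
    return [rng.choice(by_id[rng.choice(ids)]) for _ in range(batch_size)]


def head_share(batch):
    """Fraction of the batch that belongs to the head ID."""
    return sum(1 for id_, _ in batch if id_ == "head") / len(batch)


rng = random.Random(0)
data = make_dataset()
inst_batch = instance_loader(data, 2000, rng)
id_batch = id_loader(data, 2000, rng)
print(f"head share, instance_loader: {head_share(inst_batch):.3f}")
print(f"head share, id_loader:       {head_share(id_batch):.3f}")
```

In this toy setup, instance sampling draws mostly head-class images, while ID sampling gives each of the 101 IDs the same chance. Combining both is one way to cover head and few-shot classes in the same epoch.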
Secondly, before training starts, a portion of the samples is sent to the ID group. Here, assume that samples from 10% of the IDs are put into the group. At this point, the gallery net uses randomly initialized parameters.
Then, when training starts, batches of samples are fed into the probe net one after another. Each sample in a batch falls into one of two cases: 1.) the group contains features with the same ID as the sample, or 2.) the group contains no features with the same ID. These two cases are called existing id and fresh id, respectively. For existing-id samples, take the inner product of the sample feature and the features in the group, compute the cross-entropy loss against the label, and backpropagate it. For fresh-id samples, minimize the cosine similarity with the samples in the group.
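The two cases can be sketched as a single loss function. This is a simplified illustration under several assumptions: features are L2-normalized, the cross-entropy uses no scale or margin, and the fresh-id loss is taken as the mean cosine similarity to all group entries. The function name `sample_loss` is hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def sample_loss(feature, label, group):
    """Return (case, loss) for one probe feature.

    group maps ID -> class-center feature (assumed L2-normalized).
    """
    f = normalize(feature)
    sims = {cid: dot(f, center) for cid, center in group.items()}
    if label in group:
        # Existing id: softmax cross-entropy over inner products with the group.
        top = max(sims.values())
        log_sum = top + math.log(sum(math.exp(s - top) for s in sims.values()))
        return "existing", log_sum - sims[label]
    # Fresh id: minimizing this pushes the feature away from every group entry.
    # Averaging the similarities is an assumption made for this sketch.
    return "fresh", sum(sims.values()) / len(sims)


group = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
existing_case, existing_loss = sample_loss([2.0, 0.0], "a", group)
fresh_case, fresh_loss = sample_loss([1.0, 1.0], "c", group)
print(existing_case, existing_loss)
print(fresh_case, fresh_loss)
```

In a real training loop both losses would be computed on tensors and backpropagated through the probe net; the plain-Python version only shows the branching logic.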
Finally, the features in the group are updated and replaced with new class centers, computed by weighting the existing class centers. The gallery net is updated gradually with a moving average of the probe net's parameters.
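A moving-average update of this kind can be sketched in a few lines. The momentum value, the blending weight for class centers, and the function names are assumptions for illustration; the article does not state them here.

```python
def ema_update(gallery_params, probe_params, momentum=0.999):
    """gallery <- momentum * gallery + (1 - momentum) * probe, element by element.

    momentum=0.999 is an illustrative default, not a value from the article.
    """
    return [momentum * g + (1.0 - momentum) * p
            for g, p in zip(gallery_params, probe_params)]


def update_center(old_center, new_feature, weight=0.5):
    """Blend the existing class center with a new feature (weight is hypothetical)."""
    return [weight * o + (1.0 - weight) * n
            for o, n in zip(old_center, new_feature)]


gallery = [0.0, 0.0]
probe = [1.0, 2.0]
for _ in range(3):
    gallery = ema_update(gallery, probe, momentum=0.9)
print(gallery)

center = update_center([1.0, 0.0], [0.0, 1.0])
print(center)
```

With a fixed probe, the gallery parameters after n steps equal (1 - momentum^n) times the probe parameters, so a momentum close to 1 makes the gallery net change slowly and keeps the stored features consistent.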
Method of this paper: Trick introduction
1.) The size of the ID group is an adjustable parameter; the default is generally 30,000.
2.) To achieve stable training, a moving average is introduced, following MoCo-style methods. The corresponding convergence conditions are:
Experimental results
1. Double Loader ablation experiment
2. Comparison of SOTA method effects
3. Comparison of video memory and sample throughput