Revealing the secret to victory in the data competition: analyzing the advantages of A100 in 200 games

2022 was a big year for AI and for data competitions, with total prize money across all platforms exceeding $5 million.

Recently, the machine learning competition analysis platform ML Contests compiled large-scale statistics on the data competitions held in 2022. The new report looks at all the noteworthy developments of the year. The following is a compilation of the original text.

Highlights:

Tools of choice for successful contestants: Python, PyData, PyTorch, and gradient-boosted decision trees.
Deep learning has not yet replaced gradient-boosted decision trees, although it often adds value when combined with boosting methods.
Transformers continue to dominate NLP and are beginning to compete with convolutional neural networks in computer vision.
Today’s data competitions cover a wide range of research areas, including computer vision, NLP, data analysis, robotics, time series analysis, etc.
Large ensemble models are still common among winning solutions, and some single-model solutions can also win.
There are multiple active data competition platforms.
The data competition community continues to grow, including in academia.
About 50% of the winners are one-person teams, and 50% of the winners are first-time winners.
Some winners use high-end hardware, but free resources like Google Colab can also win competitions.

Contests and Trends

The competition with the largest prize pool was DrivenData's Snowcast Showdown, sponsored by the U.S. Bureau of Reclamation. It offered $500,000 in prize money and was designed to help improve water supply management by providing accurate snow water equivalent estimates for different regions across the Western United States. As always, DrivenData has written a detailed article on the competition and published a detailed solution report that is well worth a read.

The most popular competition of 2022 was Kaggle's American Express Default Prediction competition, which aimed to predict whether customers would repay their loans. More than 4,000 teams competed, with $100,000 in prize money distributed to the top four teams. First place went to a first-time, one-person team using an ensemble of neural networks and LightGBM models.

The largest independent competition was Stanford University's AI Audit Challenge, which offered a $71,000 prize pool for the best “models, solutions, datasets, and tools” for auditing AI systems for illegal discrimination.

The three competitions based on financial forecasting were all on Kaggle: JPX's Tokyo Stock Exchange prediction, Ubiquant's market prediction, and G-Research's crypto forecasting.

Across problem types, computer vision accounts for the largest share of competitions, NLP ranks second, and sequential decision-making problems (reinforcement learning) are on the rise. Kaggle responded to this growth in popularity by introducing simulation competitions in 2020. AIcrowd also hosts many reinforcement learning competitions. In 2022, there were 25 of these interactive competitions, with total prize money of more than $300,000.

In the Real Robot Challenge, an official NeurIPS 2022 competition, participants had to learn to control a three-fingered robot to move a cube to a target location, or to position it at a specific point in space facing the right direction. Participants' policies were run on the physical robot every week, and the results were updated on the leaderboard. The award was a $5,000 prize and the academic honor of speaking at NeurIPS.

Platform

Although Kaggle and Tianchi are the most familiar names, many machine learning competition platforms now form an active ecosystem.

The picture below shows the 2022 platform comparison:

[Figure: 2022 platform comparison]

Some examples:

Kaggle is one of the most established platforms. It was acquired by Google in 2017 and has the largest community, recently reaching 10 million users. Running competitions with prizes on Kaggle can be very expensive. In addition to hosting competitions, Kaggle also allows users to host datasets, notebooks, and models.
CodaLab is an open source competition platform maintained by Université Paris-Saclay. Anyone can register, host or participate in a contest. It provides free CPU resources for inference that competition organizers can supplement with their own hardware.
Zindi is a smaller platform with a very active community focused on connecting institutions with data scientists in Africa.
DrivenData focuses on social impact competitions and has run competitions for NASA and other organizations. Its competitions are always followed by in-depth research reports.
AIcrowd started as a research project at the Swiss Federal Institute of Technology in Lausanne (EPFL) and is now one of the top five competition platforms. It hosts several official NeurIPS competitions.

Academia

Most of the prize money for competitions run on large platforms comes from industry, but machine learning competitions clearly have a richer history in academia, as Isabelle Guyon discussed in her NeurIPS invited talk this year.

NeurIPS is one of the most prestigious academic machine learning conferences in the world. Many of the most important machine learning papers of the past decade were presented there, including AlexNet, GANs, the Transformer and GPT-3.

NeurIPS first held the Challenges in Machine Learning (CiML) workshop in 2014, and the conference has had a competition track since 2017. Since then, the number of competitions and the total prize money have continued to grow, with prize money reaching nearly $400,000 in December 2022.

Other machine learning conferences also host competitions, including CVPR, ICPR, IJCAI, ICRA, ECCV, PCIC, and AutoML.

Prizes

About half of all machine learning competitions have prize pools of over $10,000. Many interesting competitions undoubtedly have small prizes, and this report only considers those with monetary prizes or academic honors. Data competitions associated with prestigious academic conferences often provide the winners with travel grants to attend the conference.

While some competition platforms do tend to have larger prize pools on average than others (see the platform comparison chart), many platforms hosted at least one competition with a very large prize pool in 2022: the top ten competitions by total prize money include ones run on DrivenData, Kaggle, CodaLab and AIcrowd.

How to win

This survey analyzes the techniques used in winning solutions, based on questionnaires and code review.

Almost without exception, Python was the language of choice for competition winners, which will surprise few people. Of those who used Python, about half primarily used Jupyter Notebooks, and the other half used standard Python scripts.

One winning solution used mostly R: Amir Ghazi won the Kaggle competition to predict the winners of the 2022 US men's college basketball tournament. He did this by using (apparently copying verbatim) code from a 2018 competition-winning solution written by Kaggle Grandmaster Darius Barušauskas. Remarkably, Darius also entered this competition in 2022, using a new approach, and finished 593rd.

Python Packages Used by Winners

Looking at the packages used in the winning solutions, all winners using Python used the PyData stack to some extent.

The most popular software packages are divided into three categories - core toolkits, NLP categories and computer vision categories.

Among them, the deep learning framework PyTorch has grown steadily, and its jump from 2021 to 2022 is striking: its share of winning deep learning solutions rose from 77% to 96%.

Of the 46 winning solutions using deep learning, 44 used PyTorch as their primary framework and only two used TensorFlow. Even more tellingly, one of the two competitions won using TensorFlow, Kaggle's Great Barrier Reef competition, offered an additional $50,000 in prize money to winning teams that used TensorFlow. The other competition won using TensorFlow used the high-level Keras API.
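The 96% figure follows directly from these counts. As a quick check, the short sketch below recomputes each framework's share; the dictionary and function names are illustrative only, and the counts are the ones quoted in the paragraph above.

```python
# Counts quoted in the text: 46 deep learning winners,
# 44 using PyTorch and 2 using TensorFlow.
framework_counts = {"PyTorch": 44, "TensorFlow": 2}


def framework_shares(counts):
    """Return each framework's share of winning solutions as a percentage."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


shares = framework_shares(framework_counts)
for name, pct in shares.items():
    # Rounded to whole percentages, as in the report.
    print(f"{name}: {pct:.0f}%")
```

Rounded to whole percentages, this gives 96% for PyTorch and 4% for TensorFlow.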

Three winners used pytorch-lightning and one used fastai, both of which are built on top of PyTorch, but the vast majority used PyTorch directly.

It is now fair to say that, at least in data competitions, PyTorch has won the machine learning framework battle. This is consistent with broader machine learning research trends.

Notably, we found no instances of the winning team using other neural network libraries, such as JAX (built by Google and used by DeepMind), PaddlePaddle (developed by Baidu) or MindSpore (developed by Huawei).

Computer Vision

Tooling has converged on a few dominant choices, but techniques have not. At CVPR 2022, the ConvNeXt architecture was introduced as the “ConvNet for the 2020s” and shown to outperform recent Transformer-based models. It was used in at least two competition-winning computer vision solutions, and CNNs overall remain the most popular neural network architecture among computer vision competition winners.

Computer vision is very similar to language modeling in its use of pre-trained models: well-understood architectures trained on public datasets such as ImageNet. The most popular repository is the Hugging Face Hub, accessible through timm, which makes it extremely convenient to load pre-trained versions of dozens of different computer vision models.

The advantages of using pre-trained models are clear: real-world images and human-generated text share some common characteristics, and a pre-trained model brings in that general knowledge, much as a larger and more general training dataset would.

Typically, pre-trained models are fine-tuned (that is, trained further) on task-specific data, such as the data provided by competition organizers, but not always. The winner of the Image Matching Challenge used a pre-trained model without any fine-tuning at all: "Due to the (different) quality of the training and test data in this competition, we did not fine-tune using the provided training data because we thought it would not be very effective." The decision paid off.

By far the most popular pre-trained computer vision model type among the 2022 winners was EfficientNet, which, as the name suggests, has the advantage of being less resource intensive than many other models.

Natural Language Processing

Transformer-based models have dominated the field of natural language processing (NLP) since their introduction in 2017. The Transformer is the "T" in BERT and GPT, and is also at the core of ChatGPT.

So it is no surprise that all winning solutions in natural language processing competitions had Transformer-based models at their core, nor that they were all implemented in PyTorch. They all used pre-trained models, loaded using Hugging Face's Transformers library, and almost all used Microsoft Research's DeBERTa model, usually deberta-v3-large.

Many of them required large amounts of computing resources. For example, the Google AI4Code winner ran an A100 (80GB) for approximately 10 days to train a single deberta-v3-large for their final solution. This approach, using a single main model and a fixed train/evaluation split, was the exception: all other solutions made heavy use of ensembles, and almost all used some form of k-fold cross-validation. For example, the winner of the Jigsaw Toxic Comments competition used a weighted average of the outputs of 15 models.
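To make the two techniques mentioned above concrete, here is a minimal sketch in plain Python of k-fold index splitting and weighted-average ensembling. It is not taken from any winning solution: the function names, the three toy "models", and the weights are all assumptions chosen for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


def weighted_average(predictions, weights):
    """Blend per-model prediction lists into one list using the given weights."""
    total = sum(weights)
    n = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n)
    ]


# Toy example: three hypothetical models scoring the same four comments.
model_outputs = [
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.3, 0.5, 0.2],
    [0.7, 0.1, 0.7, 0.3],
]
model_weights = [0.5, 0.3, 0.2]  # assumed weights, e.g. tuned on validation folds
blended = weighted_average(model_outputs, model_weights)
print([round(p, 2) for p in blended])

# Five folds over ten samples: each split trains on 8 and validates on 2.
splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, valid_idx in splits:
    print(len(train_idx), len(valid_idx))
```

In practice, each fold would train its own model, and the out-of-fold predictions are a common basis for choosing ensemble weights.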

Transformer-based ensembles were sometimes used together with LSTMs or LightGBM, and in at least two cases pseudo-labeling was used effectively in the winning solution.

XGBoost was once synonymous with Kaggle. However, LightGBM was clearly the favorite GBDT library among the 2022 winners: winners mentioned LightGBM in their solution reports or questionnaires as many times as CatBoost and XGBoost combined. CatBoost came second and XGBoost, surprisingly, third.
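Pseudo-labeling means training on labeled data, predicting labels for unlabeled data, and adding the confident predictions back into the training set. The sketch below shows one round of that loop with a deliberately tiny stand-in model (a 1-D nearest-centroid classifier). The model, the confidence score, and the 0.9 threshold are all assumptions for illustration, not what any winner used.

```python
def fit_centroids(xs, ys):
    """Fit a toy 1-D nearest-centroid classifier: one mean per class label."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids


def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is a crude margin score in [0.5, 1].

    Assumes at least two classes.
    """
    dists = sorted((abs(x - c), label) for label, c in centroids.items())
    (d_best, best), (d_second, _) = dists[0], dists[1]
    if d_best + d_second == 0:
        return best, 1.0
    return best, d_second / (d_best + d_second)


def pseudo_label(xs, ys, unlabeled, threshold=0.9):
    """One round of pseudo-labeling: keep confident predictions, then refit."""
    centroids = fit_centroids(xs, ys)
    new_xs, new_ys = list(xs), list(ys)
    for x in unlabeled:
        label, confidence = predict_with_confidence(centroids, x)
        if confidence >= threshold:  # discard uncertain predictions
            new_xs.append(x)
            new_ys.append(label)
    return fit_centroids(new_xs, new_ys), new_xs, new_ys


# Toy data: two well-separated classes plus three unlabeled points.
labeled_x = [0.0, 1.0, 9.0, 10.0]
labeled_y = [0, 0, 1, 1]
unlabeled_x = [0.2, 5.1, 9.9]

centroids, aug_x, aug_y = pseudo_label(labeled_x, labeled_y, unlabeled_x)
print(len(aug_x), aug_y)
```

The ambiguous point near the middle (5.1) falls below the threshold and is left out, which is the main safeguard against reinforcing the model's own mistakes.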

Compute and Hardware

As expected, most winners used GPUs for training, which can greatly improve training performance for gradient-boosted trees and is practically required for deep neural networks. A significant number of winners had access to clusters provided by their employer or university, often including GPUs.

Somewhat surprisingly, we didn’t find any instances of using Google’s Tensor Processing Unit, the TPU, to train a winning model. We also didn’t see any winning models trained on Apple’s M-series chips, which have been supported by PyTorch since May 2022.

Google's cloud notebook solution Colab was popular, with one winner on the free plan, one on the Pro plan, and another on Pro+ (we could not confirm which plan the fourth winner who used Colab was on).

Local personal hardware was more popular than cloud hardware, though nine winners who mentioned the GPU they used for training did not specify whether it was local or in the cloud.

The most popular GPU was the NVIDIA A100, the latest high-end AI accelerator (the A100 40GB and A100 80GB are grouped together here, since winners did not always specify which they used). Winners often used several at once: for example, the winner of Zindi's Turtle Recall competition used eight A100 (40GB) GPUs, and two other winners used four A100s.

Team Formation

Many competitions allow up to five participants per team, and teams can be formed by individuals or smaller teams "merging" at some point before the submission deadline.

Some competitions allow for larger teams, for example, Waymo’s Open Data Challenge allows up to 10 people per team.

Conclusion

This has been a broad look at machine learning competitions in 2022. We hope you found some useful information in it.

There are many exciting new competitions in 2023, and we look forward to releasing more insights as they wrap up.
