Much of the recent progress in natural language processing has come from large-scale language models. Each new release pushes parameter counts and training data volumes to new highs, while sweeping the existing benchmark leaderboards.
In April this year, for example, Google released PaLM (Pathways Language Model), a 540-billion-parameter language model that surpassed human performance on a series of language and reasoning evaluations and performed especially well in few-shot learning scenarios. PaLM is widely regarded as the direction for the next generation of language models.
Vision-language models follow a similar pattern: scaling up the model also improves performance.
Of course, a vision-language model that only handles multiple tasks is not truly universal; it should also support input and output in multiple languages.
Recently, Google extended this line of work from PaLM to PaLI (Pathways Language and Image model), which combines multilingual and image understanding capabilities. It supports more than 100 languages and covers a variety of vision, language, and multimodal applications, such as visual question answering, image captioning, object detection, image classification, OCR, and text reasoning.
Paper link: https://arxiv.org/abs/2209.06794
The model is trained on a collection of images crawled from the public web, with automatically collected annotations in 109 languages; the paper calls this the WebLI dataset.
PaLI models pre-trained on WebLI achieve state-of-the-art results on multiple image and language benchmarks, such as COCO-Captions, TextCaps, VQAv2, OK-VQA, and TextVQA, and also surpass previous models on multilingual visual captioning and visual question answering benchmarks.
Model Architecture
One of PaLI's goals is to study whether language and vision models behave the same way at scale, and in particular the scalability of language-image models.
The architecture is therefore deliberately simple, mainly to make experiments convenient, and especially to support reusability and scalability.
The model consists of a Transformer encoder that processes input text and an autoregressive Transformer decoder that generates output text.
When processing an image, the input to the Transformer encoder also includes "visual words" that represent the image, produced by a Vision Transformer (ViT).
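As a rough illustration of this input scheme, the sketch below concatenates stand-in ViT patch embeddings with text token embeddings into a single encoder sequence; the shapes, sizes, and function names are assumptions for illustration, not the paper's actual code.

```python
# Minimal sketch of how a PaLI-style encoder input could be assembled.
# All dimensions and embeddings here are illustrative placeholders.
import numpy as np

def embed_text(token_ids, vocab_size=32_000, d_model=512, seed=0):
    # Stand-in for the mT5 token-embedding table.
    table = np.random.default_rng(seed).normal(size=(vocab_size, d_model)).astype(np.float32)
    return table[token_ids]                          # (text_len, d_model)

def embed_image(image, num_patches=256, d_model=512, seed=1):
    # Stand-in for ViT-e: the image is split into patches and each patch
    # becomes one "visual word" embedding.
    return np.random.default_rng(seed).normal(size=(num_patches, d_model)).astype(np.float32)

token_ids = np.array([101, 2054, 2003, 1999, 1996, 3746, 102])  # e.g. "what is in the image"
text_emb = embed_text(token_ids)
visual_words = embed_image(image=None)

# The encoder sees visual words and text tokens as one concatenated sequence;
# the autoregressive decoder then generates the output text.
encoder_input = np.concatenate([visual_words, text_emb], axis=0)
print(encoder_input.shape)                           # (num_patches + text_len, d_model)
```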
A key design choice in PaLI is reuse: the researchers seeded the model with the weights of previously trained unimodal vision and language models (such as mT5-XXL and large ViTs). This reuse transfers capabilities from unimodal training and also saves computational cost.
The visual component uses the largest ViT architecture to date, ViT-e, which has the same structure and training recipe as the 1.8-billion-parameter ViT-G model but is scaled up to 4 billion parameters.
Although scaling laws have been studied in both the vision and language domains, scaling behavior in combined vision-language models has received far less attention, and scaling up the visual backbone may yield saturating gains on classification tasks.
The researchers confirmed this: ViT-e is only slightly better than ViT-G on ImageNet, yet it brings a large improvement on PaLI's vision-language tasks, outperforming ViT-G by nearly 3 CIDEr points on COCO captioning. This hints at further headroom for even larger ViT backbones in vision-language tasks.
The researchers adopted the mT5 backbone as the language modeling component, using pre-trained mT5-Large (1 billion parameters) and mT5-XXL (13 billion parameters) to initialize PaLI's language encoder-decoder, and then continued hybrid training on many language tasks, including pure language understanding tasks. This also helps avoid catastrophic forgetting of mT5's language understanding and generation capabilities.
This yielded three PaLI models of different sizes.
Scaling research in deep learning shows that the larger the model, the larger the training dataset required.
So, to comprehensively study and unlock the potential of language-image pre-training, the researchers crawled a large amount of image and text data from the web and built a new dataset, WebLI, containing 12 billion alt-texts and 10 billion images across 109 languages.
In addition to using web text as annotations, the researchers used the Cloud Vision API to run OCR on the images, yielding 29 billion image-OCR text pairs.
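A minimal, hedged sketch of what such OCR annotation might look like with the Cloud Vision API is shown below; the surrounding pipeline (batching, storage, error handling policy) is not described in the article and is assumed here.

```python
# Hedged sketch: OCR-annotating a single image with the Google Cloud Vision API.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def ocr_annotate(image_bytes: bytes) -> str:
    """Return the full text detected in one image, or an empty string."""
    response = client.text_detection(image=vision.Image(content=image_bytes))
    if response.error.message:
        raise RuntimeError(response.error.message)
    annotations = response.text_annotations
    return annotations[0].description if annotations else ""

# Usage: pair each crawled image with its OCR text.
# with open("example.jpg", "rb") as f:
#     ocr_text = ocr_annotate(f.read())
```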
Near-duplicate detection was used to remove images overlapping with the training, validation, and test splits of 68 common vision and vision-language datasets, avoiding data leakage into downstream evaluation tasks.
To further improve data quality, the researchers also scored each image/alt-text pair by cross-modal similarity and tuned a threshold, ultimately keeping only 10% of the images; about 1 billion images are used to train PaLI.
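The filtering step might look roughly like the sketch below, assuming a dual-encoder produces one embedding per image and per alt-text; the function and variable names are illustrative, not the paper's.

```python
# Hedged sketch of cross-modal similarity filtering with a tuned threshold.
import numpy as np

def similarity_filter(image_embs, text_embs, keep_fraction=0.10):
    """Keep the image/alt-text pairs whose cosine similarity is in the top fraction."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = np.sum(image_embs * text_embs, axis=1)       # cosine similarity per pair
    threshold = np.quantile(scores, 1.0 - keep_fraction)  # tuned cut-off
    return scores >= threshold, threshold

# Toy usage with random embeddings standing in for a dual-encoder's outputs.
rng = np.random.default_rng(0)
img = rng.normal(size=(1000, 256))
txt = rng.normal(size=(1000, 256))
mask, thr = similarity_filter(img, txt)
print(mask.sum(), "of", len(mask), "pairs kept at threshold", round(float(thr), 3))
```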
Because vision-language tasks are multimodal, the model needs a range of semantic processing abilities with different objectives. For example, some tasks require precisely localizing objects, while others need more global semantic information.
Similarly, some language tasks may require long answers, while others may require compact answers.
To reconcile these competing goals, the researchers leveraged the richness of the WebLI pre-training data and introduced a pre-training task mixture to prepare the model for various downstream applications.
To make the model versatile enough to solve a variety of tasks, the authors cast all tasks into a single common API (input: image + text; output: text), enabling knowledge sharing across the different image and language tasks as well as with the pre-training setup.
The pre-training objectives are cast into this same API as a weighted mixture, aiming both to preserve the reusability of the pre-trained components and to train the model to perform new tasks.
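A minimal sketch of such a weighted task mixture behind a single (image + text) -> text API is shown below; the prompt strings and mixture weights are illustrative assumptions rather than PaLI's exact recipe.

```python
# Hedged sketch: every pre-training task is expressed as (image, input_text) -> target_text,
# and tasks are sampled according to mixture weights.
import random

def captioning_example(record, lang="en"):
    return {"image": record["image"],
            "input_text": f"Generate the alt_text in {lang}.",
            "target_text": record["alt_text"]}

def ocr_example(record, lang="en"):
    return {"image": record["image"],
            "input_text": f"Generate the OCR text in {lang}.",
            "target_text": record["ocr_text"]}

def vqa_example(record, lang="en"):
    return {"image": record["image"],
            "input_text": f"Answer in {lang}: {record['question']}",
            "target_text": record["answer"]}

# Mixture weights are made up for illustration.
TASK_MIXTURE = [(captioning_example, 0.5), (ocr_example, 0.3), (vqa_example, 0.2)]

def sample_training_example(record):
    builders, weights = zip(*TASK_MIXTURE)
    builder = random.choices(builders, weights=weights, k=1)[0]
    return builder(record)
```

Because every task produces the same (image, input text, target text) triple, new downstream tasks can be added without changing the model interface.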
The model is implemented with the open-source T5X and Flaxformer frameworks and trained with Flax in JAX; the ViT-e visual component uses the open-source BigVision framework. The word embeddings of the language part and the patch embeddings of the visual part are concatenated and fed jointly into the multimodal encoder-decoder, which is initialized from the mT5-XXL pre-trained checkpoint. During PaLI training, the weights of the visual component are frozen and only the weights of the multimodal encoder-decoder are updated.
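This freezing scheme could be expressed in JAX/Optax roughly as follows; the parameter-tree layout ("vit" vs. "encoder_decoder") and the optimizer choice are assumptions for illustration, not PaLI's actual training code.

```python
# Hedged sketch: freeze the visual weights, update only the multimodal encoder-decoder.
import jax.numpy as jnp
import optax
from flax import traverse_util

params = {
    "vit": {"kernel": jnp.ones((4, 4))},               # frozen visual component (assumed layout)
    "encoder_decoder": {"kernel": jnp.ones((4, 4))},   # trainable multimodal part (assumed layout)
}

# Label each leaf as "frozen" or "trainable" based on its top-level module name.
flat = traverse_util.flatten_dict(params)
labels = traverse_util.unflatten_dict(
    {path: ("frozen" if path[0] == "vit" else "trainable") for path in flat}
)

optimizer = optax.multi_transform(
    {"trainable": optax.adafactor(learning_rate=1e-3), "frozen": optax.set_to_zero()},
    labels,
)
opt_state = optimizer.init(params)

# In a training step, gradients for the "vit" leaves are zeroed, so the ViT stays frozen.
grads = {"vit": {"kernel": jnp.ones((4, 4))}, "encoder_decoder": {"kernel": jnp.ones((4, 4))}}
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```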
In the experiments, the researchers compared PaLI on common vision-language benchmarks; the model achieves state-of-the-art results on these tasks, exceeding even the very large models proposed in prior literature.
For example, the 17-billion-parameter PaLI outperforms the 80-billion-parameter Flamingo model on several VQA and image captioning tasks.
PaLI also maintains good performance on language-only and vision-only tasks, even though these were not its main training objectives.
The researchers also examined how the image and language components interact under model scaling, and where scaling yields the greatest gains.
The conclusion is that jointly scaling both components gives the best performance; in particular, scaling the visual component, which requires relatively few parameters, is essential, and scaling is also important for better performance on multilingual tasks.
Evaluating PaLI on the Crossmodal-3600 benchmark across 35 languages shows that the multilingual captioning task benefits substantially from scaling up the PaLI models.
To avoid creating or reinforcing unfair bias in large language and image models, transparency about the data used and how the models use it is required, along with testing model fairness and conducting responsible data analysis. To this end, the paper also provides a Data Card and a Model Card.