The quality of academic and commercial machine translation (MT) systems has improved dramatically over the past decade. These improvements are largely due to advances in machine learning and the availability of large-scale web-mined datasets. Specifically, the emergence of deep learning and end-to-end (E2E) models, large parallel and monolingual datasets obtained from web mining, data augmentation methods such as back-translation and self-training, and large-scale multilingual modeling have together made it possible to build high-quality machine translation systems for more than 100 languages.
However, despite this enormous progress in low-resource machine translation, the number of languages for which widely available, general-purpose MT systems exist is still limited to roughly 100, only a small fraction of the more than 7,000 languages spoken in the world today. Beyond the limited count, the distribution of languages supported by current MT systems is also heavily skewed toward European languages.
Despite their large speaker populations, languages spoken in Africa, South and Southeast Asia, and indigenous languages of the Americas are far less well served. For example, Google Translate supports Frisian, Maltese, Icelandic, and Corsican, each of which has fewer than 1 million native speakers. By comparison, among languages it does not serve, the Bihari languages have about 51 million speakers, Oromo about 24 million, Quechua about 9 million, and Tigrinya about 9 million (as of 2022). These are known as "long-tail" languages, and the lack of data requires machine learning techniques that can generalize beyond the languages with ample training data.
Building machine translation systems for these long-tail languages is largely limited by the lack of digitized datasets and of NLP tools such as language identification (LangID) models, both of which are readily available for high-resource languages.
In a recent Google paper, "Building Machine Translation Systems for the Next Thousand Languages," more than two dozen researchers describe their efforts to build practical machine translation systems that support more than 1,000 languages.
Paper address: https://arxiv.org/pdf/2205.03983.pdf
Specifically, the researchers describe their results in the following three research areas.
First, they created a clean, web-mined dataset for 1,500 languages through semi-supervised pre-training for language identification and data-driven filtering techniques.
Second, they created machine translation models that are practically useful for underserved languages by training large-scale multilingual models on supervised parallel data for more than 100 high-resource languages and on monolingual datasets for 1,000 additional languages.
Third, they studied the limitations of evaluation metrics for these languages, conducted a qualitative analysis of the machine translation models' output, and highlighted several common error patterns of such models.
The researchers hope this work will provide useful insights to practitioners building machine translation systems for currently under-studied languages, and that it will point to research directions addressing the weaknesses of large-scale multilingual models in data-sparse settings.
At the Google I/O conference on May 12, Google announced that its translation system had added support for 24 new languages, including indigenous languages of the Americas and other low-resource languages such as the Bihari languages, Oromo, Quechua, and Tigrinya mentioned above.
The work is organized into four main chapters; below, the contents of each chapter are briefly introduced.
This chapter details the methods the researchers used to crawl and collect monolingual text data for 1,500 languages. These methods focus on recovering high-precision data (i.e., a high proportion of clean, in-language text), so a large part of them are various filtering techniques.
In general, the methods used by researchers include the following:
The figure below is a histogram of document consistency scores on web text, computed with a 1,745-language CLD3 LangID model.
Table 2 below shows monolingual data statistics for the complete low-resource language (LRL) dataset, for the portion of the monolingual data used to train the models, and for the full training set including high-resource languages.
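The article does not spell out how document consistency is computed; as a rough illustration, a score of this kind can be taken as the fraction of sentences in a document that a LangID model assigns to the expected language. The sketch below is a minimal interpretation along those lines, with a hypothetical `predict_language` callable standing in for the CLD3-style classifier; it is not the paper's exact formulation.

```python
from typing import Callable, List, Sequence


def document_consistency(sentences: List[str],
                         expected_lang: str,
                         predict_language: Callable[[str], str]) -> float:
    """Fraction of sentences whose predicted language matches the expected one.

    `predict_language` is a stand-in for a sentence-level LangID model
    (e.g. a CLD3-style classifier); the paper's exact scoring may differ.
    """
    if not sentences:
        return 0.0
    matches = sum(1 for s in sentences if predict_language(s) == expected_lang)
    return matches / len(sentences)


def filter_documents(docs: Sequence[List[str]],
                     expected_lang: str,
                     predict_language: Callable[[str], str],
                     threshold: float = 0.8) -> List[List[str]]:
    """Keep only documents whose consistency score clears a chosen threshold."""
    return [d for d in docs
            if document_consistency(d, expected_lang, predict_language) >= threshold]
```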
For the monolingual data mined from the web, the next challenge is to create high-quality, general-purpose machine translation models from limited amounts of monolingual training data. To this end, the researchers adopted a pragmatic approach: leverage all parallel data available for higher-resource languages to improve translation quality for long-tail languages where only monolingual data is available. They call this setup "zero-resource" because there is no direct supervision for the long-tail languages.
The researchers used several techniques developed for machine translation over the past few years to improve zero-resource translation quality for long-tail languages, including self-supervised learning from monolingual data, large-scale multilingual supervised learning, large-scale back-translation, and self-training of high-capacity models (a back-translation sketch follows below). With these tools, they created a machine translation model capable of translating 1,000 languages, leveraging existing parallel corpora covering approximately 100 languages and the 1,000-language monolingual dataset built from the web.
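As a reminder of how back-translation turns monolingual data into training signal, the sketch below generates synthetic (source, target) pairs by running a reverse-direction model over monolingual target-language text. The `reverse_model.translate` interface is hypothetical; any target-to-source MT model would play this role.

```python
from typing import Iterable, List, Tuple


def back_translate(monolingual_target_sentences: Iterable[str],
                   reverse_model) -> List[Tuple[str, str]]:
    """Create synthetic parallel data from monolingual target-language text.

    `reverse_model` is any target->source translation model exposing a
    `translate(sentence) -> str` method (hypothetical interface). The
    synthetic source is paired with the original, human-written target,
    so the target side of each training pair stays clean.
    """
    synthetic_pairs = []
    for target_sentence in monolingual_target_sentences:
        synthetic_source = reverse_model.translate(target_sentence)
        synthetic_pairs.append((synthetic_source, target_sentence))
    return synthetic_pairs
```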
Specifically, the researchers first highlighted the importance of model capacity in highly multilingual models by comparing the zero-resource translation performance of 1.5-billion- and 6-billion-parameter Transformers (3.2), and then scaled the number of self-supervised languages to 1,000, verifying that performance improves for most long-tail languages as more monolingual data from related languages becomes available (3.3). While this 1,000-language model showed reasonable performance, the researchers added large-scale data augmentation to probe the strengths and limitations of their approach.
In addition, the researchers fine-tuned the resulting model on a subset of 30 languages with large amounts of synthetic data generated through self-training and back-translation (3.4). They also describe practical methods for filtering the synthetic data to make these fine-tuned models more robust to hallucination and wrong-language translation (3.5); an illustrative filter is sketched below.
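The paper's exact filtering recipe is not reproduced in this summary; the sketch below only shows the general flavor of such filters, assuming a hypothetical sentence-level LangID predictor and using a wrong-language check plus a length-ratio heuristic as illustrative (not necessarily the authors') criteria.

```python
from typing import Callable


def keep_synthetic_pair(source: str,
                        target: str,
                        target_lang: str,
                        predict_language: Callable[[str], str],
                        max_len_ratio: float = 3.0) -> bool:
    """Illustrative filters for synthetic (back-translated / self-trained) pairs.

    - Drop pairs whose target is not recognized as the intended language,
      guarding against off-target (wrong-language) translation.
    - Drop pairs with extreme length ratios, a cheap hallucination heuristic.
    `predict_language` is a hypothetical sentence-level LangID callable.
    """
    if predict_language(target) != target_lang:
        return False
    src_len = max(len(source.split()), 1)
    tgt_len = max(len(target.split()), 1)
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_len_ratio
```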
The researchers also used sequence-level distillation to compress these models into smaller architectures better suited to inference, and highlighted the performance gap between teacher and student models (3.6).
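Sequence-level distillation here means training a smaller student on the teacher's decoded outputs rather than on gold references. The minimal sketch below only shows how such a distillation corpus would be assembled, with a hypothetical `teacher.translate` interface standing in for the large multilingual teacher model.

```python
from typing import Iterable, List, Tuple


def build_distillation_corpus(source_sentences: Iterable[str],
                              teacher) -> List[Tuple[str, str]]:
    """Pair each source sentence with the teacher model's translation.

    The resulting (source, teacher_output) pairs are then used as ordinary
    supervised training data for a smaller student model. `teacher.translate`
    is a hypothetical interface, not the paper's actual API.
    """
    return [(src, teacher.translate(src)) for src in source_sentences]
```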
To evaluate their machine translation models, the researchers first constructed an evaluation set for 38 selected long-tail languages by translating English sentences into those languages (4.1). They highlight the limitations of BLEU in long-tail settings and instead evaluate these languages using chrF (4.2).
The researchers also proposed an approximate, reference-free metric based on round-trip translation to gauge model quality for languages without available reference sets, and reported model quality as measured by this metric (4.3). They carried out a human evaluation of the model on a subset of 28 languages and reported the results, confirming that useful machine translation systems can be built following the approach described in the paper (4.4).
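chrF scores character n-gram overlap rather than word n-grams, which makes it more forgiving for morphologically rich, low-resource languages where exact word matches are rare. A minimal example, assuming the sacrebleu package (2.x API) is installed:

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["the cat sat on the mat"]
# One reference stream; each stream is a list with one entry per hypothesis.
references = [["the cat is sitting on the mat"]]

# corpus_chrf computes a character n-gram F-score; higher is better.
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(chrf.score)
```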
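Round-trip translation (RTT) avoids the need for references in the target language: translate English into the target language, translate the result back into English, and score the round trip against the original English. The sketch below is an approximation of that idea, not the paper's exact metric; the `translate(text, src_lang, tgt_lang)` interface is hypothetical, and sacrebleu's chrF is used as the scoring function.

```python
from typing import Callable, List

import sacrebleu  # pip install sacrebleu


def round_trip_chrf(english_sentences: List[str],
                    target_lang: str,
                    translate: Callable[[str, str, str], str]) -> float:
    """Reference-free proxy: chrF between original English and its round trip.

    `translate(text, src_lang, tgt_lang)` is a hypothetical interface to the
    multilingual model. A high round-trip chrF suggests (but does not prove)
    that the forward translation preserved the content.
    """
    round_trips = [
        translate(translate(s, "en", target_lang), target_lang, "en")
        for s in english_sentences
    ]
    return sacrebleu.corpus_chrf(round_trips, [english_sentences]).score
```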
To understand the weaknesses of large-scale multilingual zero-resource models, the researchers conducted a qualitative error analysis on several languages. They found that the models often confuse words and concepts that are distributionally similar, for example translating "tiger" as "small crocodile" (4.5), and that in lower-resource settings the models' translation quality degrades on tokens that appear less frequently (4.6).
The researchers also found that these models often fail to accurately translate short or single-word inputs (4.7). A study of the distilled models shows that all of the models are more likely to amplify biases or noise present in the training data (4.8).
The researchers conducted some additional experiments on these models, showing that they generally perform better when translating directly between related languages rather than pivoting through English (5.1), and that they can be used for zero-shot transliteration between different scripts (5.2).
They also describe a practical technique for improving translation quality: appending terminal punctuation to any input, which they call the "period trick" (5.3).
Additionally, the researchers demonstrate that these models are robust to the use of non-standard Unicode glyphs in some, but not all, languages (5.4), and they explore several languages written in non-Unicode fonts (5.5).
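The period trick is easy to apply at inference time: if the input lacks terminal punctuation, append a period before translating. A minimal sketch, assuming a hypothetical `model.translate` interface:

```python
def translate_with_period_trick(text: str, model) -> str:
    """Append terminal punctuation before translating (the "period trick", 5.3).

    `model.translate` is a hypothetical interface. The added period may also
    appear in the output and can be stripped in post-processing if undesired.
    """
    stripped = text.rstrip()
    if not stripped.endswith((".", "!", "?")):
        stripped += "."
    return model.translate(stripped)
```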
For more research details, please refer to the original paper.