
Application and research of industry search based on pre-trained language model

WBOY (forward)
2023-04-08 11:31


1. Background of industry search

1. DAMO Academy's natural language intelligence overview



The figure above is the technical block diagram of DAMO Academy's natural language intelligence. From bottom to top, it includes:

  • NLP data; basic lexical, syntactic, and semantic analysis technologies; and upper-level NLP technologies
  • Industry applications: beyond basic research, DAMO Academy empowers Alibaba Group and, together with Alibaba Cloud, empowers external industries. Many of these industry scenarios are search.
2. The nature of industry search



The essence of search is the same for the industrial and the consumer Internet: users have an information need, there is an information resource library, and a search engine bridges the two.

Take e-commerce as an example. Suppose a user searches an e-commerce site for "aj1 North Carolina blue new sneakers". To understand such a query well, a series of tasks must be performed:

  • Query understanding: NLP error correction, word segmentation, category prediction, entity recognition, term weighting, query rewriting, and other techniques
  • (Offline) document analysis: NLP analysis, quality and efficiency analysis
  • Retrieval and ranking: combining the analysis of the query and of the documents with the engine's own retrieval and ranking mechanisms bridges the two.

3. The industry search link



By search paradigm, search is generally divided into sparse retrieval and dense retrieval:

  • Sparse retrieval: traditionally builds an inverted index over words or terms, with a series of query-understanding capabilities built on top, including text-relevance ranking and so on;
  • Dense retrieval: with the rise of pre-trained language models, single-tower and two-tower models are built on pre-trained bases and combined with a vector engine to establish the retrieval mechanism.
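The two paradigms can be contrasted in a minimal sketch: a hand-rolled BM25 scorer (sparse, term-based) versus cosine similarity over dense vectors. The corpus, query, and helper names are invented for illustration; in practice the dense vectors come from pre-trained single- or two-tower models.

```python
import math
from collections import Counter

# Toy corpus; BM25 parameters k1/b use common defaults.
docs = ["aj1 north carolina blue sneakers",
        "running shoes blue mesh",
        "aj1 retro high og"]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N
df = Counter(t for d in tokenized for t in set(d))  # document frequency

def bm25(query, doc, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized doc against a query string."""
    tf = Counter(doc)
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

# Dense retrieval instead scores by similarity of embedding vectors
# (stand-in vectors here; real ones come from a two-tower encoder).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

scores = [bm25("aj1 blue", d) for d in tokenized]
best = max(range(N), key=lambda i: scores[i])  # doc 0 matches both terms
```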

By stage, the search link is generally divided into recall and ranking (rough ranking, fine ranking, re-ranking).


Recall stage:

  • keyword recall from traditional sparse retrieval
  • vector recall from dense retrieval, plus personalized recall

Rough-ranking stage: filtering with (static) text-relevance scores.

Fine-ranking stage: relatively complex; relevance models are used, possibly combined with a business-efficiency model (LTR).


From left to right, model complexity and accuracy increase; from right to left, the number of documents processed grows. Taking Taobao e-commerce as an example: recall handles billions of documents, rough ranking hundreds of thousands, fine ranking hundreds to thousands, and re-ranking tens.

A production search link is a system that trades off retrieval effectiveness against engineering efficiency. As computing power grows, complex models are moving earlier in the link; for example, models once reserved for fine ranking are gradually moving into rough ranking or even recall.
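This funnel can be sketched as a cascade in which each stage keeps fewer candidates while spending more per document. The documents and their scoring fields below are made up, standing in for real recall and ranking models.

```python
import random

# Simulated corpus: each doc carries scores from a cheap, a mid-cost,
# and an expensive model (random numbers here, for illustration only).
random.seed(0)
corpus = [{"id": i,
           "cheap": random.random(),
           "mid": random.random(),
           "expensive": random.random()} for i in range(10000)]

def cascade(docs, stages):
    """Apply each (score_key, keep_top_n) stage in order."""
    for key, keep in stages:
        docs = sorted(docs, key=lambda d: d[key], reverse=True)[:keep]
    return docs

# Stage sizes shrink like recall -> rough ranking -> fine ranking.
final = cascade(corpus, [("cheap", 1000), ("mid", 100), ("expensive", 10)])
```

The expensive scorer only ever sees the 100 survivors of the cheaper stages, which is exactly the efficiency/effectiveness trade-off described above.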


Search effectiveness evaluation:

  • Recall stage: recall rate or no-result rate
  • Ranking stage: relevance and conversion efficiency (closer to the business)
  • Relevance metrics: NDCG, MRR
  • Conversion-efficiency metrics: click-through rate, conversion rate
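For reference, the two relevance metrics can be computed from their standard definitions (the example relevance labels below are invented):

```python
import math

def mrr(ranked_lists):
    """Mean Reciprocal Rank: ranked_lists holds, per query, the 0/1
    relevance labels of results in rank order."""
    total = 0.0
    for labels in ranked_lists:
        for rank, rel in enumerate(labels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def dcg(labels, k):
    # Graded-gain DCG: (2^rel - 1) / log2(rank + 1)
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(labels[:k], start=1))

def ndcg(labels, k):
    """NDCG@k: DCG normalized by the ideal (sorted) ordering."""
    ideal = dcg(sorted(labels, reverse=True), k)
    return dcg(labels, k) / ideal if ideal > 0 else 0.0
```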
4. Search on consumer Internet and industrial Internet


Search differs greatly across industry scenarios; here we divide it into consumer-Internet search and industrial-Internet search:

  • User group and UV: consumer-Internet search has very large UV, while industrial-Internet search targets employees within governments and enterprises.
  • Pursued metrics: on the consumer Internet, besides accurate results, a high conversion rate is pursued. On the industrial Internet the need is information matching, so the focus is on recall and relevance.
  • Engineering requirements: the consumer Internet demands very high QPS and accumulates massive user behavior, requiring real-time log analysis and real-time model training. Industrial-Internet requirements are lower.
  • Algorithm direction: the consumer Internet gains most from offline, nearline, and online analysis and modeling of massive user behavior. Industrial-Internet user behavior is sparse, so content understanding (NLP or visual understanding) matters more, and research directions include low-resource and transfer learning.
2. Research on related technologies


Search is tightly coupled with the system framework, which includes offline data, the search service framework (green), and the search algorithm system (blue). Its base is the AliceMind pre-trained language model family, which also underpins document analysis, query understanding, relevance, and so on.

1. AliceMind system


AliceMind is a hierarchical pre-trained language model system built by DAMO Academy. It contains general pre-trained models as well as multilingual, multimodal, dialogue, and other models, and is the base for all NLP tasks.

2. Word segmentation


Search word segmentation (an atomic capability) determines the granularity of the retrieval index and also affects subsequent relevance and BM25 granularity. For task-specific work, customized pre-training outperforms general pre-training. For example, recent research adds unsupervised statistical information to the native BERT pre-training task, such as statistical word and n-gram granularity or boundary entropy, with an MSE loss added during pre-training. On CWS/POS and NER (figure on the right), many tasks reached SOTA.
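The boundary-entropy statistic mentioned here can be sketched on a toy corpus: the entropy of the character distribution immediately to the right of an n-gram is high at likely word boundaries. The real work computes such statistics over large unsegmented corpora and feeds them into pre-training; this is only the statistic itself.

```python
import math
from collections import Counter

corpus = "the cat sat on the mat the cat ran"
tokens = corpus.replace(" ", "")  # simulate unsegmented text

def right_entropy(text, gram):
    """Entropy of the character following each occurrence of `gram`.
    High entropy suggests `gram` ends at a word boundary."""
    following = Counter()
    for i in range(len(text) - len(gram)):
        if text[i:i + len(gram)] == gram:
            following[text[i + len(gram)]] += 1
    total = sum(following.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in following.values())
```

Here "the" (a word) is followed by varied characters and gets positive entropy, while "ca" (a word fragment) is always followed by "t" and gets zero.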


Another line of work is cross-domain. Labeling data and constructing supervised tasks for every domain is costly, so a cross-domain unsupervised word-segmentation mechanism was built. The table in the lower right is an example: e-commerce segmentation quality improved significantly over open-source segmenters. This method was published at ACL 2020.

3. Named entity recognition



Named entity recognition in search mainly concerns structured understanding of queries and documents, identifying key phrases and their types. Building the search knowledge graph also relies on NER.

Search NER also poses challenges, mainly because queries are short and lack context; e-commerce query entities, for example, are highly ambiguous and knowledge-dependent. Hence the core optimization idea for NER in recent years: enhancing NER representations through context or injected knowledge.


In 2020 and 2021 we did implicit-enhancement work, combo embedding: by dynamically integrating representations from existing word extractors or GLUE models, it reached SOTA on many business tasks.

In 2021 we developed explicit retrieval augmentation: for a piece of text, enhanced context is retrieved through a search engine and integrated into the Transformer structure. This work was published at ACL 2021.

Building on this work, we participated in the SemEval 2022 multilingual NER evaluation, won 10 tracks, and received the best system paper award.



Retrieval augmentation: in addition to the input sentence itself, extra context is retrieved and concatenated to the input, combined with a KL divergence loss to aid learning. This achieved SOTA on many open-source datasets.
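The KL-based combination can be sketched abstractly: the model's prediction from the sentence alone and its prediction with retrieved context concatenated are tied together by a symmetric KL term. The distributions below are invented numbers, not real model outputs.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative tag distributions for the same token:
p_plain = [0.5, 0.3, 0.2]  # from the sentence alone
p_aug   = [0.7, 0.2, 0.1]  # with retrieved context appended

# Symmetric KL consistency term added to the task loss during training.
consistency_loss = 0.5 * (kl(p_plain, p_aug) + kl(p_aug, p_plain))
```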

4. Adaptive multi-task training


BERT itself is very effective, but production GPU clusters are small, and running inference separately for every task is very costly. So we asked: can we run inference only once and have each task adapt on top of the shared encoder, while still getting good results?


An intuitive way is to incorporate the NLP query-analysis tasks into a meta-task framework. Traditional meta-learning samples tasks uniformly; we propose MOMETAS, an adaptive meta-learning method that adapts the sampling to the different tasks. During multi-task learning, validation data is periodically used to measure how well each task is being learned, and the resulting reward in turn guides the sampling in subsequent training. (Table below) Combining this mechanism brings sizeable improvements on many tasks over UB (uniform sampling).
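The adaptive-sampling idea can be sketched as follows, a strong simplification of MOMETAS: a softmax over per-task validation rewards replaces the uniform task-sampling distribution. The task names and reward values are made up for illustration.

```python
import math
import random

random.seed(0)

def sampling_probs(rewards, temperature=1.0):
    """Softmax over validation rewards -> task-sampling distribution."""
    exps = [math.exp(r / temperature) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]

tasks = ["segmentation", "ner", "term_weight"]
rewards = [0.2, 0.9, 0.4]  # periodic validation rewards per task (invented)

probs = sampling_probs(rewards)
# Sample the next 100 training batches according to the adapted distribution.
batch_tasks = random.choices(tasks, weights=probs, k=100)
```

Tasks that are currently learned less well (higher reward signal here) are sampled more often, instead of each task getting a fixed 1/3 share.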


Applying this mechanism to search scenarios in many industries, the benefit is that BERT encoding is performed and stored only once and then reused directly by many downstream tasks, which greatly improves performance.
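The encode-once pattern amounts to caching the encoder output and sharing it across lightweight task heads. A toy sketch with a stand-in encoder (the real encoder would be a BERT forward pass):

```python
# Counter so we can verify the heavy encoder runs only once per text.
calls = {"n": 0}

def encode(text):
    """Stand-in for one expensive BERT forward pass."""
    calls["n"] += 1
    return [float(len(w)) for w in text.split()]  # fake embedding

cache = {}
def encode_cached(text):
    if text not in cache:
        cache[text] = encode(text)
    return cache[text]

# Two toy task heads sharing the same cached encoding.
def head_sum(vec): return sum(vec)
def head_max(vec): return max(vec)

q = "aj1 north carolina blue"
a = head_sum(encode_cached(q))  # first call encodes and caches
b = head_max(encode_cached(q))  # second call reuses the cache
```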

5. Search recall pre-trained language model


Deep retrieval uses either a two-tower or a single-tower model. The common training paradigm is to fine-tune a pre-trained model with supervised signals to obtain embeddings that represent queries and documents. Recent optimization routes are mainly data augmentation or hard-negative mining on one hand, and optimizing the pre-trained language model itself on the other. Native BERT is not especially well suited to text representation for search, so search-oriented pre-trained language models have emerged; further optimizations lie in multi-view text representation and special loss design.


Compared with native BERT's random masking, we combine search term weights so that terms with higher weights are masked with higher probability; the learned representations are then better suited to search recall. We also add sentence-level contrastive learning. Combining these two mechanisms, the ROM pre-trained language model is proposed.
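Weight-aware masking can be sketched as follows, assuming term weights are already available (the tokens, weights, and masking ratio below are invented; native BERT would instead pick tokens uniformly):

```python
import random

random.seed(0)
tokens  = ["aj1", "north", "carolina", "blue", "new", "sneakers"]
weights = [0.9,   0.6,     0.6,        0.5,    0.1,   0.8]  # search term weights

def weighted_mask(tokens, weights, mask_ratio=0.3):
    """Mask ~mask_ratio of tokens, sampling proportionally to weight."""
    budget = max(1, round(mask_ratio * len(tokens)))
    total = sum(weights)
    probs = [w / total for w in weights]
    picked = set()
    while len(picked) < budget:
        picked.add(random.choices(range(len(tokens)), weights=probs, k=1)[0])
    return ["[MASK]" if i in picked else t for i, t in enumerate(tokens)]

masked = weighted_mask(tokens, weights)
```

High-weight terms like "aj1" and "sneakers" are masked far more often than filler words like "new", so the model spends its capacity on search-critical tokens.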


Experiments on MS MARCO achieve the best results compared with prior work, and real-scene search tasks also see large improvements. The model also participated in the MS MARCO leaderboard.

6. HLATR re-ranking model


Beyond the ROM recall stage, for the fine-ranking and re-ranking stages we propose HLATR, a list-aware Transformer re-ranker that organically fuses the results of many rankers through a Transformer, yielding a sizeable improvement.


Combining the two solutions, ROM and HLATR, the results from March to now (July) remain SOTA.

3. Industry search applications

1. Address analysis product


DAMO Academy's address-analysis product is motivated by the fact that many industries hold large amounts of postal-address data. Chinese addresses have many peculiarities, such as heavy ellipsis in colloquial expression. At the same time, an address locates a person or thing, making it an important entity unit that bridges many entities in the objective world. On this basis, an address knowledge graph was built to provide parsing, completion, search, and address analysis.


This is the product's technical block diagram. From bottom to top it includes construction of the address knowledge graph and the address pre-trained language model, as well as a search-engine-based framework connecting the entire link. The benchmark capabilities above are provided as APIs and packaged into industry solutions.


One important point in this technology is the geo-semantic pre-trained language model. In text, an address is a string; in space it is often a longitude and latitude, with a corresponding image on the map. These three modalities are organically integrated into a multimodal geo-semantic language model to support location tasks.


As mentioned above, many address-related basic capabilities are required, such as word segmentation, error correction, and structuring analyses.


The core link bridges the geographic pre-trained language model, the basic address tasks, and the search engine. For example, a search for "Zhejiang No. 1 Hospital" may undergo structuring, synonym correction, term weighting, vectorization, and Geohash prediction, and recall is run on the analysis results. The link is a standard search link performing text recall, pinyin recall, and vector recall, with geographic recall added. Recall is followed by multi-stage ranking, including multi-granularity feature fusion.
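Geohash itself is a standard public algorithm: interleave longitude/latitude bisection bits and emit base32 characters, so that nearby points share a common prefix. A minimal encoder shows why it suits geographic recall (this is the generic algorithm, not the product's prediction model):

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet

def geohash_encode(lat, lon, precision=6):
    """Encode lat/lon into a Geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, bit, even = 0, 0, True
    out = []
    while len(out) < precision:
        if even:  # even bit positions refine longitude
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = bits * 2 + 1, mid
            else:
                bits, lon_hi = bits * 2, mid
        else:     # odd bit positions refine latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = bits * 2 + 1, mid
            else:
                bits, lat_hi = bits * 2, mid
        even = not even
        bit += 1
        if bit == 5:  # every 5 bits -> one base32 character
            out.append(_BASE32[bits])
            bits, bit = 0, 0
    return "".join(out)
```

Because nearby coordinates share a Geohash prefix, geographic recall reduces to prefix matching in an ordinary inverted index.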


An intuitive application of the address search system is address completion and suggestion while filling in an address, or search in Amap, which must map the query to a point in space.


Next, two more industrial application solutions. The first is the new-retail Family ID. The core requirement is maintaining a customer-management system, but user information across systems is not connected and cannot be effectively integrated.


For example, when a brand manufacturer sells an air conditioner, family members register various addresses and phone numbers during purchase, installation, and maintenance, yet these actually refer to the same address. The established address-normalization technology normalizes the differently expressed addresses, generates a fingerprint, and aggregates the different user IDs into a Family concept.
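The normalize-fingerprint-aggregate idea can be sketched as follows. The alias table and addresses are invented; the real system uses the address parsing, completion, and normalization models described above rather than hand-written replacements.

```python
import hashlib
from collections import defaultdict

# Toy normalization table standing in for the real address models.
ALIASES = {"rd.": "road", "rd": "road", "st.": "street", "no.": ""}

def normalize(addr):
    words = addr.lower().replace(",", " ").split()
    words = [ALIASES.get(w, w) for w in words]
    return " ".join(w for w in words if w)

def fingerprint(addr):
    """Stable fingerprint of the normalized address."""
    return hashlib.md5(normalize(addr).encode("utf-8")).hexdigest()

records = [("user_a", "88 West Lake Rd."),
           ("user_b", "88 west lake road"),
           ("user_c", "15 Pine Street")]

# Users whose addresses normalize identically fall into one Family.
families = defaultdict(list)
for user, addr in records:
    families[fingerprint(addr)].append(user)
```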


Through Family-level aggregation, better penetration analysis, advertising reach, and other new-retail marketing activities can be achieved.


Another application scenario is intelligent alarm-call reception for 119, 120, and other emergency hotlines. Because people's lives and property are at stake, every second counts, and we hope to improve efficiency by combining speech recognition with text semantic understanding.


(Example on the left) The scenario has many characteristics, such as typos, disfluency, and colloquialism in the ASR transcript. The goal is to infer the alarm location by automatically analyzing the speech transcript.


We proposed a complete system solution covering spoken-language disfluency smoothing and error correction for dialogue understanding, intent recognition, and a search-and-recall mechanism that ultimately produces address recommendations. The link is relatively mature and has been deployed in fire-protection systems in hundreds of Chinese cities: call takers identify a location from the alarm conversation, and recommendation, matching, and address fencing together determine the specific location so the alarm can be dispatched accordingly.

2. Photo search for questions in education


Next, the photo-search-for-questions business in the education industry, which has strong demand both to-C and from teachers.


Photo search for questions has several characteristics: an incrementally updated question bank and a large user base; the domains of different subjects and age groups are highly knowledge-dependent; and it is a multimodal problem, with a link running from OCR through semantic understanding to search.


In recent years, a complete link from algorithms to systems has been built for photo search.


For example, after taking a photo with a mobile phone and running OCR, tasks such as spelling correction, subject prediction, word segmentation, and term weighting are performed to assist retrieval.


Since OCR does not recognize spaces in English, a K12 English pre-trained model was trained to perform English segmentation.
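The underlying task, recovering spaces in space-less OCR output, can be sketched with a small-vocabulary dynamic program. The vocabulary and sentence below are invented; the real system uses the K12 pre-trained model rather than a fixed word list.

```python
# Toy vocabulary standing in for a real language model's knowledge.
VOCAB = {"the", "sum", "of", "two", "numbers", "is", "ten", "a", "an"}
MAX_WORD = max(len(w) for w in VOCAB)

def segment(text):
    """DP segmentation: best[i] holds a word list covering text[:i],
    or None if no segmentation reaches position i."""
    n = len(text)
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD), i):
            if best[j] is not None and text[j:i] in VOCAB:
                best[i] = best[j] + [text[j:i]]
                break  # prefer the longest word ending at i
    return best[n]

words = segment("thesumoftwonumbersisten")
```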


Meanwhile, the subject and question type are unknown and must be predicted in advance, using multimodal intent understanding that combines image and text.


Photo search differs from ordinary user search: user queries are usually short, while a photographed question is usually complete. Many words in a question are unimportant, so term-weight analysis is needed to discard unimportant words or downgrade them in ranking.


The most obvious optimization in the photo-search scenario is vector recall. Performance requirements make an OR recall mechanism hard to afford, so AND logic is needed, which in turn yields relatively few recalled candidates; to improve recall, more redundancy modules such as term weighting and error correction are needed. (Right figure) Multi-channel recall combining text and vectors outperforms pure OR logic, and latency drops by a factor of 10.
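The OR/AND distinction over an inverted index can be shown directly (toy posting lists): AND intersects posting lists and yields fewer, cheaper candidates, while OR unions them and yields many more.

```python
# Toy inverted index: term -> set of doc ids (posting list).
index = {
    "triangle":  {1, 2, 5},
    "area":      {1, 3, 5, 7},
    "perimeter": {2, 4},
}

def recall(terms, mode="AND"):
    """Candidate set for a query under AND (intersection) or OR (union)."""
    postings = [index.get(t, set()) for t in terms]
    if not postings:
        return set()
    if mode == "AND":
        return set.intersection(*postings)
    return set.union(*postings)
```

For the query ["triangle", "area"], AND returns only docs containing both terms, which is why redundancy modules (term weighting, error correction) are needed to keep recall high.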


The photo-search link also includes image-vector recall, formula recall, and personalized recall.


Two examples. The first is plain-text OCR output: (left column) the old result based on ES, with simple OR recall plus BM25; (right column) the link after multi-channel recall and relevance ranking, greatly improved.

The second is a photo containing figures, which must be combined with image recall in the multi-channel setup.

3. Unified search over a power-industry knowledge base



Enterprises hold a lot of semi-structured and unstructured data, and unified search helps them integrate data resources. This need is not limited to electric power; other industries have similar requirements. Search here is no longer narrow search: it also includes AI document preprocessing, knowledge-graph construction, and the ability to subsequently bridge into question answering. Above is a schematic of the system for institutional and standards documents in the power knowledge base, from structuring to retrieval to application.


Statement:
This article is reproduced from 51cto.com. In case of infringement, please contact admin@php.cn for deletion.