Ximalaya breaks through the overlapping-speech problem and wins first place in an international conference challenge, accelerating AI innovation


The Multi-channel Multi-party Meeting Transcription Challenge 2.0 (M2MeT2.0), held as part of ASRU 2023 (the IEEE Automatic Speech Recognition and Understanding Workshop), a leading international speech conference, recently came to a successful conclusion. Ximalaya's Everest Laboratory delivered an outstanding result and won first place.


The ASRU workshop is the flagship technical event of the IEEE Speech and Language Processing Technical Committee (SLTC). Held every two years, it brings together leading experts and researchers from academia and industry to discuss a wide range of problems in speech recognition and understanding. The M2MeT2.0 Challenge was a key competition at ASRU 2023. Its goal is to solve the problem of transcribing overlapping speech in in-person meeting rooms. A meeting in which many people talk freely is a typical "cocktail party" scenario, and it has long been both a difficulty and a focus of speech recognition research. Progress here matters for developing speech AI for meetings and for finding industrial-grade solutions to the related problems.

This is not the first time Ximalaya has taken part in ASRU's M2MeT Challenge. In the first M2MeT Challenge, Ximalaya worked with the University of Science and Technology of China and won third place in the speaker diarization track, with a diarization error rate (DER) of only 4.05%. In that inaugural challenge, transcription was scored by character error rate (CER): audio was transcribed to text without regard to speaker labels. Building on the success of the first edition, the M2MeT2.0 Challenge focuses on speaker-attributed evaluation to make multi-speaker speech recognition systems more practical, and it has two sub-tracks: one with a limited (fixed) training data set and one with open training data.
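For readers unfamiliar with the metric, the character error rate mentioned above can be illustrated with a short Python sketch. It assumes the common definition of CER, the character-level edit distance between reference and hypothesis divided by the reference length; the function names `edit_distance` and `cer` are illustrative and are not taken from the challenge's scoring tools.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(cer("今天开会", "今天开车"))
```

Because this score ignores who spoke each character, a system can do well on it while still attributing words to the wrong speaker, which is the gap a speaker-attributed evaluation is meant to close.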

To meet this challenge, Ximalaya's Everest Laboratory started from the basic speech recognition framework and explored overlapped speech detection and speaker diarization techniques. Ximalaya took first place in both the limited-data and open-data sub-tracks of the M2MeT2.0 Challenge.

This year's M2MeT2.0 Challenge data set contains real, large-scale, multi-scenario, multi-modal data. It covers conference rooms of different sizes and layouts with varied furnishings, regular meetings on different topics, and many kinds of indoor noise. Sounds that overlap with the speech, such as other voices, TV, fans and air conditioners, keyboards, doors opening and closing, and bubble noise, make the competition harder. Each meeting is recorded simultaneously by a microphone array for far-field sound and by headset microphones for near-field sound, which allows each speaker's speech to be transcribed accurately. The data set is of real academic value for research on multi-speaker speech recognition and overlapping speech, and it provides realistic, diverse data for finding industrial-grade solutions.

All speakers in the M2MeT2.0 Challenge data set are native speakers of Chinese. Ximalaya took part through a combination of industry, academia, and research, and is committed to contributing to the development of speech recognition technology in China. In the challenge, Ximalaya demonstrated strong speaker recognition and automatic speech recognition (ASR) performance. Its Everest Laboratory team used self-developed speaker recognition, speech enhancement, and speech recognition modules, and through optimization and accumulated experience made significant progress on overlapping speech and multi-speaker environments. By combining deep learning and neural network models, the Everest Laboratory is able to separate, identify, and accurately transcribe the speech of multiple speakers in real time.

These technologies have not only been validated in the ASRU 2023 M2MeT2.0 Challenge but have also been applied to Ximalaya's AIGC content production. Ximalaya's automatic speech recognition (ASR) technology is now widely used in the AI transcript feature of the Ximalaya app. For audio on the platform that has no script, the feature transcribes the speech and outputs the corresponding text, making the content easier for listeners to follow. For audio that does have an original script, the feature uses ultra-long audio-to-text alignment to timestamp the audio against the script, so that the corresponding text is highlighted in sync with playback and users can conveniently listen and read at the same time.
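The synchronized highlighting described above can be pictured as a lookup from the current playback position to a timestamped transcript segment. The following is a minimal Python sketch under stated assumptions: alignment has already produced segments sorted by start time and not overlapping, and the `Segment` type and `active_segment` function are hypothetical names used only for illustration, not Ximalaya's implementation.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Segment(NamedTuple):
    """One aligned piece of script text (hypothetical structure)."""
    start: float  # start time in seconds
    end: float    # end time in seconds (exclusive)
    text: str


def active_segment(segments: Sequence[Segment], position: float) -> Optional[int]:
    """Return the index of the segment playing at `position`, or None in a gap.

    Assumes `segments` is sorted by start time and segments do not overlap.
    """
    starts = [s.start for s in segments]
    # Index of the last segment whose start time is <= position.
    i = bisect_right(starts, position) - 1
    if i >= 0 and position < segments[i].end:
        return i
    return None


if __name__ == "__main__":
    script = [
        Segment(0.0, 2.5, "Welcome to the show."),
        Segment(2.5, 5.0, "Today we talk about speech recognition."),
        Segment(6.0, 8.0, "Let's begin."),
    ]
    idx = active_segment(script, 3.2)
    print(script[idx].text if idx is not None else "(no text to highlight)")
```

A binary search keeps each lookup fast even for very long recordings with many segments; a production player would build the `starts` list once rather than on every call.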


In addition to ASR, Ximalaya's TTS (speech synthesis) technology is also at the forefront of the industry and is widely used to produce storytelling, news, novels, and other content. Using the HiTTS technology framework, Shan Tianfang's voice has been faithfully reproduced. According to reports, Ximalaya has released more than 100 albums voiced by an AI synthesis of Shan Tianfang, with cumulative plays exceeding 100 million.

For many years, Ximalaya has carried out in-depth research in AI voice technology. Its Everest Laboratory has long focused on research and innovation in speech synthesis, emotion analysis, speech recognition, and related fields. By winning the ASRU 2023 M2MeT2.0 Challenge, Ximalaya has further consolidated its leading position in voice technology and shown its ability to handle complex speech scenarios.

As an online audio platform popular with users, Ximalaya has always held to the idea of empowering culture with technology, bringing technology to creators and users to improve content production efficiency and deliver an excellent content experience. Ximalaya will continue to combine advanced, intelligent voice technology with audio content through the integration of industry, academia, and research, and to provide users with excellent voice technology products and services.
