Three major challenges of artificial intelligence voice technology

Artificial intelligence practitioners often encounter three common obstacles when it comes to speech-to-speech technology.

The prospect of artificial intelligence (AI) generating human-like data has been discussed for decades, yet data scientists have tackled the problem with limited success. Identifying effective strategies for building such systems poses challenges that range from the technical to the ethical and everything in between. Generative AI, however, has emerged as a bright spot to watch.

At its most basic, generative AI enables machines to generate content, from speech to writing to art, using inputs such as audio files, text, and images. Technology investment firm Sequoia Capital has said that generative AI is on its way to becoming not just faster and cheaper but, in some cases, better than what humans create by hand.

Recent advances in generative machine learning, particularly for speech, have been substantial, but there is still a long way to go. In fact, the voice compression in apps that people rely on heavily, such as Zoom and Teams, is still based on technology from the 1980s and 1990s. While speech-to-speech technology has enormous potential, it is critical to assess the challenges and shortcomings that stand in the way of generative AI development.

Here are three common obstacles that AI practitioners face when it comes to speech-to-speech technology.

1. Sound Quality

Arguably the most important quality of good dialogue is that it is understandable. In the case of speech-to-speech technology, the goal is to sound like a human; the intonation of Siri and Alexa, for example, is robotic and not always clear. Natural-sounding speech is difficult for AI to achieve for several reasons, and the nuances of human language play a big role.

Mehrabian's rule can help explain this. It divides human communication into three parts: 55% facial expressions, 38% tone of voice, and only 7% words. Machine understanding relies on words, or content, to operate. Only recent advances in natural language processing (NLP) have made it possible to train AI models on mood, emotion, timbre, and other important (but not necessarily spoken) aspects of language. The task is even more challenging when working only with audio and not video, because more than half of understanding comes from facial expressions.

2. Latency

Comprehensive AI analysis can take time, but in voice-to-voice communication, real time is the only time that matters. Speech conversion must occur as the person speaks. It also has to be accurate, which is no easy task for a machine.
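
One common way to make "real time" concrete is the real-time factor (RTF): processing time divided by the duration of the audio processed. An RTF below 1 means the system keeps up with the speaker. The following is a minimal sketch, not a reference implementation; convert_chunk is a hypothetical stand-in for whatever speech conversion model is in use, and the chunking scheme is an assumption made for illustration.

```python
import time
from typing import Callable, List


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Ratio of compute time to audio duration; below 1.0 keeps up with speech."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(
    convert_chunk: Callable[[List[float]], List[float]],  # hypothetical model call
    chunks: List[List[float]],
    sample_rate: int,
) -> float:
    """Time the conversion function over a list of audio chunks (raw samples)."""
    total_audio_seconds = sum(len(chunk) for chunk in chunks) / sample_rate
    start = time.perf_counter()
    for chunk in chunks:
        convert_chunk(chunk)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, total_audio_seconds)
```

RTF only captures throughput. A conversational system would also need to account for the delay added by buffering each chunk before it can be processed.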

How much real time matters varies by industry. For example, a content creator producing podcasts might care more about sound quality than about real-time voice conversion. In an industry like customer service, however, time is of the essence: call center agents who use voice-assisted AI to respond to callers may accept some sacrifice in quality, because responding promptly is essential to a positive experience.

3. Scale

For speech-to-speech technology to reach its potential, it must support a variety of accents, languages, and dialects and be available to everyone, not just a specific region or market. This requires mastering the specific application of the technology and doing a great deal of tuning and training in order to scale effectively.

Emerging technology solutions are not one-size-fits-all; supporting every user of a given solution may require AI infrastructure that spans thousands of architectures. Users should also expect models to be tested consistently. None of this is new: all the classic challenges of machine learning also apply to generative AI.
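
Consistent testing across accents and dialects could look something like the sketch below. It assumes word error rate (WER) as an intelligibility proxy, which the discussion above does not prescribe: converted speech would be transcribed by some speech recognizer and compared against a reference transcript, with results grouped by accent. The function names and the sample format are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """samples: (group label, reference transcript, system transcript)."""
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for group, reference, hypothesis in samples:
        err, count = word_errors(reference, hypothesis)
        errors[group] += err
        words[group] += count
    # Skip groups with no reference words to avoid dividing by zero.
    return {g: errors[g] / words[g] for g in words if words[g] > 0}
```

Reporting the metric per group rather than as a single average makes it visible when a model works well for one accent and poorly for another.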

So how can people start to solve these problems and begin to realize the value of speech-to-speech technology? Fortunately, it is less daunting when broken down step by step. First, you must master the problem. Earlier I gave the examples of a call center and a content creator. Think through your use cases and desired outcomes, and go from there.

Second, make sure your organization has the right architecture and algorithms. Before that, though, make sure your business has the right data. Data quality is important, especially when dealing with something as sensitive as human language and speech. Finally, if your application requires real-time speech conversion, make sure that feature is supported. Ultimately, no one wants to talk to a robot.

While ethical concerns about generative AI, such as deepfakes, consent, and appropriate disclosure, are now emerging, it is important to first understand and address the fundamental issues. Speech-to-speech technology has the potential to revolutionize the way we understand each other, creating opportunities for innovation that brings people together. To achieve that goal, however, the major challenges must be faced first.
