There is good news in the field of audio generation: Stability AI has just announced the launch of the open model Stable Audio Open, which can generate high-quality audio data.
Project address: https://huggingface.co/stabilityai/stable-audio-open-1.0
Unlike Stability AI's commercial Stable Audio product, which can generate longer, coherent music tracks of up to three minutes, Stable Audio Open generates up to 47 seconds of high-quality audio from a simple text prompt.
The model was designed for music production and sound design. It can produce drum beats, instrumental riffs, ambient sounds, foley recordings, and other audio samples for use in those fields. While it can generate short snippets of music, it is not optimized for full songs, melodies, or vocals.
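Generation from a text prompt can be sketched with the `stable_audio_tools` package that Stability AI publishes alongside the model. The snippet below follows the usage pattern from the model card; the prompt and the sampler settings (`steps`, `cfg_scale`, `sigma_min`, `sigma_max`) are illustrative values, and running it requires installing `stable-audio-tools`, accepting the model license on Hugging Face, and downloading the weights (a GPU is strongly recommended).

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download and load the model (requires accepting the license on Hugging Face).
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
model = model.to(device)

# Text conditioning: the prompt plus the desired time window (max 47 s).
conditioning = [{
    "prompt": "128 BPM tech house drum loop",  # example prompt
    "seconds_start": 0,
    "seconds_total": 30,
}]

# Run the diffusion sampler; parameter values here are illustrative.
output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_size,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device=device,
)

# Collapse the batch dimension, peak-normalize, and write a 16-bit WAV file.
output = rearrange(output, "b d n -> d (b n)")
output = (
    output.to(torch.float32)
    .div(torch.max(torch.abs(output)))
    .clamp(-1, 1)
    .mul(32767)
    .to(torch.int16)
    .cpu()
)
torchaudio.save("output.wav", output, sample_rate)
```

The same entry point is what fine-tuning on custom audio builds on: after training, the fine-tuned checkpoint is loaded in place of the pretrained one.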
The main advantage of Stable Audio Open is that users can fine-tune the model based on their own custom audio data.
For example, a drummer could fine-tune the model on recordings of their own drumming to generate new beats. Demo clips accompanying the release include birds singing in a forest and a short piece of rock music.

Training details and dataset
Stable Audio Open is a latent diffusion model built on a Transformer architecture. It consists of three components: an autoencoder that compresses waveforms into a manageable sequence length, a T5-based text embedding for text conditioning, and a Transformer-based diffusion model (DiT) that operates in the autoencoder's latent space, conditioned on the text embedding. The design aims to balance efficiency with audio generation quality.
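The autoencoder's role can be illustrated with rough arithmetic: raw audio at 44.1 kHz is far too long a sequence for a Transformer to attend over directly, so the autoencoder compresses it before diffusion. The 2048x downsampling factor below is an assumed example value, not a figure from the article.

```python
# Why the autoencoder matters: it shortens the sequence the DiT must model.
SAMPLE_RATE = 44_100   # samples per second, per channel
MAX_SECONDS = 47       # maximum clip length Stable Audio Open generates
DOWNSAMPLE = 2048      # hypothetical autoencoder compression factor

raw_len = SAMPLE_RATE * MAX_SECONDS   # raw waveform length per channel
latent_len = raw_len // DOWNSAMPLE    # latent sequence length the DiT sees

print(raw_len)     # 2072700
print(latent_len)  # 1012
```

Two million raw samples versus roughly a thousand latent steps is the difference between an intractable and a manageable attention sequence.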
As AI music generators, including Stability AI's, have grown in popularity, questions of copyright and of possible copyright abuse by some generator makers have come into focus.
For this model, Stability AI paid close attention to the provenance of the training data. Copyright issues in the training of its text-to-image models previously embroiled the company in controversy. Stable Audio Open is therefore trained on audio data from Freesound and the Free Music Archive, ensuring that no copyrighted or proprietary material is used, so that anyone using Stable Audio Open will not infringe copyright or proprietary rights.
The dataset contains 486,492 audio recordings in total: 472,618 from Freesound and 13,874 from the Free Music Archive. All audio files are licensed under CC0, CC BY, or CC Sampling+. This data was used to train the autoencoder and the DiT; for text conditioning, the researchers used a publicly available pre-trained T5 model (t5-base).
Before starting training, the researchers conducted in-depth analysis to ensure that there was no unauthorized copyrighted music in the training data.
They first identified music samples in Freesound using a PANNs music classifier based on AudioSet categories. A sample was identified as music if at least 30 seconds of it were predicted to belong to a music-related category with a threshold of 0.15 (PANNs outputs probabilities ranging from 0 to 1).
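The filtering rule described above can be sketched as a few lines of plain Python. The per-second probabilities would come from a PANNs classifier in practice; here they are dummy values, and all names are illustrative.

```python
# Sketch of the FreeSound music-identification rule: a clip is flagged as
# "music" if at least 30 seconds of it are predicted to belong to a
# music-related category with probability >= 0.15.
THRESHOLD = 0.15       # PANNs outputs probabilities in [0, 1]
MIN_MUSIC_SECONDS = 30

def is_music(per_second_music_probs):
    """per_second_music_probs: max music-category probability per second."""
    music_seconds = sum(1 for p in per_second_music_probs if p >= THRESHOLD)
    return music_seconds >= MIN_MUSIC_SECONDS

# A 60-second clip with 40 seconds above threshold -> flagged as music.
clip_a = [0.4] * 40 + [0.05] * 20
# A 60-second clip with only 10 seconds above threshold -> not flagged.
clip_b = [0.2] * 10 + [0.01] * 50

print(is_music(clip_a))  # True
print(is_music(clip_b))  # False
```

Only clips flagged this way were forwarded to the copyright check described next.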
The identified music samples were then sent to the identification service of Audible Magic, a trusted content-detection company, to ensure no copyrighted music was present. Audible Magic flags music that appears to be copyrighted, and flagged items were removed before training. Most of the removed content consisted of live recordings containing copyrighted background music. After this processing, the researchers were left with 266,324 CC0, 194,840 CC-BY, and 11,454 CC Sampling+ audio recordings.
The last step was to ensure there was no copyrighted content in the FMA subset. Here the procedure differs slightly, since the FMA subset consists of musical signals. The researchers ran metadata searches against large databases of copyrighted music and flagged any possible matches, each of which was reviewed individually by a human. After this process, 8,967 CC-BY and 4,907 CC0 music recordings remained.
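The license counts quoted in this section can be reconciled against the stated totals, which is a quick sanity check on the reported figures:

```python
# Bookkeeping check on the dataset figures quoted in the article: the
# post-filtering Freesound license counts and the FMA counts should sum
# back to the stated per-source totals.
freesound = {"CC0": 266_324, "CC-BY": 194_840, "CC Sampling+": 11_454}
fma = {"CC-BY": 8_967, "CC0": 4_907}

assert sum(freesound.values()) == 472_618                       # Freesound total
assert sum(fma.values()) == 13_874                              # FMA total
assert sum(freesound.values()) + sum(fma.values()) == 486_492   # grand total
```

All three totals match the numbers given earlier, so the per-license breakdown is internally consistent.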
Limitations
As an audio generation model, Stable Audio Open 1.0 has some limitations, including:
It cannot generate realistic vocals;
It was trained on English descriptions and will not perform as well in other languages;
It does not work equally well for all music styles and cultures: the training data lacks diversity, so the model may not perform equally well across the various music genres and sound effects;
It is sometimes hard to predict which kind of text description yields the best results, so prompt engineering may be needed to obtain satisfactory output.
It is worth noting that Stable Audio Open is an open model but, technically speaking, not open source: it is not released under an Open Source Initiative (OSI)-approved license, but is instead provided to users under the Stability AI Non-Commercial Research Community Agreement.
At the same time, Stable Audio Open cannot be used for commercial purposes; the terms of service prohibit this. It also does not perform equally well across different musical styles and cultures, or with prompts in languages other than English.
Stability AI blames this on the training data. The model description reads: "Data sources may lack diversity, and not all cultures are equally represented in the data set. Samples generated by the model will reflect biases in the training data."