Nvidia’s new AI model can generate and mix different types of audio
Nvidia on Monday introduced a new artificial intelligence (AI) model that can generate a variety of audio and mix different types of sounds. The tech giant calls the base model Fugatto, which is short for Foundational Generative Audio Transformer Opus 1. While audio-focused AI platforms such as Beatoven and Suno exist, the company emphasized that Fugatto offers users granular control over the desired output. The AI model can generate or transform any mix of music, voices and sounds based on specific cues.
Nvidia introduces AI audio model Fugatto
In a blog post, the tech giant detailed the new model. Nvidia said Fugatto can generate music samples, remove or add instruments in an existing song, change the accent or emotion in a voice and “even let people produce sounds they’ve never heard before.”
The AI model accepts both text and audio files as input, and users can combine the two to refine their requests. Under the hood, the base model’s architecture builds on the company’s previous work in speech modeling, audio vocoding and audio understanding. The full version uses 2.5 billion parameters and was trained on Nvidia DGX systems.
Nvidia highlighted that the team that built Fugatto worked together from several countries around the world, including Brazil, China, India, Jordan and South Korea. The collaboration of people from different ethnic groups also helped develop the AI model’s multi-accent and multi-language capabilities, the company said.
As for the AI audio model’s capabilities, the tech giant highlighted that it has the ability to generate audio output types for which it was not pre-trained. Nvidia highlighted an example: “Fugatto can make a trumpet bark or a saxophone meow. Whatever users can describe, the model can create.”
In addition, Fugatto can combine specific audio capabilities using a technique called ComposableART. It allows users to ask the AI model to generate audio of, say, a person speaking French in a sad tone. Users can also control the degree of sadness and the heaviness of the accent with specific instructions.
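Nvidia has not published the internals of ComposableART, but the idea of blending several instructions, each with its own user-chosen strength, can be illustrated as a weighted combination of per-attribute conditioning vectors. Everything below (the function name, the two-dimensional toy embeddings) is a hypothetical sketch, not Nvidia’s API:

```python
# Hypothetical sketch of ComposableART-style attribute composition:
# each instruction ("French accent", "sad tone") is represented by a
# conditioning vector, and the user dials in a weight per attribute.

def compose_attributes(embeddings, weights):
    """Blend per-attribute conditioning vectors by user-chosen weights."""
    if len(embeddings) != len(weights):
        raise ValueError("one weight per attribute embedding")
    dim = len(embeddings[0])
    combined = [0.0] * dim
    for emb, w in zip(embeddings, weights):
        for i, v in enumerate(emb):
            combined[i] += w * v
    return combined

# Toy example: "French accent" at 0.8 strength, "sad tone" at 0.5.
french = [1.0, 0.0]  # illustrative embedding, not real model data
sad = [0.0, 1.0]
blended = compose_attributes([french, sad], [0.8, 0.5])
print(blended)  # [0.8, 0.5]
```

The blended vector would then condition the generator, which is how a single request can carry several graded attributes at once.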
Furthermore, the base model can also generate audio with temporal interpolation, that is, sounds that change over time. For example, users can generate the sound of a rain shower with thunder that fades into the distance. Users can experiment with these soundscapes, and the model can create them even when it has never processed that particular sound before.
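Temporal interpolation can be pictured as gradually blending between two sound descriptions across the duration of the clip. The sketch below assumes a made-up two-element conditioning vector (`[rain, thunder]`) purely for illustration; it is not Nvidia’s implementation:

```python
# Hypothetical sketch of temporal interpolation: linearly morph the
# conditioning from a start state to an end state across frames, e.g.
# from "rain with thunder" to "rain only" so the thunder fades away.

def interpolate_over_time(start, end, steps):
    """Return `steps` conditioning vectors blending start -> end."""
    frames = []
    for t in range(steps):
        alpha = t / (steps - 1) if steps > 1 else 1.0
        frames.append([(1 - alpha) * s + alpha * e
                       for s, e in zip(start, end)])
    return frames

rain_with_thunder = [1.0, 1.0]  # illustrative [rain, thunder] levels
rain_only = [1.0, 0.0]
frames = interpolate_over_time(rain_with_thunder, rain_only, 5)
# frames[0] is full thunder, frames[-1] has none; rain stays constant.
```

Each interpolated frame would condition a slice of the generated audio, producing a soundscape that evolves smoothly rather than switching abruptly.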
At this time, the company has not shared any plans to make the AI model available to users or enterprises.