
Training the Mamba architecture on speech and music data using Determined AI

The author wanted to learn the Mamba architecture through hands-on practice by reproducing an open-source speech synthesis script, and encountered repeated dataset failures that made it hard to produce usable model output.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Download and split audio data

Audio data is downloaded in .mp4 format and split into 10-second .wav files.
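The splitting step can be sketched as a small Python wrapper around ffmpeg's segment muxer. This is a minimal sketch, not the author's script: the function names, the output file pattern, the mono downmix, and the 16 kHz sample rate are assumptions chosen for illustration. Only the .mp4 input, the .wav output, and the 10-second clip length come from the description above.

```python
import subprocess
from pathlib import Path


def build_split_command(src: str, out_dir: str, seconds: int = 10,
                        sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that cuts `src` into fixed-length .wav clips.

    The clip naming pattern, mono downmix, and sample rate are assumptions
    made for this sketch.
    """
    pattern = str(Path(out_dir) / "clip_%05d.wav")
    return [
        "ffmpeg",
        "-i", src,                      # input file, e.g. a downloaded .mp4
        "-vn",                          # drop any video stream
        "-ac", "1",                     # downmix to mono (assumption)
        "-ar", str(sample_rate),        # resample (assumption: 16 kHz)
        "-f", "segment",                # use the segment muxer
        "-segment_time", str(seconds),  # target clip length in seconds
        pattern,
    ]


def split_audio(src: str, out_dir: str, seconds: int = 10) -> None:
    """Create the output directory and run ffmpeg (must be on PATH)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_split_command(src, out_dir, seconds), check=True)
```

Calling `split_audio("audiobook.mp4", "clips")` would write `clips/clip_00000.wav`, `clips/clip_00001.wav`, and so on. The last clip is usually shorter than 10 seconds, so a training pipeline may want to drop or pad it.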

Tools used

Mamba, SpeechTokenizer, Determined AI, ffmpeg, WandB

Outcome

Trained on the Alice in Wonderland audiobook dataset with 4 quantizers, the model produced audio that sounds like the input without being memorized. Training with 4 quantizers gave better results than training with 8.

What failed first

Three of four candidate datasets failed: the Schmidt dialogues were too small, causing the model to overfit; SpeechTokenizer discards music, making the Taylor Swift dataset unusable; and the AI Morgan Freeman audio contained excessive pauses that produced mostly empty model outputs.
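One way to catch the pause problem before training is to measure how much of a clip is near-silent. The source does not describe such a check; the sketch below is a hypothetical illustration, and the function name, frame length, and RMS threshold are all assumptions. It expects samples already decoded to floats in the range -1.0 to 1.0.

```python
import math
from typing import Sequence


def silence_fraction(samples: Sequence[float], frame_len: int = 400,
                     threshold: float = 0.01) -> float:
    """Return the fraction of frames whose RMS amplitude is below `threshold`.

    `frame_len` (in samples) and `threshold` are illustrative defaults,
    not values taken from the source.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    # Cut the signal into consecutive, non-overlapping frames.
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples), frame_len)]
    if not frames:
        return 0.0
    quiet = 0
    for frame in frames:
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < threshold:
            quiet += 1
    return quiet / len(frames)
```

A dataset whose clips mostly score close to 1.0 would be a candidate for trimming or exclusion, since a model trained on it is likely to learn to emit silence.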

Results

Volume: 12M

Source

https://mlops.community/blog/audio-generation-with-mamba-using-determined-ai


Grounding & classification

Source type: technical build writeup

16 fields verified against source quotes.

content generation, failure mode described, source backed, tools described, workflow described, software, accuracy improvement, technical build writeup