Training the Mamba architecture on speech and music data using Determined AI
The author wanted to learn the Mamba architecture through hands-on practice by reproducing an open-source speech synthesis script, but repeated dataset failures made it hard to produce usable model output.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases rather than tied to any one vendor's stack.
Stage 1 · Download and split audio data
Audio data is downloaded in .mp4 format and split into 10-second .wav files.
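The splitting step can be sketched roughly as follows. This is a minimal illustration, not the author's script: the function names, the `clip_%05d.wav` naming pattern, the mono downmix, and the 16 kHz sample rate are all assumptions made for the example. It relies on ffmpeg's segment muxer and requires ffmpeg to be installed and on the PATH.

```python
from pathlib import Path
import subprocess


def build_split_command(source, out_dir, segment_seconds=10, sample_rate=16000):
    """Build an ffmpeg argument list that cuts `source` into fixed-length WAV clips.

    The sample rate, mono downmix, and output naming are assumptions for this
    sketch; adjust them to whatever the downstream tokenizer expects.
    """
    pattern = str(Path(out_dir) / "clip_%05d.wav")
    return [
        "ffmpeg",
        "-i", str(source),                      # input file, e.g. an .mp4 download
        "-vn",                                  # drop any video stream
        "-ac", "1",                             # downmix to mono (assumption)
        "-ar", str(sample_rate),                # resample (assumption)
        "-c:a", "pcm_s16le",                    # 16-bit PCM, the usual WAV encoding
        "-f", "segment",                        # ffmpeg's segment muxer
        "-segment_time", str(segment_seconds),  # clip length in seconds
        pattern,
    ]


def split_audio(source, out_dir, segment_seconds=10):
    """Create `out_dir` if needed and run ffmpeg; raises if ffmpeg exits non-zero."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_split_command(source, out_dir, segment_seconds), check=True)
```

Calling `split_audio("audiobook.mp4", "clips")` would write `clips/clip_00000.wav`, `clips/clip_00001.wav`, and so on. The final clip is usually shorter than 10 seconds, so it may be worth filtering out before training.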
Tools used
Mamba, SpeechTokenizer, Determined AI, ffmpeg, WandB
Outcome
Trained on the Alice in Wonderland audiobook dataset, the model produced audio that sounds like the input without being memorized. Using 4 quantizers gave better training results than using 8.
What failed first
Three of four candidate datasets failed: the Schmidt dialogues dataset was too small, so the model overfit; SpeechTokenizer discards music, which made the Taylor Swift dataset unusable; and the AI Morgan Freeman audio contained excessive pauses, which produced mostly empty model outputs.
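One way to catch the pause problem before training is to estimate how much of a clip is near-silent. The sketch below is a hypothetical screening check, not something described in the source: the function name, the 1600-sample frame length, and the 0.01 RMS threshold are illustrative assumptions. It expects samples already scaled to the range [-1, 1].

```python
import math


def silence_ratio(samples, frame_len=1600, threshold=0.01):
    """Return the fraction of frames whose RMS amplitude is below `threshold`.

    `samples` is a sequence of floats in [-1, 1]. The frame length and the
    threshold are illustrative assumptions, not values from the source.
    """
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    if not frames:
        return 0.0
    quiet = 0
    for frame in frames:
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < threshold:
            quiet += 1
    return quiet / len(frames)
```

A clip with a high ratio is mostly pauses, so a dataset dominated by such clips could be trimmed or set aside before any compute is spent on it.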
Results
Volume: 12M
Grounding & classification
Source type: technical build writeup
16 fields verified against source quotes.
content generation, failure mode described, source backed, tools described, workflow described, software, accuracy improvement, technical build writeup