What You Can Train — The Unsloth TTS Notebooks
The Unsloth team has built a library of fine-tuning notebooks that make training large models dramatically more efficient. Their TTS notebooks bring that same accessibility to speech models — you can train a custom voice in about 15 minutes on Colab's free T4 GPU.
Available models include:
Spark TTS
Compact 0.5B parameter model — fast to train, easy to run locally on 4GB VRAM. Great starting point for beginners.
Orpheus TTS
Highly expressive with emotion tags and zero-shot voice cloning. Great quality, slightly more demanding to run.
Sesame CSM TTS
Conversational speech model with natural turn-taking and dialogue flow — ideal for voice assistant applications.
Oute TTS
Advanced control over audio output — suited for users who want fine-grained tuning of delivery and style.
What You'll Need Before Starting
- A Google account — for Colab access
- A Hugging Face account — for downloading models and uploading your dataset
- Audio recordings of your target voice — clear, good quality clips work best
Step 1 — Set Up Your Colab Environment
- Open the Unsloth Spark TTS Notebook — head to the Unsloth documentation page (link below) and open the Spark TTS Colab notebook. Sign into your Google account if prompted.
- Connect to the T4 GPU Runtime — click the connect button in the top-right. Go to Change runtime type and select T4 GPU. This is free — no Colab Pro needed.
- Create a Hugging Face Access Token — on Hugging Face, go to Settings → Access Tokens and create a new token with write permissions. Copy it — you'll need it in the next step.
- Add Your HF Token to Colab Secrets — in the Colab left sidebar, click the key icon (Secrets). Add a new secret named `HF_TOKEN` (all caps, underscore). Paste your token and toggle it to be accessible by the notebook.
Step 2 — Prepare Your Training Dataset
TTS fine-tuning requires a dataset of audio clips paired with their text transcriptions. The format for single-speaker models is typically text and audio columns; multi-speaker models add a source column.
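As a rough sketch, rows in such a dataset look like the dicts below. The column names follow the convention described above; the audio value mirrors Hugging Face's `Audio` feature (an array plus sampling rate), and the sample values here are toy stand-ins:

```python
# Sketch of the expected row structure for a single-speaker TTS dataset.
# The "text" column holds the transcription; "audio" holds the waveform
# data (toy values below) and its sampling rate.
single_speaker_row = {
    "text": "Hello, this is a sample training sentence.",
    "audio": {"array": [0.0, 0.01, -0.02], "sampling_rate": 16000},
}

# Multi-speaker datasets add a column identifying the speaker/source.
multi_speaker_row = {**single_speaker_row, "source": "speaker_01"}

print(sorted(single_speaker_row))  # ['audio', 'text']
print(sorted(multi_speaker_row))   # ['audio', 'source', 'text']
```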
You have two options:
- Pre-made datasets — search Hugging Face for existing TTS datasets you can use directly
- Custom dataset — record your own audio and build the dataset from scratch
The TTS Dataset Creator Tool
To streamline custom dataset creation, The Local Lab built a Python/Gradio tool called the TTS Dataset Creator. Here's what it does automatically:
- Accepts audio file uploads of your target voice
- Cuts them into 10-second clips (a length that works well for TTS training)
- Transcribes each clip using a local Whisper Small model
- Packages everything into a .parquet file ready for Hugging Face upload
- Opens the dataset in Renumics Spotlight for local visual inspection
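The slicing step above can be sketched as follows. This is a minimal illustration, not the tool's actual code: it computes 10-second clip boundaries by sample index, with the Whisper transcription step left out (a real pipeline would run a local Whisper Small model on each clip):

```python
# Split a recording into consecutive 10-second windows by sample index.
def clip_boundaries(total_samples: int, sampling_rate: int, clip_seconds: int = 10):
    """Yield (start, end) sample indices for fixed-length clips."""
    clip_len = clip_seconds * sampling_rate
    for start in range(0, total_samples, clip_len):
        yield start, min(start + clip_len, total_samples)

# A 25-second recording at 16 kHz yields two full clips plus a 5-second tail.
bounds = list(clip_boundaries(total_samples=25 * 16000, sampling_rate=16000))
print(bounds)  # [(0, 160000), (160000, 320000), (320000, 400000)]
```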
Uploading Your Dataset to Hugging Face
- Create a New Dataset Repository on HF — on Hugging Face, create a new dataset repo. Navigate to Files and Versions and upload your `.parquet` file.
- Copy Your Repo Name — copy the repository name in the format `your-username/your-dataset-name`. You'll paste this into the Colab notebook's data preparation cell.
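If you prefer the command line to the web upload, the `huggingface-cli` tool that ships with `huggingface_hub` can push the file directly (the repo name and filename below are placeholders for your own):

```shell
# Log in once with your write-access token, then upload the parquet file
# to your dataset repo (replace the placeholder names with your own).
huggingface-cli login
huggingface-cli upload your-username/your-dataset-name dataset.parquet --repo-type dataset
```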
Step 3 — Run the Training
- Run the Setup Cells — run the first cells in order — they install dependencies and download the base Spark TTS model. Don't skip any cells or interrupt this process.
- Load Your Dataset — in the data preparation cell, find the `load_dataset()` call and paste your Hugging Face repository name inside the parentheses. Run the cell, then run the tokenization cell that follows.
- Handle the BF16 Bug (if needed) — if you hit an error related to BF16 detection, install a slightly older Unsloth version:
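The exact version pin isn't stated here, so the command below uses a placeholder; substitute an Unsloth release that predates the BF16 detection bug:

```shell
# Placeholder pin: replace <version> with an Unsloth release from before
# the BF16 detection bug was introduced.
pip install --force-reinstall "unsloth==<version>"
```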
After installing, restart the Colab runtime and re-run the last three import cells before continuing.
- Run the Trainer Cell — run the trainer cell to begin fine-tuning. Monitor training loss and VRAM usage as it runs. On the free T4 GPU, Spark TTS typically finishes in ~15 minutes.
- Test and Download the Model — once training completes, use the inference cell to test your fine-tuned model. Then download the output files from the `outputs/` folder to your local machine.
Running Your Trained Model Locally
Spark TTS runs on as little as 4GB VRAM, making it one of the most accessible local TTS options available. To use your fine-tuned version locally:
- Install Spark TTS locally (one-click installer on Patreon, or follow the GitHub repo setup)
- Navigate to the Spark TTS models directory
- Replace the original model files with the files you downloaded from Colab
- Launch Spark TTS — it will now speak in your fine-tuned voice
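The file-swap in steps 2–3 can be sketched like this. The directory names are illustrative stand-ins (the example runs against temp dirs); point them at your actual Spark TTS model folder and the files you downloaded from Colab, and keep a backup so you can revert to the stock voice:

```python
# Swap fine-tuned weights into a local model directory, keeping a backup.
# Paths are illustrative temp dirs; use your real Spark TTS install paths.
import shutil
import tempfile
from pathlib import Path

workspace = Path(tempfile.mkdtemp())
model_dir = workspace / "SparkTTS" / "models"  # stand-in for the real install
finetuned = workspace / "colab_outputs"        # files downloaded from Colab
model_dir.mkdir(parents=True)
finetuned.mkdir(parents=True)
(model_dir / "model.safetensors").write_text("stock")
(finetuned / "model.safetensors").write_text("finetuned")

# Back up the original weights so you can restore the stock voice later.
backup = model_dir.with_name("models_backup")
shutil.copytree(model_dir, backup)

# Overwrite the stock files with the fine-tuned ones from Colab.
for f in finetuned.iterdir():
    shutil.copy2(f, model_dir / f.name)

print((model_dir / "model.safetensors").read_text())  # finetuned
print((backup / "model.safetensors").read_text())     # stock
```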
📦 Want to skip the setup?
The Local Lab offers pre-configured AI installer packages so you can get running in minutes, not hours.
Browse the Store →