Local Text-to-Speech (TTS) and Voice Cloning with mlx-audio
Text to Speech (TTS) is Having a Moment
Text to speech (TTS) is getting really good. The latest models read text naturally and can clone voices with under a minute of reference audio.
What's even more exciting are new models and frameworks that allow you to do this all locally on an everyday laptop. Andrew Mead recently pointed out the mlx-audio library and the new Marvis-TTS model by Prince Canuma that are light enough to allow audio streaming locally.
I recently went down a rabbit hole using mlx-audio to generate audio voice-overs in the style of the CEO of one of SVI's clients for a company demo. Here are the setup notes I wish I had:
mlx-audio in 15 minutes
A 15-minute tutorial on installing and using mlx-audio on a Mac:
- Create a HuggingFace account if you don't have one already
- Create a HuggingFace read-only access token (top right: Profile → Access Tokens → Create new token)
- Add the token to your preferred terminal shell customization file (e.g. `.bashrc`, `.zshrc`, etc.):

  ```bash
  export HF_TOKEN=<your token here>
  ```
  Make sure to refresh your shell so that the environment variable is present (e.g. `source ~/.zshrc`, or open a new terminal).
- Install uv, a Python package manager, to keep everything contained. (If you're not using it already, this is a great article to read: You're probably using uv wrong by Reuven Lerner.)
  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```
- Set up a new uv project (e.g. `mlx-audio-sandbox`):

  ```bash
  uv init mlx-audio-sandbox
  ```

- Enter the newly created project directory:

  ```bash
  cd mlx-audio-sandbox
  ```

- Add mlx-audio to the project's dependencies:

  ```bash
  uv add mlx-audio
  ```
- For voice cloning, create a sample `ref_audio.wav`. To get the best quality voice cloning, I recommend:
  - Record your voice speaking for at least 30 seconds.
  - Ensure the recording contains only your voice (no background noise).
  - It's better if you include words that might have specific pronunciations for your future generated transcripts.
  - Cut the audio to the best ~10s, per the `mlx-audio` recommendation. Longer clips gave more unwanted variation in generation when I tested it.
  - You can cut the audio and better isolate the voice by asking your favorite LLM:

    > I have <file name>, how can I isolate the voice, trim to the first 10 seconds, and convert the file to .wav using ffmpeg?

    (See the end of this post for an example snippet that I used to clean up my audio.)
- Try out mlx-audio generation:
  - TTS with the Marvis 250m model and conversational_a voice:

    ```bash
    uv run mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1 --text "Here is some example text to show you that your script is working" --temperature 0.4 --top_p 0.9 --top_k 50 --play --voice conversational_a
    ```

  - TTS with the Kokoro-82M model and af_heart voice:

    ```bash
    uv run mlx_audio.tts.generate --model prince-canuma/Kokoro-82M --text "Here is some example text to show you that your script is working" --temperature 0.4 --top_p 0.9 --top_k 50 --play --voice af_heart
    ```

  - TTS with the csm-1b model and a custom voice, using the `ref_audio.wav` that you created earlier:

    ```bash
    uv run mlx_audio.tts.generate --model mlx-community/csm-1b --text "Here is some example text to show you that your script is working" --temperature 0.4 --top_p 0.9 --top_k 50 --play --ref_audio ./ref_audio.wav
    ```
- You'll hear the audio once it has finished processing. It will also create an `audio_000.wav` file in your directory. Alternatively, you can add `--stream` to have the audio stream as it's generated.
- You can also run this as a server with a web interface for the out-of-box voices:

  ```bash
  mlx_audio.server --host 0.0.0.0 --port 9000
  ```

  You can then access the server at `localhost:9000` in your web browser.
There are a whole host of other features, including calling mlx-audio as a Python library inside your own code and other HuggingFace-hosted models that you can pull down and try. This is just scratching the surface to get you started.
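For the library route, here's a minimal sketch based on the example in the mlx-audio README. Treat the argument names as assumptions; the `generate_audio` signature may differ between versions, so check the README for your installed release:

```python
# Minimal sketch of calling mlx-audio as a library rather than via the CLI.
# Based on the mlx-audio README example; argument names may vary by version.
from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="Here is some example text to show you that your script is working",
    model_path="prince-canuma/Kokoro-82M",  # any compatible HF repo id
    voice="af_heart",
    file_prefix="demo",        # writes demo_000.wav to the current directory
    audio_format="wav",
    join_audio=True,           # merge generated segments into one file
    verbose=True,
)
```

Save it as, say, `demo.py` and run it inside the project with `uv run python demo.py`.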
Notes
Reactions
The out-of-box voices are fantastic, and voice cloning is pretty unbelievable when it works too. No magic black box in the cloud. It's running right on your machine.
However, voice cloning quality can be variable. Sometimes you get generated speech that sounds exactly like the sample. Other times, you'll get weird background noise or "slamming" sounds at the start, or the voice will sound absolutely nothing like the reference. There are a few ways to improve this (below).
This has also been a good way to quickly build some practical muscle with the latest generative media tools.
Improving Quality
Rerun Generation - The best bet is to rerun the generation; these are stochastic systems, after all, so some run-to-run variability is expected. You'll usually get a good version within a few runs. If you want to script this, see the sketch below.
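A quick loop can render several takes in one go so you can pick the best by ear. This sketch reuses the (assumed) `generate_audio` call from the library example above; `ref_audio` and `temperature` mirror the CLI flags and may not match the library's argument names exactly:

```python
# Sketch: render several stochastic takes of the same line, then listen
# and keep the best one. Assumes the generate_audio signature shown earlier.
from mlx_audio.tts.generate import generate_audio

LINE = "Welcome to the demo. Let's walk through the numbers."

for i in range(4):
    generate_audio(
        text=LINE,
        model_path="mlx-community/csm-1b",
        ref_audio="./ref_audio.wav",  # your voice-cloning sample
        temperature=0.7,              # leave room for take-to-take variation
        file_prefix=f"take_{i}",      # take_0_000.wav, take_1_000.wav, ...
        audio_format="wav",
    )
```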
Reference Audio - I've found that getting a better 10-second sample and keeping the sample to ~10s helps a lot. Longer samples counterintuitively seemed to add more variability in my limited testing.
Parameter Tweaking - You can play around with the parameters referenced in the scripts above to tweak the output (see the toy sketch after this list). I've found that messing with these too much results in poorer quality:
- temperature = randomness / "creativity" of model predictions (e.g. the 0.7 default is reasonably creative)
- top_p = nucleus sampling to control the cumulative probability threshold for token selection (e.g. 0.9 means 90% of probability mass)
- top_k = sampling to limit selection to the top k most probable tokens at each step (e.g. 50 means the top 50 tokens)
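If it helps to see the mechanics, here's a toy, self-contained sketch of how these three knobs interact when sampling a single token. It's illustrative only and is not mlx-audio's actual sampler:

```python
# Toy sketch of how temperature, top_k, and top_p interact when sampling
# one token from a set of logits. Illustrative only.
import math
import random

def sample(logits, temperature=0.7, top_k=50, top_p=0.9):
    # Temperature scales the logits before softmax: lower = more deterministic.
    weights = [math.exp(l / temperature) for l in logits]
    total = sum(weights)
    probs = [w / total for w in weights]

    # top_k: keep only the k most probable tokens.
    ranked = sorted(enumerate(probs), key=lambda x: -x[1])[:top_k]

    # top_p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        mass += p
        if mass >= top_p:
            break

    # Renormalize what's left and draw one token.
    mass = sum(p for _, p in kept)
    return random.choices([i for i, _ in kept],
                          [p / mass for _, p in kept])[0]

print(sample([2.0, 1.5, 0.3, -1.0]))  # usually token 0 or 1
```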
Other Models - I'll probably test other compatible models beyond the ones above in the future, to try to get higher quality out of the system.
Other Notes
Changing Voices - Each model compatible with mlx-audio (e.g. csm-1b) has a Files and versions tab with a `prompts` folder. This folder contains the voices that you can pass as `--voice <voice>`. Note: do NOT include the file extension, e.g. use `--voice conversational_a` without the `.wav`. These voices are also a good reference for the reference audio that you create for a custom voice. If you'd rather not click through the Hub UI, you can list the folder programmatically, as in the sketch below.
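A small sketch using `huggingface_hub` (installed alongside mlx-audio for model downloads; `uv add huggingface_hub` if it isn't already present). The repo id and `prompts/` layout are the ones described above:

```python
# Sketch: list a model's built-in voices by reading its prompts/ folder
# on the HuggingFace Hub.
from huggingface_hub import list_repo_files

files = list_repo_files("mlx-community/csm-1b")
voices = [
    f.removeprefix("prompts/").rsplit(".", 1)[0]  # drop folder and extension
    for f in files
    if f.startswith("prompts/")
]
print(voices)  # pass any of these (without extension) as --voice
```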
Additional References
mlx-audio on GitHub
ffmpeg Voice Audio Clean Up Script
This converts the input to mono, resamples to 24 kHz, band-limits it to the voice range (80 Hz–8 kHz), denoises, and gently compresses levels before writing a 16-bit `ref_audio.wav`:

```bash
ffmpeg -y -hide_banner -i input.wav \
  -af "aformat=channel_layouts=mono,aresample=24000:resampler=soxr:dither_method=triangular,highpass=f=80,lowpass=f=8000,afftdn=nf=-23,acompressor=threshold=-21dB:ratio=2.5:attack=4:release=60:makeup=2" \
  -ar 24000 -sample_fmt s16 ref_audio.wav
```
List of Available mlx_audio.tts.generate Options
```
uv run mlx_audio.tts.generate --help

usage: mlx_audio.tts.generate [-h] [--model MODEL] [--max_tokens MAX_TOKENS]
[--text TEXT] [--voice VOICE] [--speed SPEED]
[--gender GENDER] [--pitch PITCH]
[--lang_code LANG_CODE]
[--file_prefix FILE_PREFIX] [--verbose]
[--join_audio] [--play]
[--audio_format AUDIO_FORMAT]
[--ref_audio REF_AUDIO] [--ref_text REF_TEXT]
[--stt_model STT_MODEL]
[--temperature TEMPERATURE] [--top_p TOP_P]
[--top_k TOP_K]
[--repetition_penalty REPETITION_PENALTY]
[--stream]
[--streaming_interval STREAMING_INTERVAL]
Generate audio from text using TTS.
options:
-h, --help show this help message and exit
--model MODEL Path or repo id of the model
--max_tokens MAX_TOKENS
Maximum number of tokens to generate
--text TEXT Text to generate (leave blank to input via stdin)
--voice VOICE Voice name
--speed SPEED Speed of the audio
--gender GENDER Gender of the voice [male, female]
--pitch PITCH Pitch of the voice
--lang_code LANG_CODE
Language code
--file_prefix FILE_PREFIX
Output file name prefix
--verbose Print verbose output
--join_audio Join all audio files into one
--play Play the output audio
--audio_format AUDIO_FORMAT
Output audio format
--ref_audio REF_AUDIO
Path to reference audio
--ref_text REF_TEXT Caption for reference audio
--stt_model STT_MODEL
STT model to use to transcribe reference audio
--temperature TEMPERATURE
Temperature for the model
--top_p TOP_P Top-p for the model
--top_k TOP_K Top-k for the model
--repetition_penalty REPETITION_PENALTY
Repetition penalty for the model
--stream Stream the audio as segments instead of saving to a
file
--streaming_interval STREAMING_INTERVAL
The time interval in seconds for streaming segments
```