Local Text-to-Speech (TTS) and Voice Cloning with mlx-audio

Text to Speech (TTS) is Having a Moment

Text to speech is getting really good. The latest models read text naturally and can clone voices from under a minute of reference audio.

Even more exciting are the new models and frameworks that let you do all of this locally on an everyday laptop. Andrew Mead recently pointed out the mlx-audio library and the new Marvis-TTS model by Prince Canuma, which are light enough to stream audio locally.

I recently went down a rabbit hole using mlx-audio to generate audio voice-overs in the style of the CEO of one of SVI's clients for a company demo. Here are the setup notes I wish I had:

mlx-audio in 15 minutes

A 15-minute tutorial on installing and using mlx-audio on a Mac:

  1. Create a HuggingFace account if you don't have one already
  2. Create a HuggingFace read-only access token
    Profile (top right) -> Access Tokens -> Create new token
  3. Add the token to your preferred terminal shell customization file (e.g. .bashrc, .zshrc etc.)
    export HF_TOKEN=<your token here>
    Make sure to refresh your shell (e.g. run source ~/.zshrc or open a new terminal) so that the environment variable is present.
  4. Install uv, a Python package manager, to keep everything contained. (If you're not using it already, this is a great article to read: You're probably using uv wrong by Reuven Lerner.)
    curl -LsSf https://astral.sh/uv/install.sh | sh
  5. Set up a new uv project (e.g. mlx-audio-sandbox)
    uv init mlx-audio-sandbox
  6. Enter the newly created project directory
    cd mlx-audio-sandbox
  7. Add mlx-audio to dependencies
    uv add mlx-audio
  8. For voice cloning, create a sample ref_audio.wav. For the best cloning quality, I recommend:
    1. Record your voice speaking for at least 30 seconds
    2. Ensure the recording contains only the voice (no background noise).
    3. It helps to include words whose pronunciation matters for the transcripts you plan to generate.
    4. Cut the audio to the best ~10s, per the mlx-audio recommendation. Longer clips gave more unwanted variation in generation when I tested it.
    5. You can cut the audio and better isolate the voice by asking your favorite LLM:
      I have <file name>, how can I isolate the voice, trim to the first 10 seconds, and convert the file to .wav using ffmpeg?
      (See the end of this post for the example snippet that I used to clean up my audio.)
  9. Try out mlx-audio generation:
    1. TTS with the Marvis 250m model and conversational_a voice:
      uv run mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1 --text "Here is some example text to show you that your script is working" --temperature 0.4 --top_p 0.9 --top_k 50 --play --voice conversational_a
    2. TTS with the Kokoro-82M model and af_heart voice:
      uv run mlx_audio.tts.generate --model prince-canuma/Kokoro-82M --text "Here is some example text to show you that your script is working" --temperature 0.4 --top_p 0.9 --top_k 50 --play --voice af_heart
    3. TTS with the csm-1b model and a custom voice using the ref_audio.wav that you created earlier:
      uv run mlx_audio.tts.generate --model mlx-community/csm-1b --text "Here is some example text to show you that your script is working" --temperature 0.4 --top_p 0.9 --top_k 50 --play --ref_audio ./ref_audio.wav
  10. You'll hear the audio once it has finished processing, and an audio_000.wav file will be created in your directory. Alternatively, add --stream to play the audio as it's generated (see the example after this list).
  11. You can also run mlx-audio as a server with a web interface for the out-of-the-box voices:
    uv run mlx_audio.server --host 0.0.0.0 --port 9000
    Then open localhost:9000 in your web browser.
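
As mentioned in step 10, adding --stream plays audio segments as they're generated instead of waiting for the whole file. For example, reusing the Marvis command from step 9:

  uv run mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1 --text "Here is some example text to show you that your script is working" --play --stream --voice conversational_a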

There are a whole host of other features, including calling mlx-audio as a Python library inside your own code, and plenty of other HuggingFace-hosted models you can pull down and try. This just scratches the surface to get you started.
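
As a taste of library usage, here is a minimal sketch based on the generate_audio example in the project README; the exact parameter names may shift between versions, so check the repo for the current API:

  from mlx_audio.tts.generate import generate_audio

  # Generate a clip with Kokoro and write it to library_demo_*.wav files
  generate_audio(
      text="Here is some example text to show you that your script is working",
      model_path="prince-canuma/Kokoro-82M",
      voice="af_heart",
      file_prefix="library_demo",
      audio_format="wav",
      join_audio=True,
  )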

Notes

Reactions

The out-of-the-box voices are fantastic, and voice cloning is pretty unbelievable when it works. No magic black box in the cloud. It's running right on your machine.

However, voice cloning quality can be variable. Sometimes you get generated speech that sounds exactly like the sample. Other times you'll get weird background noise, "slamming" sounds at the start, or a voice that sounds nothing like the reference. There are a few ways to improve this (below).

This has also been a good way to quickly build practical muscle with some of the latest generative media tools.

Improving Quality

Rerun Generation - Your best bet is to rerun the generation; these are stochastic systems, after all, so outputs vary from run to run. You'll usually get a good version within a few runs.
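
A quick way to do this is to generate several takes with distinct file prefixes and keep the best one by ear. A sketch using only flags from the --help output below (swap in your own text):

  for i in 1 2 3; do
    uv run mlx_audio.tts.generate --model mlx-community/csm-1b \
      --text "Your line here" --ref_audio ./ref_audio.wav \
      --file_prefix take_$i
  done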

Reference Audio - Getting a better 10-second sample, and keeping the sample to ~10s, helps a lot. Longer samples counterintuitively seemed to add more variability in my limited testing.

Parameter Tweaking - You can play with the sampling parameters in the commands above to tweak the output, though I've found that straying too far from the defaults results in poorer quality (see the example after this list):

  • temperature = randomness / "creativity" of model predictions (e.g. 0.7 default is reasonably creative)
  • top_p = nucleus sampling to control the cumulative probability threshold for token selection (e.g. 0.9 means 90% of probability mass)
  • top_k = sampling limited to the k most probable tokens at each step (e.g. 50 means the top 50 tokens)
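
For example, a slightly more conservative sampling setup for a cloning run might look like the following; the values are illustrative, not recommendations from the mlx-audio docs:

  uv run mlx_audio.tts.generate --model mlx-community/csm-1b --text "Here is some example text to show you that your script is working" --ref_audio ./ref_audio.wav --temperature 0.3 --top_p 0.85 --top_k 40 --play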

Other Models - I'll probably test other compatible models beyond the ones above to see whether they yield higher quality.

Other Notes

Changing Voices - Each model compatible with mlx-audio (e.g. csm-1b) has a Files and versions tab on HuggingFace with a prompts folder. This folder contains the voices that you can pass as --voice <voice>.

Note: Do NOT include the file extension. E.g. use --voice conversational_a without the .wav.

These prompt files are also a good reference point for the length and quality of the reference audio you create for a custom voice.
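
If you'd rather enumerate the voices without clicking through the web UI, here's a small sketch using huggingface_hub (which mlx-audio relies on for model downloads), assuming the repo keeps its voices under a prompts/ folder as described above:

  from huggingface_hub import list_repo_files

  # List voice prompt names in a model repo's prompts/ folder
  files = list_repo_files("mlx-community/csm-1b")
  voices = [f.split("/", 1)[1].rsplit(".", 1)[0]
            for f in files if f.startswith("prompts/")]
  print(voices)  # pass one of these via --voice (no file extension)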

Additional References

mlx-audio on Github

GitHub - Blaizzy/mlx-audio: A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple’s MLX framework, providing efficient speech analysis on Apple Silicon.

ffmpeg Voice Audio Clean Up Script

# Mono, 24 kHz: band-pass to the voice range, denoise, then lightly compress
ffmpeg -y -hide_banner -i input.wav \
  -af "aformat=channel_layouts=mono,aresample=24000:resampler=soxr:dither_method=triangular,\
highpass=f=80,lowpass=f=8000,\
afftdn=nf=-23,\
acompressor=threshold=-21dB:ratio=2.5:attack=4:release=60:makeup=2" \
  -ar 24000 -sample_fmt s16 ref_audio.wav

List of Available mlx_audio.tts.generate Options

uv run mlx_audio.tts.generate --help
usage: mlx_audio.tts.generate [-h] [--model MODEL] [--max_tokens MAX_TOKENS]
                              [--text TEXT] [--voice VOICE] [--speed SPEED]
                              [--gender GENDER] [--pitch PITCH]
                              [--lang_code LANG_CODE]
                              [--file_prefix FILE_PREFIX] [--verbose]
                              [--join_audio] [--play]
                              [--audio_format AUDIO_FORMAT]
                              [--ref_audio REF_AUDIO] [--ref_text REF_TEXT]
                              [--stt_model STT_MODEL]
                              [--temperature TEMPERATURE] [--top_p TOP_P]
                              [--top_k TOP_K]
                              [--repetition_penalty REPETITION_PENALTY]
                              [--stream]
                              [--streaming_interval STREAMING_INTERVAL]

Generate audio from text using TTS.

options:
  -h, --help            show this help message and exit
  --model MODEL         Path or repo id of the model
  --max_tokens MAX_TOKENS
                        Maximum number of tokens to generate
  --text TEXT           Text to generate (leave blank to input via stdin)
  --voice VOICE         Voice name
  --speed SPEED         Speed of the audio
  --gender GENDER       Gender of the voice [male, female]
  --pitch PITCH         Pitch of the voice
  --lang_code LANG_CODE
                        Language code
  --file_prefix FILE_PREFIX
                        Output file name prefix
  --verbose             Print verbose output
  --join_audio          Join all audio files into one
  --play                Play the output audio
  --audio_format AUDIO_FORMAT
                        Output audio format
  --ref_audio REF_AUDIO
                        Path to reference audio
  --ref_text REF_TEXT   Caption for reference audio
  --stt_model STT_MODEL
                        STT model to use to transcribe reference audio
  --temperature TEMPERATURE
                        Temperature for the model
  --top_p TOP_P         Top-p for the model
  --top_k TOP_K         Top-k for the model
  --repetition_penalty REPETITION_PENALTY
                        Repetition penalty for the model
  --stream              Stream the audio as segments instead of saving to a
                        file
  --streaming_interval STREAMING_INTERVAL
                        The time interval in seconds for streaming segments
