Dia2 is an open-weights, streaming dialogue TTS model. It can start generating speech before a full sentence of input text is available, which makes it suitable for low-latency speech-to-speech systems. It generates up to 2 minutes of English audio and supports audio prefixing.
To accelerate research, the inference code and weights (1B and 2B variants) are available on GitHub and Hugging Face under the Apache 2.0 license. This work was heavily influenced by KyutaiTTS, Mimi, and Sesame. We thank the TPU Research Cloud for providing computational resources.