charactr/vocos-mel-24khz

🧠 AI Model · charactr

Open-source, high-quality neural vocoder that converts mel-spectrograms to 24 kHz audio.

Vocos is a neural vocoder that synthesizes 24 kHz audio from mel-spectrograms. It is based on the paper arXiv:2306.00814, which aims to close the gap between time-domain and Fourier-based neural vocoders. Rather than upsampling features to waveform samples directly, the model predicts Fourier spectral coefficients and reconstructs the waveform with an inverse short-time Fourier transform (ISTFT). This keeps the temporal resolution constant through the network, which speeds up inference. The backbone is a convolutional architecture with residual blocks, trained adversarially to generate natural-sounding waveforms. It is designed for fast inference and can serve as the final stage of a TTS pipeline, turning a predicted mel-spectrogram into audio. The model is available on HuggingFace with PyTorch weights and has over 1.36 million downloads and 41 likes.
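The TTS integration described above can be sketched roughly as follows. This is a minimal example, not a definitive recipe: it assumes the `vocos` Python package is installed (`pip install vocos`) alongside PyTorch, and that this checkpoint expects 100 mel bins with a hop of 256 samples per frame. Those two values are not stated on this page, so treat them as assumptions and check the model's configuration. The random tensor is only a placeholder for the output of a TTS acoustic model.

```python
SAMPLE_RATE = 24_000  # output sampling rate stated on this card
N_MELS = 100          # assumption: mel bins expected by this checkpoint
HOP_LENGTH = 256      # assumption: audio samples produced per mel frame


def expected_duration_seconds(num_frames: int) -> float:
    """Approximate audio duration for a mel-spectrogram with `num_frames` frames."""
    return num_frames * HOP_LENGTH / SAMPLE_RATE


def synthesize(mel):
    """Decode a (batch, N_MELS, frames) mel tensor into a (batch, samples) waveform."""
    # Imported lazily so the helper above works without torch or vocos installed.
    import torch
    from vocos import Vocos

    # Downloads the weights from the Hub on first use.
    vocoder = Vocos.from_pretrained("charactr/vocos-mel-24khz")
    with torch.no_grad():
        return vocoder.decode(mel)


def demo():
    """Run the vocoder on a random placeholder mel-spectrogram."""
    import torch

    # In a real pipeline this tensor would come from a TTS acoustic model.
    mel = torch.randn(1, N_MELS, 256)
    audio = synthesize(mel)
    print(tuple(audio.shape), f"about {expected_duration_seconds(mel.shape[-1]):.2f} s")
```

Calling `demo()` requires network access for the first download. The duration helper is only an estimate, since padding in the inverse STFT can change the sample count slightly.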

💡 Highlights

├─1.36M downloads
├─arXiv:2306.00814 paper
└─MIT license

🎯 For

├─TTS developers
├─Audio researchers
└─AI enthusiasts

🔗 Links

└─HuggingFace Model