
si-pbc / hertz-dev (Standard Intelligence)

Submitted by Style Pass, 2024-11-16 05:30:02

Hertz-dev is an open-source, first-of-its-kind base model for full-duplex conversational audio. It is an 8.5B parameter transformer trained on 20 million unique hours of high-quality audio data. This repo contains code for both mono- and full-duplex generation; we expect to do a full Transformers library integration in the near future.

Hertz-dev is a base model, without fine-tuning, RLHF, or instruction-following behavior. It can be fine-tuned for almost any audio modeling task, from live translation to classification. Base models excel at faithfully modeling their training set, and accurate maps come from contact with reality.

Trained on the world’s largest known dataset of high-quality real-world conversational audio, hertz-dev exhibits state-of-the-art ability in human-like speech patterns such as pauses and emotional inflections. Hertz-dev has an 80 ms theoretical average latency and benchmarks at 120 ms real-world latency on a single RTX 4090, which is 1.5-2x lower than the previous state of the art. Low latency is necessary for natural audio, and we're proud to move the field in this direction.
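To make the latency figures concrete, here is a small sketch of the arithmetic they imply. It assumes "1.5-2x lower" means the previous best latency divided by the measured latency falls between 1.5 and 2; the function names are hypothetical and used only for illustration.

```python
def implied_baseline_ms(latency_ms, low=1.5, high=2.0):
    """Range of prior latencies implied by a 'low-high x lower' claim (assumed reading)."""
    return latency_ms * low, latency_ms * high


def overhead_ms(theoretical_ms, measured_ms):
    """Gap between the theoretical average latency and the benchmarked one."""
    return measured_ms - theoretical_ms


print(implied_baseline_ms(120))  # (180.0, 240.0)
print(overhead_ms(80, 120))      # 40
```

Under that reading, the previous state of the art would sit at roughly 180 to 240 ms, and the benchmarked figure carries about 40 ms on top of the theoretical average.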

Inference is known to work on Python 3.10 and CUDA 12.1. Other versions have not been tested as thoroughly. If you want to use CUDA 12.1, you'll need to install torch with:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

All three scripts will automatically download the models you need.
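Before running inference, it can help to compare the local setup against the tested configuration. The following is a minimal sketch, not part of the hertz-dev repo: `check_environment` is a hypothetical helper, and it relies only on `sys.version_info`, `torch.version.cuda`, and `torch.cuda.is_available()`. It returns a partial report if torch is not installed.

```python
import sys


def check_environment():
    """Compare the local setup to the tested one (Python 3.10, CUDA 12.1)."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "python_matches": sys.version_info[:2] == (3, 10),
        "torch_cuda": None,       # CUDA version torch was built against
        "cuda_matches": False,
        "gpu_available": False,
    }
    try:
        import torch
    except ImportError:
        # torch is not installed yet; report only the Python side.
        return report
    report["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_matches"] = torch.version.cuda == "12.1"
    report["gpu_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A mismatch does not necessarily mean inference will fail, since other versions are described only as less thoroughly tested.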
