Notes on Google's New Dialog Model

submitted by

Style Pass

2024-11-17 20:30:14

Google released an API for their 2-way dialog model, the same technology that powers the podcast generation feature in NotebookLM. API docs here.

It supports only 2 speakers. The dialog is represented as a series of speaking turns, each of which is assigned a speaker. There are 2 male and 2 female speakers supported, with no custom voices or voice clones as of today.
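To make the turn structure concrete, here is a minimal Python sketch of how a transcript might be modeled before sending it anywhere. It is not Google's request schema: the `Turn` class, the `validate_dialog` helper, and the speaker labels "A" and "B" are all hypothetical names chosen for illustration. The only thing taken from the notes above is the 2-speaker limit.

```python
from dataclasses import dataclass

MAX_SPEAKERS = 2  # the limit described above


@dataclass(frozen=True)
class Turn:
    """One speaking turn. `speaker` is a hypothetical label, not an API field."""
    speaker: str
    text: str


def validate_dialog(turns):
    """Return the distinct speakers in order of first appearance.

    Raises ValueError if a turn is empty or more than MAX_SPEAKERS appear.
    """
    speakers = []
    for turn in turns:
        if not turn.text.strip():
            raise ValueError("empty turn text")
        if turn.speaker not in speakers:
            speakers.append(turn.speaker)
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"at most {MAX_SPEAKERS} speakers supported, got {len(speakers)}")
    return speakers


dialog = [
    Turn("A", "Well..."),
    Turn("B", "Well what?"),
]
print(validate_dialog(dialog))
```

Checking the speaker count locally is just a convenience; the actual field names and voice identifiers have to come from the API docs.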

The model outputs are extremely natural, which is what makes the NotebookLM podcast generation so uncanny. I have not heard any other commercial product that comes close. Play has a dialog generation model with custom voice options, but the voices sound less natural and more metallic, and have more audio artifacts.

Note that in the above example, “Well..” and “well what?” are meant to be uttered by two different speakers. But in my generations, they were consistently uttered by the same speaker. Tweaking the transcript by fixing the ellipsis usage (note that the above example from Google’s docs has 2 periods instead of 3 in the ellipsis) or adding filler words fixed the issue. Proper punctuation (or more robust training data) is critical!
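One way to guard against the two-period problem is to clean up each turn's text before submitting it. The sketch below is my own preprocessing idea, not something from Google's docs. The function name `normalize_ellipses` is hypothetical, and it only handles the exact case described here: a run of exactly 2 periods gets expanded to 3.

```python
import re

# Matches exactly two consecutive periods: not preceded or followed by another period.
_TWO_DOTS = re.compile(r"(?<!\.)\.\.(?!\.)")


def normalize_ellipses(text):
    """Expand a 2-period 'ellipsis' to the standard 3 periods.

    Single periods and existing 3-period ellipses are left untouched.
    """
    return _TWO_DOTS.sub("...", text)


print(normalize_ellipses("Well.."))   # Well...
print(normalize_ellipses("Well..."))  # Well...
print(normalize_ellipses("Done."))    # Done.
```

Whether this alone fixes speaker assignment will depend on the transcript; in my tests, adding filler words also worked.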

When one speaker has a long utterance, the model sometimes naturally inserts filler feedback words (“um”, “ah”, “hmm”). (Excuse the text copy/pasted from a YouTube description.)