The Multimodal Live API enables low-latency, two-way interactions that use text, audio, and video input, with audio and text output. This facilitates natural, human-like voice conversations with the ability to interrupt the model at any time. The model's video understanding capability expands communication modalities, enabling you to share camera input or screencasts and ask questions about them.
The session configuration is sent in the first message after connection. A session configuration includes the model, generation parameters, system instructions, and tools.
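As a rough illustration of the configuration message described above, the following sketch builds a first message as a JSON string. The envelope key `setup` and the field names `generationConfig`, `systemInstruction`, and `tools` are assumptions modeled on the list of items above, and the model name is a placeholder; consult the API reference for the exact schema.

```python
import json


def build_setup_message(model, system_instruction=None,
                        generation_config=None, tools=None):
    """Build the first message of a session as a JSON string.

    All key names below are assumptions for illustration only.
    """
    setup = {"model": model}
    if generation_config is not None:
        setup["generationConfig"] = generation_config
    if system_instruction is not None:
        # Assumed shape: instructions wrapped as a list of text parts.
        setup["systemInstruction"] = {"parts": [{"text": system_instruction}]}
    if tools is not None:
        setup["tools"] = tools
    return json.dumps({"setup": setup})


setup_json = build_setup_message(
    model="models/example-model",  # placeholder, not a real model name
    system_instruction="You are a helpful assistant.",
    generation_config={"temperature": 0.7},
)
```

The resulting string would be sent once, as the first message on the newly opened connection, before any other client message.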
To send a message, the client must send a supported client message as a JSON-formatted string over an open WebSocket connection.
To receive messages from Gemini, listen for the WebSocket 'message' event, and then parse the result according to the definition of supported server messages.
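The parsing step can be sketched as a small dispatcher that decodes the raw payload and reports which kind of server message it carries. The key names in `KNOWN_SERVER_MESSAGES` are assumptions used for illustration; the authoritative list is the definition of supported server messages.

```python
import json

# Assumed top-level keys of server messages (illustrative, not authoritative).
KNOWN_SERVER_MESSAGES = (
    "setupComplete",
    "serverContent",
    "toolCall",
    "toolCallCancellation",
)


def classify_server_message(raw):
    """Decode one WebSocket payload and return (kind, body).

    `raw` may be str or bytes, since WebSocket libraries can deliver either.
    Payloads with no recognized key are returned as ("unknown", message).
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    message = json.loads(raw)
    for kind in KNOWN_SERVER_MESSAGES:
        if kind in message:
            return kind, message[kind]
    return "unknown", message
```

A handler attached to the connection's message callback could call `classify_server_message` on each payload and branch on the returned kind.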
Use incremental updates to send text input or to establish or restore session context. For short contexts, you can send turn-by-turn interactions to represent the exact sequence of events. For longer contexts, it's recommended to provide a single summary message to free up the context window for follow-up interactions.
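The two approaches to restoring context can be sketched with one helper that serializes a list of turns. The envelope key `clientContent` and the fields `turns`, `turnComplete`, `role`, and `parts` are assumptions for illustration, as is the use of a `user` turn to carry the summary.

```python
import json


def build_context_message(turns, turn_complete=False):
    """Serialize (role, text) pairs into one incremental update.

    Key names are assumed. `turn_complete=False` is meant to signal that
    more input follows, so the model should not respond yet.
    """
    contents = [
        {"role": role, "parts": [{"text": text}]}
        for role, text in turns
    ]
    return json.dumps(
        {"clientContent": {"turns": contents, "turnComplete": turn_complete}}
    )


# Short context: replay the exact sequence of events turn by turn.
short_history = build_context_message([
    ("user", "What is the capital of France?"),
    ("model", "Paris."),
])

# Longer context: a single summary message in place of the full history.
summary = build_context_message([
    ("user", "Summary of the earlier conversation: we discussed European capitals."),
])
```

Sending the summary variant uses one turn regardless of how long the original conversation was, which is what frees up the context window.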