How streaming LLM APIs work

I decided to have a poke around and see if I could figure out how the HTTP streaming APIs from the various hosted LLM providers actually worked. Here are my notes so far.

All three of the APIs I investigated worked roughly the same: they return data with a content-type: text/event-stream header, which matches the server-sent events mechanism, then stream blocks separated by \r\n\r\n. Each block has a data: JSON line. Anthropic also include an event: line with an event type.
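
For example, a single Anthropic block looks something like this (a schematic with a simplified payload, not verbatim output), where the other APIs would send only the data: line:

    event: content_block_delta
    data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}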

Annoyingly these can't be directly consumed using the browser EventSource API because that only works for GET requests, and these APIs all use POST.
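
One workaround is to use fetch() and read the response body manually. Here's a minimal TypeScript sketch of that approach (assuming Node 18+ for the built-in fetch; the parsing is deliberately naive, handling only data: lines and ignoring event: lines):

    // Minimal sketch: consuming a POST-based event stream with fetch(),
    // since EventSource only supports GET requests.
    async function streamChat(apiKey: string, prompt: string): Promise<void> {
      const response = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${apiKey}`,
        },
        body: JSON.stringify({
          model: "gpt-4o-mini",
          messages: [{ role: "user", content: prompt }],
          stream: true,
        }),
      });

      const reader = response.body!.getReader();
      const decoder = new TextDecoder();
      let buffer = "";

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });

        // Blocks are separated by \r\n\r\n; keep any incomplete
        // trailing block in the buffer until more bytes arrive.
        const blocks = buffer.split("\r\n\r\n");
        buffer = blocks.pop()!;

        for (const block of blocks) {
          for (const line of block.split("\r\n")) {
            if (!line.startsWith("data: ")) continue; // skip event: lines
            const payload = line.slice("data: ".length);
            if (payload === "[DONE]") return; // OpenAI's end-of-stream marker
            const chunk = JSON.parse(payload);
            process.stdout.write(chunk.choices?.[0]?.delta?.content ?? "");
          }
        }
      }
    }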

The following curl incantation runs a prompt through GPT-4o Mini and requests a streaming response. The "stream_options": {"include_usage": true} bit requests that the final message in the stream include details of how many input and output tokens were charged while processing the prompt.
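
A minimal version, with a placeholder prompt:

    # The prompt here is an illustrative stand-in
    curl https://api.openai.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      --no-buffer \
      -d '{
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Tell me a joke"}],
        "stream": true,
        "stream_options": {"include_usage": true}
      }'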

That --no-buffer option ensures curl outputs the stream to the console as it arrives. Here's what I got back, with the middle truncated (see this Gist for the whole thing):
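
The shape of that stream is roughly the following (a schematic with placeholder values, not the real run): each block's delta carries a fragment of the response text, the final content block has a finish_reason, and because of include_usage a last block reports token counts before the data: [DONE] terminator.

    data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}

    data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": null}]}

    ...

    data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}

    data: {"object": "chat.completion.chunk", "choices": [], "usage": {"prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20}}

    data: [DONE]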
