
We discovered why language models catastrophically fail on long conversations: when old tokens are removed to save memory, models produce complete gibberish. We found models dump massive attention onto the first few tokens as "attention sinks"—places to park unused attention since softmax requires weights to sum to 1. Our solution, StreamingLLM, simply keeps these first 4 tokens permanently while sliding the window for everything else, enabling stable processing of 4 million+ tokens instead of just thousands. This mechanism is now in HuggingFace, NVIDIA TensorRT-LLM, and OpenAI's latest models.
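To make the cache policy concrete, here is a minimal sketch of the "keep the sinks, slide the window" idea. The function name, default sizes, and return format are illustrative assumptions, not the actual StreamingLLM or HuggingFace API.

```python
# Hypothetical sketch of the StreamingLLM eviction policy: always keep the
# first few "attention sink" positions, keep a sliding window of the most
# recent positions, and evict everything in between.

def positions_to_keep(cache_len, num_sink_tokens=4, window_size=1024):
    """Return the KV-cache positions that survive eviction."""
    if cache_len <= num_sink_tokens + window_size:
        return list(range(cache_len))               # nothing to evict yet
    sinks = list(range(num_sink_tokens))            # first tokens, kept forever
    recent = list(range(cache_len - window_size, cache_len))  # sliding window
    return sinks + recent

# Example: after 5,000 generated tokens, the cache holds positions
# 0-3 plus the most recent 1,024 positions.
keep = positions_to_keep(cache_len=5000)
```

Because the sink tokens never leave the cache, the softmax always has somewhere to park its excess attention, and generation stays stable far beyond the training context length.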

This week, OpenAI made headlines by releasing their first open-source large language models, GPT-OSS-20B and GPT-OSS-120B. Buried in the technical documentation was a fascinating architectural detail: the inclusion of attention sink mechanisms.

This simple modification—adding just one learnable parameter per attention head—enables the model to "pay no attention to any tokens" when needed, a design choice OpenAI's model card explicitly attributes to our StreamingLLM work.
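One way such a per-head learnable sink can work is sketched below; the shapes, names, and omission of causal masking are assumptions for illustration, not OpenAI's implementation. The learnable logit joins the softmax denominator, so the weights placed on real tokens no longer have to sum to 1.

```python
import torch

def attention_with_sink(q, k, v, sink_logit):
    """Scaled dot-product attention with a learnable per-head sink logit.

    q, k, v: (batch, heads, seq, dim); sink_logit: (heads,).
    Causal masking is omitted for brevity.
    """
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)      # (B, H, S, S)
    # Broadcast the per-head sink logit to one extra "token" column.
    sink = sink_logit.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    weights = torch.softmax(torch.cat([sink, scores], dim=-1), dim=-1)
    # Drop the sink column: attention routed there contributes nothing,
    # letting a head effectively pay no attention to any real token.
    return weights[..., 1:] @ v
```

In effect, the sink plays the same role as the first-token attention sinks StreamingLLM preserves, but as a single trained parameter per head rather than actual cached tokens.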
