OpenAI Status - Multiple engines are down

submitted by

Style Pass

2023-01-27 21:30:10

At 4:53pm Pacific on 2023-01-25 we had a major outage on one of our GPU clusters, resulting in a full outage of the text-davinci-003 model. 80% of capacity was back online in approximately one hour. Capacity was fully stabilized in approximately four hours.

The problem was caused by a configuration change to "CNI", the Container Network Interface plugins that provide network connectivity to containers running on our clusters. Given the high demand for our services, we go to great lengths to utilize any and all GPU capacity available. This involves supporting a variety of hardware and networking configurations, each requiring a different CNI configuration. We fully tested this CNI configuration change in a staging environment prior to deployment, but unfortunately our staging environment lacked one particular variation of hardware that exists only in production. The CNI change was incompatible with those servers and caused their workloads to lose network connectivity.
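The gap described above, a hardware variation present in production but missing from staging, is the kind of thing a simple inventory comparison can flag before a rollout. The following is only an illustrative sketch, not the tooling used in this incident: the function name, node names, and variant labels are all hypothetical, and it assumes each node can be tagged with a single hardware-variant label.

```python
from typing import Dict, Set


def uncovered_variants(
    staging: Dict[str, str], production: Dict[str, str]
) -> Set[str]:
    """Return hardware variants present in production but absent from staging.

    Both arguments map a (hypothetical) node name to a hardware-variant label.
    """
    return set(production.values()) - set(staging.values())


# Hypothetical inventories; the node names and variant labels are made up.
staging_nodes = {"stg-1": "variant-a", "stg-2": "variant-b"}
production_nodes = {
    "prod-1": "variant-a",
    "prod-2": "variant-b",
    "prod-3": "variant-c",
}

missing = uncovered_variants(staging_nodes, production_nodes)
if missing:
    # A change tested only in staging has never run on these variants.
    print("Untested in staging:", sorted(missing))
```

With these made-up inventories the check reports `variant-c`, signalling that a change validated in staging has not been exercised on every kind of server it will reach in production.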

Engineers immediately identified the problem as a loss of network connectivity, but it took nearly an hour to identify CNI as the cause. The CNI change had been deployed to other clusters over the preceding 24 hours, so it had been deemed safe. The problem arose only once the change was deployed to a cluster with different hardware. Once the cause was identified, a fix was put in place and immediately restored 80% of traffic.


