At 4:53pm Pacific on 2023-01-25 we had a major outage on one of our GPU clusters, resulting in a full outage of the text-davinci-003 model. 80% of cap

OpenAI Status - Multiple engines are down

submited by
Style Pass
2023-01-27 21:30:10

At 4:53pm Pacific on 2023-01-25 we had a major outage on one of our GPU clusters, resulting in a full outage of the text-davinci-003 model. 80% of capacity was back online in approximately one hour. Capacity was fully stabilized in approximately four hours.

The problem was due to a configuration change of "CNI" - the Container Network Interface plugins used to provide network connectivity to containers running on our clusters. Given the high demand of our services, we go to great lengths to utilize any and all GPU capacity available. This involves supporting a variety of different hardware and networking configurations, requiring different CNI configurations. We fully tested this CNI configuration change in a staging environment prior to deployment, but unfortunately our staging environment lacked one particular variation of hardware that only exists in production. The CNI change was incompatible with those servers and caused their workloads to lose network connectivity. 

Engineers immediately identified the problem as due to network connectivity, but it took nearly an hour for CNI to be identified as the cause. The CNI change had been deployed to other clusters over the past 24 hours, so had been deemed safe. The problem only arose once that change had been deployed to a cluster with different hardware. Once the problem was identified, a fix was in place and restored 80% of traffic immediately.

Leave a Comment