AI plays an important role in the work we do at Meta. We leverage AI to deliver more personalized experiences and recommendations for people across our family of apps. We’re also committed to advancing the state of the art in generative AI, computer vision, new augmented reality (AR) tools, natural language processing (NLP), and other core areas of AI for a wide range of applications.
Delivering on these commitments means maximizing the performance of every GPU within our AI clusters across three performance pillars: compute, memory, and network.
Within these pillars, AI cluster performance can be influenced by multiple factors, including model parameters, workload distribution, job scheduler logic, topology, and hardware specs. But focusing on these pillars in isolation leads to local optimizations that cannot tap into the full extent of cluster performance. From an organizational perspective, it also reduces efficiency, because multiple efforts with the same goal of increasing cluster performance aren’t being prioritized holistically. These challenges will only grow as large language models (LLMs) become more prevalent.
We need a systemized source of truth that can simulate various performance factors across compute, storage, and network collectively. That’s where Arcadia, Meta’s end-to-end AI system performance simulator, comes in. Arcadia is designed to provide a unified simulation framework that accurately models the performance of compute, memory, and network components within large-scale AI training clusters. Using insights from Arcadia, our engineers and developers can make data-driven design decisions for AI clusters and the infrastructure that supports them while both are still being developed.