The WSE-3 packs 900,000 AI cores onto a single processor. Each core is independently programmable and optimized for the tensor-based, sparse linear algebra operations that underpin neural network training and inference, which lets the WSE-3 deliver maximum performance, efficiency, and flexibility.
Unlike traditional devices, in which the working cache memory is tiny, the WSE-3 takes 44 GB of super-fast on-chip SRAM and spreads it evenly across the entire surface of the chip. This gives every core single-clock-cycle access to fast memory at extremely high bandwidth: 21 PB/s in aggregate. Compared with the leading GPU, that is 880x more on-chip capacity and 7,000x greater bandwidth.
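To put these figures in per-core terms, here is a minimal back-of-the-envelope sketch in Python. It uses only the numbers quoted above and assumes decimal units (1 GB = 10^9 bytes) and a perfectly even split across cores. The variable names are illustrative, and the "implied GPU" values are simply the quoted totals divided by the quoted ratios, not published specifications.

```python
# Figures quoted in the text (decimal units assumed: 1 GB = 1e9 bytes).
CORES = 900_000
SRAM_BYTES = 44e9               # 44 GB of on-chip SRAM
SRAM_BW_BYTES_PER_S = 21e15     # 21 PB/s aggregate memory bandwidth
CAPACITY_RATIO = 880            # "880x more capacity"
BANDWIDTH_RATIO = 7_000         # "7,000x greater bandwidth"

# Even split across cores (an assumption based on "spreads it evenly").
sram_per_core_kb = SRAM_BYTES / CORES / 1e3
bw_per_core_gb_s = SRAM_BW_BYTES_PER_S / CORES / 1e9

# Baseline implied by the quoted ratios (derived, not a published spec).
implied_gpu_capacity_mb = SRAM_BYTES / CAPACITY_RATIO / 1e6
implied_gpu_bw_tb_s = SRAM_BW_BYTES_PER_S / BANDWIDTH_RATIO / 1e12

print(f"SRAM per core:        {sram_per_core_kb:.1f} KB")
print(f"Bandwidth per core:   {bw_per_core_gb_s:.1f} GB/s")
print(f"Implied GPU capacity: {implied_gpu_capacity_mb:.0f} MB")
print(f"Implied GPU bandwidth: {implied_gpu_bw_tb_s:.1f} TB/s")
```

Under these assumptions each core sees roughly 49 KB of local SRAM, and the comparison baseline works out to about 50 MB of on-chip memory at about 3 TB/s.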
The WSE-3 on-wafer interconnect eliminates the communication slowdown and inefficiencies of connecting hundreds of small devices via wires and cables. It delivers an incredible 214 Pb/s of processor-to-processor interconnect bandwidth. That’s more than 3,715x the bandwidth delivered between graphics processors.
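Because the interconnect figure is quoted in Pb/s while memory bandwidth is usually quoted in PB/s, a quick unit conversion helps when comparing the two. The sketch below assumes the lowercase "b" denotes bits and uses decimal prefixes. The GPU-to-GPU value is derived by dividing the quoted bandwidth by the quoted 3,715x ratio, so treat it as an implied baseline rather than a published specification.

```python
# Figures quoted in the text; "Pb/s" is assumed to mean petabits per second.
FABRIC_BITS_PER_S = 214e15      # 214 Pb/s on-wafer interconnect
INTERCONNECT_RATIO = 3_715      # "more than 3,715x"
BITS_PER_BYTE = 8

# Convert the fabric bandwidth from petabits/s to petabytes/s.
fabric_pbytes_per_s = FABRIC_BITS_PER_S / BITS_PER_BYTE / 1e15

# GPU-to-GPU baseline implied by the quoted ratio (derived, not a spec).
implied_gpu_tbits_per_s = FABRIC_BITS_PER_S / INTERCONNECT_RATIO / 1e12
implied_gpu_tbytes_per_s = implied_gpu_tbits_per_s / BITS_PER_BYTE

print(f"Fabric bandwidth:      {fabric_pbytes_per_s:.2f} PB/s")
print(f"Implied GPU baseline:  {implied_gpu_tbits_per_s:.1f} Tb/s "
      f"({implied_gpu_tbytes_per_s:.1f} TB/s)")
```

With these assumptions, 214 Pb/s corresponds to 26.75 PB/s, and the implied baseline is roughly 57.6 Tb/s, or about 7.2 TB/s.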
Programming a cluster to scale deep learning is painful. It typically requires dozens to hundreds of engineering hours, and it remains a practical barrier that keeps many from realizing the value of large-scale AI in their work.