A Software Architecture for the Future of ML


Today’s ML hardware acceleration, whether implemented in silicon or, more recently, even in light, is headed toward chips that apply a petaflop of compute to a cell phone’s worth of memory. Our brains, on the other hand, are biologically the equivalent of applying a cell phone’s worth of compute to a petabyte of memory¹. In this sense, the direction being taken by hardware designers is the opposite of the one proven out by nature. Why? Simply because we don’t know the algorithms nature uses.
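To make the contrast concrete, here is a back-of-the-envelope sketch of the two compute-to-memory ratios. The specific numbers (roughly 8 GB for a phone's memory, roughly a teraflop for phone-scale compute) are illustrative assumptions, not figures from the article.

```python
# Rough comparison of compute-to-memory ratios (assumed round numbers).

accelerator_flops = 1e15   # ~a petaflop of compute on a modern ML chip
accelerator_bytes = 8e9    # ~cell-phone-sized memory (assume ~8 GB)

brain_flops = 1e12         # ~cell-phone-scale compute (assume ~1 TFLOP-equivalent)
brain_bytes = 1e15         # ~a petabyte of (synaptic) memory

print(f"accelerator: {accelerator_flops / accelerator_bytes:.0e} FLOPs per byte")
print(f"brain-like:  {brain_flops / brain_bytes:.0e} FLOPs per byte")
# Under these assumptions the two ratios differ by roughly eight orders of
# magnitude, and point in opposite directions.
```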

GPUs move data in and out quickly, but have little locality of reference because of their small caches. They are geared toward applying a lot of compute to a little data, not a little compute to a lot of data. The networks designed to run on them therefore execute full layer after full layer in order to saturate the computational pipeline (see Figure 1 below). To handle large models despite their small memory (tens of gigabytes), GPUs are grouped together and models are distributed across them, creating a complex and painful software stack that must manage many levels of communication and synchronization among separate machines.
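A minimal sketch of that GPU-friendly execution pattern: every layer is a dense matrix multiply over the full activation, run one layer after another so the compute pipeline stays busy. The layer sizes and count here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Eight dense layers, all kept resident and all executed on every input.
layers = [rng.standard_normal((1024, 1024)).astype(np.float32) for _ in range(8)]

def dense_forward(x, layers):
    # Each step touches every weight in the layer, whether or not this
    # particular input needs it: a lot of compute applied to a little data.
    for w in layers:
        x = np.maximum(x @ w, 0.0)  # full-layer matmul + ReLU
    return x

out = dense_forward(rng.standard_normal((1, 1024)).astype(np.float32), layers)
```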

CPUs, on the other hand, have larger and much faster caches than GPUs, and an abundance of memory (terabytes). A typical CPU server can hold memory equivalent to tens or even hundreds of GPUs. CPUs are perfect for a brain-like ML world in which parts of an extremely large network are executed piecemeal, as needed.
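Here is a hypothetical sketch of that piecemeal style on a large-memory CPU host: a big pool of weight blocks lives in ordinary RAM, and only the few blocks selected for a given input are ever read or executed. The routing rule and block structure below are illustrative assumptions, not the article's method.

```python
import numpy as np

rng = np.random.default_rng(0)
num_blocks, width = 256, 256
# A large table of weight blocks held in host RAM; at real scale this is the
# part that would occupy terabytes of CPU memory.
blocks = [rng.standard_normal((width, width)).astype(np.float32)
          for _ in range(num_blocks)]

def route(x, k=2):
    # Toy routing: derive a couple of block indices from the input.
    idx = int(abs(float(x.sum())) * 1e6) % num_blocks
    return [(idx + i) % num_blocks for i in range(k)]

def piecemeal_forward(x):
    # Only the selected blocks are touched; the other ~99% of the weights
    # are never loaded or multiplied for this input.
    active = route(x)
    return sum(np.maximum(x @ blocks[i], 0.0) for i in active) / len(active)

out = piecemeal_forward(rng.standard_normal((1, width)).astype(np.float32))
```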
