
Beyond malloc efficiency to fleet efficiency


The “memory wall” has been a long-standing challenge in computer hardware design—CPUs are getting faster and faster, but bandwidth and latency to main memory (or worse, to disk) haven’t kept up. The large working sets of data center workloads have exacerbated this problem, causing translation lookaside buffer (TLB) misses to become a large portion of the “data center tax” of warehouse-scale computers. In this post, we explore one technique for reducing TLB misses and improving application performance: huge pages.

TLBs enable a processor core to map a virtual address in a program to the physical location in memory where the data is held. A TLB holds a limited number of entries; if a mapping is not present, it must be fetched through an expensive page-table walk. On x86 processors, each TLB entry covering a standard page maps a 4KiB region of memory. With huge pages, a single TLB entry instead maps a 2MiB region. The same number of TLB entries can now map 512 times as much memory; this substantially reduces the number of TLB misses and their associated costs.
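
Applications can also request huge pages from the kernel directly. Below is a minimal, Linux-specific sketch (not TCMalloc's implementation) that uses madvise(MADV_HUGEPAGE) to ask the kernel to back a mapping with transparent 2MiB huge pages; the region size and error handling are illustrative.

#include <cstddef>
#include <cstdio>
#include <sys/mman.h>

int main() {
  // 128MiB, i.e. enough room for 64 aligned 2MiB huge pages.
  constexpr size_t kSize = 64 * 2 * 1024 * 1024;

  // Reserve an anonymous region; by default the kernel backs it
  // with 4KiB pages.
  void* p = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) {
    perror("mmap");
    return 1;
  }

  // Ask for transparent huge pages, so a single TLB entry can cover
  // 2MiB instead of 4KiB. This is advisory: if the kernel cannot
  // honor it, the region simply stays on 4KiB pages.
  if (madvise(p, kSize, MADV_HUGEPAGE) != 0) {
    perror("madvise(MADV_HUGEPAGE)");
  }

  // Touch each page so the mapping is actually faulted in.
  auto* bytes = static_cast<unsigned char*>(p);
  for (size_t i = 0; i < kSize; i += 4096) bytes[i] = 1;

  munmap(p, kSize);
  return 0;
}

An alternative is mmap with MAP_HUGETLB, which guarantees huge pages but requires the administrator to reserve them ahead of time; MADV_HUGEPAGE is best-effort and needs no pre-reservation.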

We’ve seen firsthand the improvements that huge pages can bring. In our OSDI 2021 paper, “Beyond malloc efficiency to fleet efficiency,” we describe Temeraire, our huge page-aware improvements to TCMalloc, our production memory allocator. The code for these changes is available on GitHub.
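
TCMalloc is linked into an application as its malloc implementation rather than called directly. Assuming a Bazel build with the repository already configured as the external dependency com_google_tcmalloc, a binary can opt in along these lines (the target and file names here are hypothetical):

cc_binary(
    name = "hello_main",
    srcs = ["hello_main.cc"],
    malloc = "@com_google_tcmalloc//tcmalloc",
)

The malloc attribute swaps out the default allocator for the whole binary, so the huge page-aware allocation policy applies to every allocation without code changes.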
