This article discusses the high-level details of std::vector and glibc malloc, and how allocating large memory regions impacts CPU usage under extreme CPU and memory pressure.
As a gist: prefer a direct mmap with MAP_SHARED|MAP_ANONYMOUS over std::vector or malloc, both of which use mmap internally with MAP_PRIVATE|MAP_ANONYMOUS for large allocations.
We recently rewrote one of our legacy large-scale biometric matcher products (a custom map-reduce implementation) from a multi-process architecture to a multi-threaded one, so that it uses fewer resources when running in a Docker container on Kubernetes, since the per-process overhead can be eliminated. We rolled this out to a few cloud deployments with smaller gallery sizes, on the order of hundreds of thousands, and it all went well. At least until we pushed the system to its limits and increased the gallery sizes to tens of millions.
On a host machine with 96 cores and 256 GiB of RAM, 235 GiB was filled with the gallery to be searched, and each CPU core was affined to a search worker thread to keep the searches efficient. The default tmpfs on the host was left untouched at 50% (128 GiB) on /dev/shm/, where POSIX shared memory is managed. System swap was turned off to ensure there would be no major page faults. The Docker container loaded all the datasets from the file system without any hiccups.

The expectation was that the response time would be around 11 seconds per core/request as the algorithm skims through the datasets looking for a biometric match. When the performance/load tests were started with 96 concurrent search requests, system/kernel CPU usage spiked and in turn increased response latencies tenfold. We had anticipated ~1% system CPU usage, as we had carefully designed all the memory allocations and locking strategies to ensure there would be no page faults. That assumption turned out to be false.