I am running a program that, among other things, does matrix multiplications, singular value decompositions, and matrix-subset accesses on a very large dataset (these are the lines of code that take the most time).

I am running the computations on a CPU with 40 cores and 1TB of RAM (yes, 1TB). The dataset itself is ~450GB and another matrix in the code is equally big, meaning I use ~95% of the 1TB of memory. I've learned that my computations scale quite badly when I increase the dataset size.

My question is: does using several cores instead of just one slow my computations when memory access takes up most of the time?
3 Answers
Does using several cores instead of just one slow my computations when memory access takes up most of the time?
Maybe. It depends on whether your system is NUMA¹. This can cause a slowdown when one core has to ask another to access its memory.

¹ Non-uniform memory access
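On Linux you can see how many NUMA nodes the machine exposes and how much memory is free on each. A minimal sketch, assuming the standard sysfs layout:

```python
# List NUMA nodes and per-node free memory via the standard Linux sysfs
# paths (Linux-only; assumes /sys is mounted in the usual place).
import glob

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(f"{node}/meminfo") as f:
        for line in f:
            if "MemFree" in line:
                # lines look like: "Node 0 MemFree:  123456 kB"
                print(node.rsplit("/", 1)[-1], line.split()[-2], "kB free")
```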
- A system with 40 cores and 1TB is definitely NUMA. However, the memory is already split up (probably) 5 or 10 or 20 ways. Using more cores means each core is closer to some memory and farther away from some. – Stack Exchange Broke The Law, Nov 8, 2022 at 14:03
- Regardless of NUMA, there is also the possibility of being bottlenecked by memory bandwidth. – Stack Exchange Broke The Law, Nov 8, 2022 at 14:05
With matrix multiplication on a terabyte of data, memory bandwidth can easily be your bottleneck. You have 40 CPUs running probably nicely vectorised code; that puts a lot of pressure on your memory subsystem. And the better your code, the higher the pressure.
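To get a feel for the numbers: even a single streaming pass over the data takes seconds, and every extra pass your algorithm makes adds the same again. A back-of-envelope sketch, where the bandwidth figure is an assumed placeholder you should replace with your machine's measured value:

```python
# Rough estimate: seconds per full pass over the dataset at a given
# sustained memory bandwidth. 200 GB/s is an assumption, not a spec.
data_bytes = 450e9        # the ~450 GB dataset
bandwidth = 200e9         # assumed sustained bandwidth, bytes/second
print(f"one full pass: {data_bytes / bandwidth:.1f} s")  # ~2.2 s
```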
I would start by making sure that you are partitioning the work into chunks that can be handled inside the cache on each CPU, for example by splitting each matrix into 256 chunks of less than 2 MB each (a quick sizing sketch follows below). You might also check how much work per dollar you get out of an M1 Mac: not that much RAM, but tons of bandwidth (800GB per second), very fast SSD / virtual memory, and a lot cheaper than any 1TB machine.
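To put a number on the chunk size: a quick calculation, assuming float64 elements and a 2 MB per-core cache (check your CPU's actual cache sizes):

```python
# How big can a square float64 tile be if three tiles (blocks of A, B
# and the C accumulator) must fit in an assumed 2 MB per-core cache?
import math

cache_bytes = 2 * 1024**2
tile = int(math.sqrt(cache_bytes / (3 * 8)))  # 3 tiles, 8 bytes per element
print(f"~{tile} x {tile} float64 elements per tile")  # ~295 x 295
```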
PS: your caches may have problems if the distance between rows is a power of two. A 4096x4096 matrix could be a very bad idea; if that's what you have, try changing it to 4100 x 4096, for example.
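In numpy terms (numpy arrays are row-major by default), this padding trick means over-allocating the trailing dimension and working on a view. A sketch:

```python
import numpy as np

# Pad the trailing dimension so the distance between consecutive rows
# is 4100 * 8 bytes instead of the power-of-two 4096 * 8.
backing = np.empty((4096, 4100))
a = backing[:, :4096]   # the 4096 x 4096 view you actually compute with
print(a.strides)        # (32800, 8): row stride is no longer a power of two
```

Note that some BLAS routines copy non-contiguous inputs before computing, which can cancel the benefit, so measure both variants.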
- I bet the 1TB machine also has 800GB per second of memory bandwidth, or more. – Stack Exchange Broke The Law, Nov 10, 2022 at 9:12
- But it also has an enormous price tag: Dell or Apple machines with 1.5TB are between 20,000 and 30,000 dollars. And 800 GB/s is an awful lot; I wouldn't bet on it until I had seen the spec. – gnasher729, Nov 10, 2022 at 23:02
It really depends on the implementation.

A very important part of optimization is using the caches efficiently. While L1 and L2 are typically per core, L3 is often shared. In the absolute worst case, the increased number of memory requests could cause L3 cache lines to be evicted just before they are needed again.

A well-optimized implementation should try to load a section of data that fits into the L1 cache and do as much computation as possible before loading the next chunk, and ideally do the same with the L2 cache. This typically results in almost unreadable code, but it can give huge performance gains.
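As an illustration of the idea only (a tuned BLAS does this, and much more, internally), a minimal loop-blocking sketch in numpy, assuming square matrices whose size is divisible by the block size:

```python
import numpy as np

def blocked_matmul(a, b, block=256):
    """Compute a @ b block by block so each block of `a` is reused
    while it is still hot in cache."""
    n = a.shape[0]
    c = np.zeros((n, n))
    for i in range(0, n, block):
        for k in range(0, n, block):
            a_blk = a[i:i+block, k:k+block]   # loaded once, reused below
            for j in range(0, n, block):
                c[i:i+block, j:j+block] += a_blk @ b[k:k+block, j:j+block]
    return c
```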
So the first thing I would do is check the library. I would expect common and popular libraries like numpy to perform well, as well as libraries whose main selling point is performance, like Intel's performance primitives. I would expect internal or smaller libraries to perform worse, since optimization requires a huge time investment.
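With numpy specifically, you can check which BLAS/LAPACK backend it was built against; a reference (unoptimised) backend would be a red flag:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries numpy links against (MKL, OpenBLAS, ...).
np.show_config()
```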
Workstation-class processors typically also scale the amount of cache and the number of memory channels with the number of cores. If you have a 40-core processor, you likely have far more bandwidth available than a single core will use; 4-8 cores per memory channel seems fairly common today.

A well-optimized library should make optimal use of the resources available. In some cases manually tuning parameters might help, but this will likely be a trial-and-error process. Still, multiplying huge matrices is just fundamentally slow, so you will likely not solve your problem of poor scaling.
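One tuning knob worth a quick experiment is the BLAS thread count. A sketch using the third-party threadpoolctl package (the matrix size here is a stand-in for your real workload):

```python
import time
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(8192, 8192)   # placeholder; substitute a representative job
for n_threads in (1, 2, 4, 8, 16, 32, 40):
    # Limit the BLAS thread pool, then time one multiplication.
    with threadpool_limits(limits=n_threads, user_api="blas"):
        t0 = time.perf_counter()
        _ = a @ a
        print(f"{n_threads:3d} threads: {time.perf_counter() - t0:.2f} s")
```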
- Just an idea: if your code is multithreaded (I bet it is), you may be able to control how many threads are used, and determine the optimal number. – gnasher729, Nov 9, 2022 at 16:50
- My Mac at home: L1 is per core, L2 is per group of 4 cores, L3 is shared between all cores and GPUs. And fast memory with 16-byte lines. – gnasher729, Mar 13, 2023 at 14:09
Use numastat (on Linux) and whatever perf counters your OS/CPU offers to get an idea of what changes as your dataset grows.
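A sketch of watching per-node allocations for the current process, assuming numastat (from the numactl package) is installed:

```python
import os
import subprocess

# Show how the current process's memory is spread across NUMA nodes.
# Run before and after loading the dataset to see where the pages land.
subprocess.run(["numastat", "-p", str(os.getpid())])
```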