I am running a program that, among other things, does matrix multiplications, singular value decompositions, and matrix-subset accesses on a very large dataset (these are the lines of code that take the most time).

I am running the computations on a CPU with 40 cores and 1TB of RAM (yes, 1TB). The dataset itself is ~450GB and another matrix in the code is equally big, meaning I use ~95% of the 1TB of memory. I've learned that my computations scale quite badly when I increase the dataset size.

My question is: does using several cores instead of just one slow my computations when memory access takes up most of the time?
3 Answers
Does using several cores instead of just one slow my computations when memory access takes up most of the time?
Maybe. It depends on whether your system is NUMA¹. This can cause a slowdown when one core has to ask another to access its memory.

¹ Non-uniform memory access
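On Linux you can see how many NUMA nodes the machine exposes and how much memory is free on each. A minimal sketch, assuming the standard sysfs layout:

```python
# List NUMA nodes and per-node free memory via the standard Linux sysfs
# paths (Linux-only; assumes /sys is mounted in the usual place).
import glob

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(f"{node}/meminfo") as f:
        for line in f:
            if "MemFree" in line:
                # lines look like: "Node 0 MemFree:  123456 kB"
                print(node.rsplit("/", 1)[-1], line.split()[-2], "kB free")
```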
- A system with 40 cores and 1TB is definitely NUMA. However, the memory is already split up (probably) 5 or 10 or 20 ways. Using more cores means each core is closer to some memory and farther away from some. – Stack Exchange Broke The Law, Nov 8, 2022 at 14:03
- Regardless of NUMA, there is also the possibility of being bottlenecked by memory bandwidth. – Stack Exchange Broke The Law, Nov 8, 2022 at 14:05
With matrix multiplication on a terabyte of data, memory bandwidth can easily be your bottleneck. You have 40 CPUs running probably nicely vectorised code; that puts a lot of pressure on your memory subsystem. And the better your code, the higher the pressure.
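To get a feel for the numbers: even a single streaming pass over the data takes seconds, and every extra pass your algorithm makes adds the same again. A back-of-envelope sketch, where the bandwidth figure is an assumed placeholder you should replace with your machine's measured value:

```python
# Rough estimate: seconds per full pass over the dataset at a given
# sustained memory bandwidth. 200 GB/s is an assumption, not a spec.
data_bytes = 450e9        # the ~450 GB dataset
bandwidth = 200e9         # assumed sustained bandwidth, bytes/second
print(f"one full pass: {data_bytes / bandwidth:.1f} s")  # ~2.2 s
```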
I would start by making sure that you are partitioning the work into chunks that can be handled inside the cache on each CPU, for example by splitting each matrix into 256 chunks of less than 2 MB each (a quick sizing sketch follows below). You might also check how much work per dollar you get out of an M1 Mac: not that much RAM, but tons of bandwidth (800GB per second), very fast SSD / virtual memory, and a lot cheaper than any 1TB machine.
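To put a number on the chunk size: a quick calculation, assuming float64 elements and a 2 MB per-core cache (check your CPU's actual cache sizes):

```python
# How big can a square float64 tile be if three tiles (blocks of A, B
# and the C accumulator) must fit in an assumed 2 MB per-core cache?
import math

cache_bytes = 2 * 1024**2
tile = int(math.sqrt(cache_bytes / (3 * 8)))  # 3 tiles, 8 bytes per element
print(f"~{tile} x {tile} float64 elements per tile")  # ~295 x 295
```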
PS: your caches may have problems if the distance between rows is a power of two. A 4096x4096 matrix could be a very bad idea; if that's what you have, try changing it to 4100 x 4096, for example.
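In numpy terms (numpy arrays are row-major by default), this padding trick means over-allocating the trailing dimension and working on a view. A sketch:

```python
import numpy as np

# Pad the trailing dimension so the distance between consecutive rows
# is 4100 * 8 bytes instead of the power-of-two 4096 * 8.
backing = np.empty((4096, 4100))
a = backing[:, :4096]   # the 4096 x 4096 view you actually compute with
print(a.strides)        # (32800, 8): row stride is no longer a power of two
```

Note that some BLAS routines copy non-contiguous inputs before computing, which can cancel the benefit, so measure both variants.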
- I bet the 1TB machine also has 800GB per second of memory bandwidth, or more. – Stack Exchange Broke The Law, Nov 10, 2022 at 9:12
- But it also has an enormous price tag: Dell or Apple machines with 1.5TB are between 20,000 and 30,000 dollars. And 800 GB/s is an awful lot; I wouldn't bet on it until I had seen the spec. – gnasher729, Nov 10, 2022 at 23:02
It really depends on the implementation.

A very important part of optimization is using the caches efficiently. While L1 and L2 are typically per core, L3 is often shared. In the absolute worst case, the increased number of memory requests could cause L3 cache lines to be evicted just before they are needed again.

A well-optimized implementation should try to load a section of data that fits into the L1 cache and do as much computation as possible before loading the next chunk, and ideally do the same with the L2 cache. This typically results in almost unreadable code, but it can give huge performance gains.
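As an illustration of the idea only (a tuned BLAS does this, and much more, internally), a minimal loop-blocking sketch in numpy, assuming square matrices whose size is divisible by the block size:

```python
import numpy as np

def blocked_matmul(a, b, block=256):
    """Compute a @ b block by block so each block of `a` is reused
    while it is still hot in cache."""
    n = a.shape[0]
    c = np.zeros((n, n))
    for i in range(0, n, block):
        for k in range(0, n, block):
            a_blk = a[i:i+block, k:k+block]   # loaded once, reused below
            for j in range(0, n, block):
                c[i:i+block, j:j+block] += a_blk @ b[k:k+block, j:j+block]
    return c
```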
So the first thing I would do is check the library. I would expect common and popular libraries like numpy to perform well, as well as libraries whose main selling point is performance, like Intel's performance primitives. I would expect internal or smaller libraries to perform worse, since optimization requires a huge time investment.
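With numpy specifically, you can check which BLAS/LAPACK backend it was built against; a reference (unoptimised) backend would be a red flag:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries numpy links against (MKL, OpenBLAS, ...).
np.show_config()
```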
Workstation-class processors typically also scale the amount of cache and the number of memory channels with the number of cores. If you have a 40-core processor, you likely have far more bandwidth available than a single core will use; 4-8 cores per memory channel seems fairly common today.

A well-optimized library should make optimal use of the resources available. In some cases manually tuning parameters might help, but this will likely be a trial-and-error process. Still, multiplying huge matrices is just fundamentally slow, so you will likely not solve your problem of poor scaling.
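One tuning knob worth a quick experiment is the BLAS thread count. A sketch using the third-party threadpoolctl package (the matrix size here is a stand-in for your real workload):

```python
import time
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(8192, 8192)   # placeholder; substitute a representative job
for n_threads in (1, 2, 4, 8, 16, 32, 40):
    # Limit the BLAS thread pool, then time one multiplication.
    with threadpool_limits(limits=n_threads, user_api="blas"):
        t0 = time.perf_counter()
        _ = a @ a
        print(f"{n_threads:3d} threads: {time.perf_counter() - t0:.2f} s")
```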
- Just an idea: if your code is multithreaded (I bet it is), you may be able to control how many threads are used, and determine the optimal number. – gnasher729, Nov 9, 2022 at 16:50
- My Mac at home: L1 is per core, L2 is per group of 4 cores, L3 is shared between all cores and GPUs. And fast memory with 16-byte lines. – gnasher729, Mar 13, 2023 at 14:09
Use numastat (on Linux) and whatever perf counters your OS/CPU offers to get an idea of what changes as your dataset grows.
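A sketch of watching per-node allocations for the current process, assuming numastat (from the numactl package) is installed:

```python
import os
import subprocess

# Show how the current process's memory is spread across NUMA nodes.
# Run before and after loading the dataset to see where the pages land.
subprocess.run(["numastat", "-p", str(os.getpid())])
```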