Reading this interesting paper, it seems that a lot of the performance loss is due to scheduling overhead in tight loops. To recap: there's a variable called "Chunksize" which determines how big a chunk (number of iterations) each processing unit (effectively a thread) claims and then executes in its private context.
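For concreteness, here is a rough sketch of the kind of loop I have in mind (assuming the paper's "Chunksize" maps to the chunk parameter of OpenMP's schedule clause; the workload is just a placeholder):

```cpp
#include <omp.h>
#include <cstddef>
#include <vector>

constexpr int CHUNK = 64;  // the tuning parameter in question

void process(std::vector<double>& data) {
    // With schedule(dynamic, CHUNK), each idle thread repeatedly claims CHUNK
    // iterations from a shared counter and executes them in its private context.
    #pragma omp parallel for schedule(dynamic, CHUNK)
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] = data[i] * data[i];  // placeholder for the real per-iteration work
    }
}
```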
The problem is always that you don't really know how much time a chunk is going to take, so there are multiple statistical approaches to this problem. However, there is also this thing called PGO, or profile-guided optimization, which is never mentioned in the paper. I was wondering whether it is possible to empirically determine the execution time of a single task and feed this value back into the code.
Effectively, I want a bunch of variables in my code to be set by PGO to their most optimal values. I know this is platform-specific, but maybe you could also do this dynamically when optimizing for the host platform in some way. In any case, I just got my feet wet in this topic, so I wanted to ask the experts: is there already ongoing work in this area?
I could picture it working like a genetic algorithm that tries out a bunch of values and optimizes them as it goes along.
-
The answer to "is it possible" is almost always "yes". The question you need to be asking is "is it useful", and that's not something we can answer; this is a research-level problem. – Philip Kendall, Apr 4, 2023 at 9:13
1 Answer
PGO is not magic. The collected profiles can help the optimizer make more informed decisions about the order in which branches are emitted, whether to inline a function, and so on. This is possible because the profile lets us calculate empirical branch probabilities after executing a typical workload.
But such a profile doesn't let us infer the optimal values of tuning parameters. In some cases we can infer sensible changes from branch probabilities, but in the general case they can be difficult to interpret.
For example, a loop condition in a hot loop will typically have a very low branch probability for leaving the loop. We could argue from this that there are many iterations, so each iteration should do more work (e.g. by unrolling the loop a bit). On the other hand, such a condition is easy for the CPU to predict, so it doesn't represent a lot of overhead to start with.
In the OpenMP case, things are trickier because the optimal batch size must balance two competing needs: large batches keep communication overhead low, while small batches spread the work evenly across processing units. But this depends more on the network than on PGO-informed micro-optimizations.
So instead of relying on PGO, you'll probably have to tune the system using more conventional means. That means designing a series of benchmarks in which you race different configurations against each other, and then deciding whether one configuration is better in a statistically significant way. For comparing N fixed configurations, I recommend tools like hyperfine. For automatically exploring a parameter space, consider tools like irace (which samples configurations via evolutionary algorithms, as you suggest). If evaluating a candidate configuration is very expensive, Bayesian optimization techniques can help you find the most promising candidates, at least in low-dimensional parameter spaces where you have a useful distance metric.
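As a rough illustration of the racing idea (a sketch only, not any particular tool's methodology; the candidate chunk sizes and the workload are made up for the example), a self-timing harness could look like this:

```cpp
#include <omp.h>
#include <algorithm>
#include <cstdio>
#include <vector>

// Time one run of the parallel loop with a given chunk size.
static double run_once(std::vector<double>& data, int chunk) {
    const double start = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, chunk)
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] *= 1.000001;  // stand-in for the real workload
    }
    return omp_get_wtime() - start;
}

int main() {
    std::vector<double> data(1 << 24, 1.0);
    const int repeats = 11;                      // repeat each candidate to reduce noise
    for (int chunk : {1, 16, 64, 256, 1024}) {   // candidate configurations
        std::vector<double> times;
        for (int r = 0; r < repeats; ++r)
            times.push_back(run_once(data, chunk));
        // Report the median; a real comparison would also apply a significance test.
        std::nth_element(times.begin(), times.begin() + repeats / 2, times.end());
        std::printf("chunk=%5d  median %.4f s\n", chunk, times[repeats / 2]);
    }
}
```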
To make experimentation easy, it's probably a good idea to keep the batch size as a runtime-configurable parameter, rather than compiling it into the software. Even though compile-time configuration can enable better optimizations like loop unrolling, that tends not to be the constraining factor in distributed systems.
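For OpenMP specifically, one way to do this (a sketch; CHUNKSIZE is a made-up environment variable for illustration) is to use schedule(runtime) in the loop and set the actual schedule with omp_set_schedule, so an external tool can vary the chunk size without recompiling:

```cpp
#include <omp.h>
#include <cstddef>
#include <cstdlib>
#include <vector>

void process(std::vector<double>& data) {
    // CHUNKSIZE is a hypothetical environment variable used only for illustration.
    const char* env = std::getenv("CHUNKSIZE");
    const int chunk = env ? std::atoi(env) : 64;  // fall back to a default
    omp_set_schedule(omp_sched_dynamic, chunk);   // takes effect for schedule(runtime)

    #pragma omp parallel for schedule(runtime)
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] += 1.0;  // placeholder for the real per-iteration work
    }
}
```

You can then race configurations from the outside, e.g. by having hyperfine compare `CHUNKSIZE=64 ./app` against `CHUNKSIZE=256 ./app`.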
-
Thanks for your elaborate answer. I will definitely look into irace, never heard of it :) – glades, Apr 4, 2023 at 14:38