0

Reading this interesting paper it seems that a lot of performance loss is due to scheduling overhead in tight loops. To recap: There's a variable called "Chunksize" which determines how big of a chunk (amount of iterations) each processing unit (effectively a thread) will claim to later execute in private context.

The problem is always that you don't really know how much time a chunk is going to take so there are multiple statistical approaches to solve this problem. However, there also exists this thing called PGO or profile guided optimization which is never mentioned in the paper. I was wondering if it is possible to empirically determine the execution time of a single task and work this variable back into code.

Effectively I want to have a bunch of variables in my code to be set by PGO based on the most optimal value for these. I know this is platform specific, but maybe you could also do this dynamically when optimizing for the host platform in some way. In any way, I just got my feet wet in this topic so I wanted to ask the experts. Is there already ongoing work in this area?

I could picture it to work like a genetic algorithm that tries out a bunch of variables and optimizes them as it goes along.

asked Apr 4, 2023 at 7:02
1
  • 3
    The answer to "is it possible" is almost always "yes". The question you need to be asking is "is it useful" and that's not something we can answer; this is a research-level problem. Commented Apr 4, 2023 at 9:13

1 Answer 1

3

PGO is not magic. The collected profiles can help the optimizer make more informed decisions about how the order in which branches are emitted, about whether to inline a function, and so on. This is possible because the profile lets us calculate empirical branch probabilities after executing a typical workload.

But such a profile doesn't let us infer the optimal value of tuning parameters. In some cases we can infer sensible changes from branch probabilities, but in the general case they can be difficult to interpret.

For example, a loop condition in a hot loop will typically have a very low branch probability for leaving the loop. We could argue from this that there are many iterations, so each iteration should do more work (e.g. by unrolling the loop a bit). On the other hand, such a condition is easy for the CPU to predict, so it doesn't represent a lot of overhead to start with.

In the OpenMP case, things are more tricky because the optimal batch size must balance the need to choose large batch sizes to keep communication overhead low with the need to choose small batch sizes to parallelize optimally. But this depends more on the network than on PGO-informed micro-optimizations.

So instead of relying on PGO, you'll probably have to tune the system using more conventional means. That means designing a series of benchmarks where you race different configurations against each other, and then decide whether one configuration is better in a statistically significant way. For comparing N fixed configurations, I recommend tools like hyperfine. For having the configuration tool automatically explore a parameter space, consider tools like irace (which samples configurations via evolutionary algorithms, as you suggest). If evaluating a candidate configuration is very expensive, Bayesian Optimization techniques can allow you to find the most promising candidate, at least in low-dimensional parameter spaces where you can have a useful distance metric.

In order to make experimentation easy, it's probably a good idea to keep the batchsize as a runtime-configurable parameter, rather than compiling it into the software. Even though compile-time configuration can enable better optimizations like loop unrolling, that tends to not be the constraining factor in distributed systems.

answered Apr 4, 2023 at 10:32
1
  • Thanks for your elaborate answer. I will definitely look into irace, never heard of it :) Commented Apr 4, 2023 at 14:38

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.