Comments on this post from Reddit

MoonsOfJupiter:

Looks like he's basing his floating-point costs off of the old x87 instructions (FMUL) that almost nobody uses. Even for scalar operations you're generally better off using SSE or AVX instructions such as MULSS (32-bit float) or MULSD (64-bit float); on Haswell these both have a reciprocal throughput of 0.5 cycles.
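
For reference, plain scalar float math already compiles to these instructions with any modern x86-64 compiler - a trivial hypothetical sketch, not from the article:

 float  mul_f32(float a, float b)   { return a * b; }  /* compiles to MULSS */
 double mul_f64(double a, double b) { return a * b; }  /* compiles to MULSD */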

no-bugs:

You're right (and I've added those, thanks) - however, the "order of magnitude" estimate for real-world calculations didn't change for multiplications; real-world code is usually much closer to the "latency" metric than to "reciprocal throughput" (especially for relatively short ops), and the latency is 5 cycles on Haswell and 3 on Broadwell. On the other hand, for FP divisions the estimate might indeed become a bit different; I need to think about how to show it properly.

Ono-Sendai:

See https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=div_ss&expand=2080,2077,2080

div_ss has a latency of <= 14 cycles.

Ono-Sendai:

Also main RAM reads can take much more than 150 cycles.

Ono-Sendai:

Also, I have measured virtual function calls to be much faster: about 8 cycles for predicted calls, 22 cycles for unpredicted ones.

no-bugs:

Can you elaborate when exactly? Do you mean page faults, TLBs, or something else?

Ono-Sendai:

I mean the total latency for reading from main memory can be quite high when a large amount of memory is used, e.g. 300 or 400 cycles when you get up to 10GB or more. I think what happens is that you start trying to access virtual memory pages that aren't in the TLB. This means the TLB entries themselves have to be fetched from main memory, which can take a while.

no-bugs:

Yeah, TLB is one thing which can affect it; will think how to incorporate it into the big picture...

valarauca8:

It should be mentioned that using likely()/unlikely()/__builtin_expect() isn't a great idea. If you are wrong ~30% of the time, the resulting pipeline stalls are a net performance loss, not a gain (source: LWN, Dec 2010).

Generally speaking, don't do branch hinting; if you feel you need to, test, measure, and ensure there aren't other, simpler optimizations you can make. If you are going to add branch hints, profile ahead of time to ensure your hints are correct >80% of the time.
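
For reference, the usual GCC/Clang idiom looks something like the sketch below (the macro names are just the common convention and the function is hypothetical, not from the article):

 #define likely(x)   __builtin_expect(!!(x), 1)
 #define unlikely(x) __builtin_expect(!!(x), 0)

 int process(const int *p)
 {
     if (unlikely(p == NULL))  /* only hint branches you have profiled as rare */
         return -1;
     return *p * 2;
 }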

minno:

IIRC the main benefit is for cache, not for branch prediction. The compiler has the choice to translate

A;
if c1 {
 B;
} else {
 C;
}
D;

as

 A
 if !c1 goto else
 B
 goto endif
else:
 C
endif:
 D

or

 A
 if c1 goto if
 C
 goto endif
if:
 B
endif:
 D

or even

 A
 if !c1 goto else
 B
endif:
 D
else:
 C
 goto endif

since the ordering (ABCD, ACBD, ABDC) influences which blocks will be in the instruction cache after A is executed. If c1 is very likely, it's better to use the ABDC version even though it involves two jumps and a definite icache miss if block C is executed, since the usual ABD path is straight-line and jump-free, and prefetching will almost certainly prevent cache misses.

no-bugs:

Yep, I've changed it (and here is the link to it too: http://lwn.net/Articles/255364/ )

Hellrazor236:

I would recommend using them for error-checking due to speed not really mattering when shit's going wrong.

valarauca8:

Depends how often shit goes wrong.

FireCrack:

Hmm, you recommend one thing, but your numbers seem to suggest a different course of action. There are many cases where I can be wrong far less than 30% of the time.

valarauca8:

If you read the comment section of the linked article, they actually derive useful engineering calculations based on pipeline flush time vs. percentage error.

SkoomaDentist:

Sometimes you do know beforehand that a branch is likely / unlikely due to the inherent algorithm. A typical example would be edge cases that need to be handled for correctness but you want to avoid slowing the common path. Another would be error checking where the error handling itself is already costly enough that extra slowdown would have no practical effect on the overall timing.

derpderp3200:

Last resource I read on the topic implied that it's basically always a bad idea unless it's profiler-backed or you're wrong like 0.1% of the time or less.

phire:

The cost for the direct C function call looks too large; I'd like to see the working on that.

In an ABI which allows you to pass the first 4-8 arguments through registers, your C function call basically maps to some register moves (which are more or less free), the call, and then the ret. And call/ret pairs are super optimised, so they normally predict the correct return address. There is nothing stopping you executing the ret before the return address finishes being written to the stack.
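
For example (hypothetical functions, assuming the SysV x86-64 ABI, where the first six integer/pointer arguments go in rdi/rsi/rdx/rcx/r8/r9):

 long add3(long a, long b, long c) { return a + b + c; }

 /* The arguments land directly in rdi/rsi/rdx, so the call is little more
    than a couple of movs, a call and a ret - no stack traffic for the
    arguments themselves. */
 long caller(long x) { return add3(x, 1, 2); }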

Likewise, the other function calls are probably too high.

Edit: Oh, I scrolled down and found the citation. It's an estimate from a 17-year-old C++ book. On a Pentium, with an older ABI that pushes all arguments onto the stack, 25 cycles is a reasonable minimum.

But it's horrendously outdated today.

no-bugs:

It is indeed outdated, but I don't have any better ref ATM :-(. Also - don't forget about the costs of push/pop (registers need to be freed/restored for the new function - that is, if it does some useful work) - and these are pretty large. Also, from what I've seen recently, 20-30 cycles for a not-purely-getter function looks like a good guesstimate.

ThisIs_MyName:

Don't forget the cost of process-switching and killing the TLB: https://en.wikipedia.org/wiki/Translation_lookaside_buffer#Address_space_switch

renozyx:

I thought that modern TLBs had "address space identifier" so you didn't necessarily need to kill the TLB?

ThisIs_MyName:

Sure, but by the time process #2 finishes, it will have overwritten most of the entries used by process #1.

ThisFrickinSite:

For a more comprehensive list of CPU instruction timings on a per-chip-design basis, this is a good resource (and is cited by this article, but you might not have noticed): http://www.agner.org/optimize/instruction_tables.pdf

201109212215:

For those of you who want to connect this to other paradigms (with the caveat that these are tradeoffs):

With the tradeoffs being:

Paul_Dirac_:

AFAIK the Intel branch prediction works as follows: when a branch is encountered that is not in the branch buffer, conditional forward branches are predicted not taken, while conditional backward branches are predicted taken (roughly: an if statement is predicted as true, and a loop is assumed to be iterated multiple times). The branch buffer has multiple entries per branch and can also predict changing branch directions for some specific branches. Lastly, there are conditional instructions that do not count as branching, so small if statements and the ternary operator will probably not result in a branch (but cost 2-4 instructions). In general, branching cost is strongly influenced by the compiler, and the prediction algorithms are good enough for most cases, so branch optimization should be regarded as premature optimization.

I am also missing a discussion of prefetchers and especially of false sharing.
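
Regarding those conditional instructions, a trivial hypothetical sketch - on x86 a select like this typically becomes a CMOV rather than a conditional jump (though the compiler is free to pick either):

 int clamp_min(int x, int lo)
 {
     return (x < lo) ? lo : x;  /* usually cmp + cmov, nothing for the branch predictor to miss */
 }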

phire:

This was true in the Pentium Pro days, but it's provably better to just pick a random branch direction on unknown branches.

All Intel CPUs since the Pentium 4 use this scheme (and possibly more). Technically it's not pure random - it's based on the state of the branch entry that it's evicting - but it should be taken/not taken with about 50% probability.

minno:

it's provably better to just pick a random branch direction on unknown branches.

Why is random better than an arbitrary heuristic? Is the backwards->yes, forwards->no heuristic really wrong more often than not?

derpderp3200:

Threading idea: What if the cache(s) were split into many smaller segments, and only invalidated as needed, so threads with particularly predictable memory access patterns only cause minor amounts of invalidation?

Can someone with more knowledge tell me if this would work? (with hardware support probably, of course)

phire:

You are better off using all the cache for the single thread.

Cache is expensive, and thread switches are actually reasonably uncommon compared to other operations. By default, the Windows scheduler only forces thread switches every 5-15 ms, unless threads yield or block early.

The cache actually fills quite fast, and who knows if the thread is going to want the exact same data after it switches back.

derpderp3200:

You are better off using all the cache for the single thread.

Why? If you're just accessing memory sequentially from a few locations, you don't need 12MB of cache - you'd probably be fine with a few kilobytes. Maybe it wouldn't do much in most cases, but sometimes it could save a lot of those 1M cycles.

phire:

Oh, I was assuming L1 cache, which is on the order of 32KB. With L3 cache you might actually see an advantage.

If you are streaming memory sequentially, there are special methods for accessing memory which skip the cache altogether.

derpderp3200:

If you are streaming memory sequentially, there are special methods for accessing memory which skip the cache altogether.

Oh, I didn't know about that. Could you expand on that?

AcceptingHorseCock:

Try starting with http://stackoverflow.com/questions/28684812/how-to-write-or-read-memory-without-touching-cache, many more similar questions on SO.
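
A minimal sketch of one such method - SSE2 non-temporal (streaming) stores - assuming a 16-byte-aligned destination and a length that is a multiple of four ints (hypothetical function name):

 #include <stddef.h>
 #include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_sfence */

 void stream_fill(int *dst, int value, size_t n)
 {
     __m128i v = _mm_set1_epi32(value);
     for (size_t i = 0; i < n; i += 4)
         _mm_stream_si128((__m128i *)(dst + i), v);  /* write-combining store, bypasses the cache */
     _mm_sfence();  /* order the streaming stores before later normal stores */
 }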

NasenSpray:

Cache partitioning

no-bugs:

By default, the Windows scheduler only forces thread switches every 5-15 ms, unless threads yield or block early.

Actually, IIRC it is 100-200 ms.

jmickeyd:

Intel added something like this in Haswell and expanded it in Broadwell, called CAT (Cache Allocation Technology). Basically you can divide the L3 cache up and pick which cores get which percentage. That way the big, low-priority task doesn't steal cache from the small, high-priority task.

CODESIGN2:

Could it be noted somewhere that non-trivial programs will likely need to load floating-point numbers from RAM or devices, making the true cost RAM + device + FP cycle time? I mention this because a lot of people seem to talk about isolated operations (especially FPU ops) without thinking about the load time for said operations.

kbumsik:

I am not an expert in x86; I've only used Cortex-M series microcontrollers. But does multiplication take more than 1 cycle? Even a tiny microcontroller takes 1-2 cycles for MUL operations. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/BABCFBDJ.html

Enlightenment777:

Every processor family is different. That's what's bad about these types of articles that don't take the various different processors into account. The graph would look very different for any ARM processor that doesn't have a cache or external memory.

nwmcsween:

This is sort of why Linux, Windows, and basically any other punt-stuff-into-kernel-space OS is bound to die eventually (probably when 100 Gbps NICs become the norm), as the cost of actually doing any work is minuscule compared to the overhead (this of course depends on what the syscall does). Most OSes use nasty hacks to get around this - batching, vDSO, etc. - but a better solution would probably be a kernel that simply multiplexes resources, everything in userspace, and some sort of whole-program JIT (LTO might be able to do this) for really fast traces.

derpderp3200:

What do you mean by multiplexing resources? Can you overall expand on your concept?

nwmcsween:

Take Barrelfish, for example, which is a multi-kernel exokernel. The many papers on exokernels can tell you more than I could.

derpderp3200:

I don't want to be told everything in detail, or spend dozens of hours on papers. I'm interested in overview-level knowledge and general facts and tidbits. Which I assume you could tell me much better than papers could :P

