Comments on this post from Reddit

MoonsOfJupiter:

Looks like he's basing his floating-point costs off of the old x87 instructions (FMUL) that almost nobody uses. Even for scalar operations you're generally better off using SSE or AVX instructions such as MULSS (32-bit float) or MULSD (64-bit float); on Haswell these both have a reciprocal throughput of 0.5 cycles.
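
For reference, plain scalar float math already compiles to these instructions with any modern x86-64 compiler - a trivial hypothetical sketch, not from the article:

 float  mul_f32(float a, float b)   { return a * b; }  /* compiles to MULSS */
 double mul_f64(double a, double b) { return a * b; }  /* compiles to MULSD */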

no-bugs:

You're right (and I've added those, thanks) - however, the "order of magnitude" estimate for real-world calculations didn't change for multiplications; real-world code is usually much closer to the "latency" metric than to "reciprocal throughput" (especially for relatively short ops), and the latency is 5 cycles on Haswell and 3 on Broadwell. On the other hand, for FP divisions the estimate might indeed become a bit different; I need to think about how to show it properly.

Ono-Sendai:

See https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=div_ss&expand=2080,2077,2080

div_ss has a latency of <= 14 cycles.

Ono-Sendai:

Also main RAM reads can take much more than 150 cycles.

Ono-Sendai:

Also, I have measured virtual function calls to be much faster: about 8 cycles for predicted calls, 22 cycles for unpredicted ones.

no-bugs:

Can you elaborate when exactly? Do you mean page faults, TLBs, or something else?

Ono-Sendai:

I mean the total latency for reading from main memory can be quite high when a large amount of memory is used, e.g. 300 or 400 cycles when you get up to 10GB or more. I think what happens is that you start trying to access virtual memory pages that aren't in the TLB. This means the TLB entries themselves have to be fetched from main memory, which can take a while.

no-bugs:

Yeah, TLB is one thing which can affect it; will think how to incorporate it into the big picture...

valarauca8:

It should be mentioned that using likely()/unlikely()/__builtin_expect() isn't a great idea. If you are wrong ~30% of the time, the resulting pipeline stalls are a net performance loss, not a gain (source: LWN, Dec 2010).

Generally speaking, don't do branch hinting; if you feel you need to, test, measure, and ensure there aren't other, simpler optimizations you can make. If you are going to add branch hints, profile ahead of time to ensure your hints are correct >80% of the time.
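
For reference, the usual GCC/Clang idiom looks something like the sketch below (the macro names are just the common convention and the function is hypothetical, not from the article):

 #define likely(x)   __builtin_expect(!!(x), 1)
 #define unlikely(x) __builtin_expect(!!(x), 0)

 int process(const int *p)
 {
     if (unlikely(p == NULL))  /* only hint branches you have profiled as rare */
         return -1;
     return *p * 2;
 }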

minno:

IIRC the main benefit is for cache, not for branch prediction. The compiler has the choice to translate

A;
if c1 {
 B;
} else {
 C;
}
D;

as

 A
 if !c1 goto else
 B
 goto endif
else:
 C
endif:
 D

or

 A
 if c1 goto if
 C
 goto endif
if:
 B
endif:
 D

or even

 A
 if !c1 goto else
 B
endif:
 D
else:
 C
 goto endif

since the ordering (ABCD, ACBD, ABDC) influences which blocks will be in the instruction cache after A is executed. If c1 is very likely, it's better to use the ABDC version even though it involves two jumps and a definite icache miss if block C is executed, since the usual ABD path is straight-line and jump-free, and prefetching will almost certainly prevent cache misses.

no-bugs:

Yep, I've changed it (and here is the link to it too: http://lwn.net/Articles/255364/ )

Hellrazor236:

I would recommend using them for error-checking due to speed not really mattering when shit's going wrong.

valarauca8:

Depends how often shit goes wrong.

FireCrack:

Hmm, you recommend one thing, but your numbers seem to suggest a different course of action. There are many cases where I can be wrong far less than 30% of the time.

valarauca8:

If you read the comment section of the linked article, they actually derive useful engineering calculations based on pipeline flush time vs. percentage error.

SkoomaDentist:

Sometimes you do know beforehand that a branch is likely / unlikely due to the inherent algorithm. A typical example would be edge cases that need to be handled for correctness but you want to avoid slowing the common path. Another would be error checking where the error handling itself is already costly enough that extra slowdown would have no practical effect on the overall timing.

derpderp3200:

Last resource I read on the topic implied that it's basically always a bad idea unless it's profiler-backed or you're wrong like 0.1% of the time or less.

phire:

The cost for the direct C function call looks too large; I'd like to see the working on that.

In an ABI which allows you to pass the first 4-8 arguments through registers, your C function call basically maps to some register moves (which are more or less free), the call, and then the ret. And call/ret pairs are super optimised, so they normally predict the correct return address. There is nothing stopping you executing the ret before the return address finishes being written to the stack.
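
For example (hypothetical functions, assuming the SysV x86-64 ABI, where the first six integer/pointer arguments go in rdi/rsi/rdx/rcx/r8/r9):

 long add3(long a, long b, long c) { return a + b + c; }

 /* The arguments land directly in rdi/rsi/rdx, so the call is little more
    than a couple of movs, a call and a ret - no stack traffic for the
    arguments themselves. */
 long caller(long x) { return add3(x, 1, 2); }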

Likewise, the other function calls are probably too high.

Edit: Oh, I scrolled down and found the citation. It's an estimate from a 17-year-old C++ book. On a Pentium, with an older ABI that pushes all arguments onto the stack, 25 cycles is a reasonable minimum.

But it's horrendously outdated today.

no-bugs:

It is indeed outdated, but I don't have any better ref ATM :-(. Also - don't forget about the costs of push/pop (registers need to be freed/restored for the new function - that is, if it does some useful work) - and these are pretty large. Also, from what I've seen recently, 20-30 cycles for a not-purely-getter function looks like a good guesstimate.

ThisIs_MyName:

Don't forget the cost of process-switching and killing the TLB: https://en.wikipedia.org/wiki/Translation_lookaside_buffer#Address_space_switch

renozyx:

I thought that modern TLBs had "address space identifier" so you didn't necessarily need to kill the TLB?

ThisIs_MyName:

Sure, but by the time process #2 finishes, it will have overwritten most of the entries used by process #1.

ThisFrickinSite:

For a more comprehensive list of CPU instruction timings on a per-chip-design basis, this is a good resource (and is cited by this article, but you might not have noticed): http://www.agner.org/optimize/instruction_tables.pdf

201109212215:

For those of you who want to connect this to other paradigms (with the caveat that these are tradeoffs):

With the tradeoffs being:

Paul_Dirac_:

AFAIK the Intel branch prediction works as follows: when a branch is encountered that is not in the branch buffer, conditional forward branches are predicted not taken, while conditional backward branches are predicted taken (roughly: an if statement is predicted as true, and a loop is assumed to be iterated multiple times). The branch buffer has multiple entries per branch and can also predict changing branch directions for some specific branches. Lastly, there are conditional instructions that do not count as branching, so small if statements and the ternary operator will probably not result in a branch (but cost 2-4 instructions). In general, branching cost is strongly influenced by the compiler, and the prediction algorithms are good enough for most cases, so branch optimization should be regarded as premature optimization.

I am also missing a discussion of prefetchers and especially of false sharing.
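
Regarding those conditional instructions, a trivial hypothetical sketch - on x86 a select like this typically becomes a CMOV rather than a conditional jump (though the compiler is free to pick either):

 int clamp_min(int x, int lo)
 {
     return (x < lo) ? lo : x;  /* usually cmp + cmov, nothing for the branch predictor to miss */
 }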

phire:

This was true in the Pentium Pro days, but it's provably better to just pick a random branch direction on unknown branches.

All Intel CPUs since the Pentium 4 use this scheme (and possibly more). Technically it's not pure random - it's based on the state of the branch entry that it's evicting - but it should be taken/not taken with about 50% probability.

minno:

it's provably better to just pick a random branch direction on unknown branches.

Why is random better than an arbitrary heuristic? Is the backwards->yes, forwards->no heuristic really wrong more often than not?

derpderp3200:

Threading idea: What if the cache(s) were split into many smaller segments, and only invalidated as needed, so threads with particularly predictable memory access patterns only cause minor amounts of invalidation?

Can someone with more knowledge tell me if this would work? (with hardware support probably, of course)

phire:

You are better off using all the cache for the single thread.

Cache is expensive, and thread switches are actually reasonably uncommon compared to other operations. By default, the Windows scheduler only forces thread switches every 5-15 ms, unless threads yield or block early.

The cache actually fills quite fast, and who knows if the thread is going to want the exact same data after it switches back.

derpderp3200:

You are better off using all the cache for the single thread.

Why? If you're just accessing memory sequentially from a few locations, you don't need 12MB of cache - you'd probably be fine with a few kilobytes. Maybe it wouldn't do much in most cases, but sometimes it could save a lot of those 1M cycles.

phire:

Oh, I was assuming L1 cache, which is on the order of 32KB. With L3 cache you might actually see an advantage.

If you are streaming memory sequentially, there are special methods for accessing memory which skip the cache altogether.

derpderp3200:

If you are streaming memory sequentially, there are special methods for accessing memory which skip the cache altogether.

Oh, I didn't know about that. Could you expand on that?

AcceptingHorseCock:

Try starting with http://stackoverflow.com/questions/28684812/how-to-write-or-read-memory-without-touching-cache, many more similar questions on SO.
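
A minimal sketch of one such method - SSE2 non-temporal (streaming) stores - assuming a 16-byte-aligned destination and a length that is a multiple of four ints (hypothetical function name):

 #include <stddef.h>
 #include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_sfence */

 void stream_fill(int *dst, int value, size_t n)
 {
     __m128i v = _mm_set1_epi32(value);
     for (size_t i = 0; i < n; i += 4)
         _mm_stream_si128((__m128i *)(dst + i), v);  /* write-combining store, bypasses the cache */
     _mm_sfence();  /* order the streaming stores before later normal stores */
 }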

NasenSpray:

Cache partitioning

no-bugs:

By default, the Windows scheduler only forces thread switches every 5-15 ms, unless threads yield or block early.

Actually, IIRC it is 100-200 ms.

jmickeyd:

Intel added something like this in Haswell and expanded it in Broadwell, called CAT (Cache Allocation Technology). Basically you can divide the L3 cache up and pick which cores get which percentage. That way the big, low-priority task doesn't steal cache from the small, high-priority task.

CODESIGN2:

Could it be noted somewhere that non-trivial programs will likely need to load floating-point numbers from RAM or devices, making the true cost RAM + device + FP cycle time? I mention this because a lot of people seem to talk about isolated operations (especially FPU ops) without thinking about the load time for said operations.

kbumsik:

I am not an expert in x86; I've only used Cortex-M series microcontrollers. But does multiplication take more than 1 cycle? Even a tiny microcontroller takes 1-2 cycles for MUL operations. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/BABCFBDJ.html

Enlightenment777:

Every processor family is different. That's what's bad about these types of articles that don't take the various different processors into account. The graph would look very different for any ARM processor that doesn't have a cache or external memory.

nwmcsween:

This is sort of why Linux, Windows, and basically any other punt-stuff-into-kernel-space OS is bound to die eventually (probably when 100 Gbps NICs become the norm), as the cost of actually doing any work is minuscule compared to the overhead (this of course depends on what the syscall does). Most OSes use nasty hacks to get around this - batching, vDSO, etc. - but a better solution would probably be a kernel that simply multiplexes resources, everything in userspace, and some sort of whole-program JIT (LTO might be able to do this) for really fast traces.

derpderp3200:

What do you mean by multiplexing resources? Can you overall expand on your concept?

nwmcsween:

Take Barrelfish, for example, which is a multi-kernel exokernel. The many papers on exokernels can tell you more than I could.

derpderp3200:

I don't want to be told everything in detail, or spend dozens of hours on papers. I'm interested in overview-level knowledge and general facts and tidbits. Which I assume you could tell me much better than papers could :P

