The code intends for the sem_post() in line 97 (now 98) to only unblock target threads waiting on line 29. But after the first thread is released, the next sem_post() might also unblock a thread waiting on line 36. That would cause the thread to return to the execution of user code before all threads are done, leading to user code being executed in a mixed-credentials environment. What's more, if this happens more than once, then the mass release on line 110 (now line 111) will cause multiple threads to execute the callback at the same time, and the callbacks are currently not written to cope with that situation. Adding another semaphore allows the caller to say explicitly which threads it wants to release.
a request for this behavior has been open for a long time. the motivation is that application code, particularly under some language runtimes designed around very-low-footprint coroutine type constructs, may be operating with extremely small stack sizes unsuitable for receiving signals, using a separate signal stack for any signals it might handle. progress on this was blocked at one point trying to determine whether the implementation is actually entitled to clobber the alt stack, but the phrasing "available to the implementation" in the POSIX spec for sigaltstack seems to make it clear that the application cannot rely on the contents of this memory to be preserved in the absence of signal delivery (on the abstract machine, excluding implementation-internal signals) and that we can therefore use it for delivery of signals that "don't exist" on the abstract machine. no change is made for SIGTIMER since it is always blocked when used, and accepted via sigwaitinfo rather than execution of the signal handler.
taking the deprecated/dropped vfork spec strictly, doing pretty much anything but execve in the child is wrong and undefined. however, these are commonly needed operations to setup the child state before exec, and historical implementations tolerated them. for single-threaded parents, these operations already worked as expected in the vforked child. however, due to the need for __synccall to synchronize id/resource limit changes among all threads, calling these functions in the vforked child of a multithreaded parent caused a misdirected broadcast signaling of all threads in the parent. these signals could kill the parent entirely if the synccall signal handler had never been installed in the parent, or could be ignored if it had, or could signal/kill one or more utterly wrong processes if the parent already terminated (due to vfork semantics, only possible via fatal signal) and the parent tids were recycled. in any case, the expected number of semaphore posts would never happen, so the child would permanently hang (with all signals blocked) waiting for them. to mitigate this, and also make the normal usage case work as intended, treat the condition where the caller's actual tid does not match the tid in its thread structure as single-threaded, and bypass the entire synccall broadcast operation.
the __synccall mechanism provides stop-the-world synchronous execution of a callback in all threads of the process. it is used to implement multi-threaded setuid/setgid operations, since Linux lacks them at the kernel level, and for some other less-critical purposes. this change eliminates dependency on /proc/self/task to determine the set of live threads, which in addition to being an unwanted dependency and a potential point of resource-exhaustion failure, turned out to be inaccurate. test cases provided by Alexey Izbyshev showed that it could fail to reflect newly created threads. due to how the presignaling phase worked, this usually yielded a deadlock if hit, but in the worst case it could also result in threads being silently missed (allowed to continue running without executing the callback).
this further reduces the number of source files which need to include libc.h and thereby be potentially exposed to libc global state and internals. this will also facilitate further improvements like adding an inline fast-path, if we want to do so later.
In all cases this is just a change from two volatile int to one.
commit 78a8ef47c4d92b7680c52a85f80a81e29da86bb9 inadvertently removed the SA_RESTART flag from the sigaction for the internal signal handler used by __synccall for broadcasting. as a result, programs which did not use interrupting signals but which used set*id() in a multithreaded context could wrongly observe EINTR errors they're not prepared to handle.
the memory model we use internally for atomics permits plain loads of values which may be subject to concurrent modification without requiring that a special load function be used. since a compiler is free to make transformations that alter the number of loads or the way in which loads are performed, the compiler is theoretically free to break this usage. the most obvious concern is with atomic cas constructs: something of the form tmp=*p;a_cas(p,tmp,f(tmp)); could be transformed to a_cas(p,*p,f(*p)); where the latter is intended to show multiple loads of *p whose resulting values might fail to be equal; this would break the atomicity of the whole operation. but even more fundamental breakage is possible. with the changes being made now, objects that may be modified by atomics are modeled as volatile, and the atomic operations performed on them by other threads are modeled as asynchronous stores by hardware which happens to be acting on the request of another thread. such modeling of course does not itself address memory synchronization between cores/cpus, but that aspect was already handled. this all seems less than ideal, but it's the best we can do without mandating a C11 compiler and using the C11 model for atomics. in the case of pthread_once_t, the ABI type of the underlying object is not volatile-qualified. so we are assuming that accessing the object through a volatile-qualified lvalue via casts yields volatile access semantics. the language of the C standard is somewhat unclear on this matter, but this is an assumption the linux kernel also makes, and seems to be the correct interpretation of the standard.
multi-threaded set*id and setrlimit use the internal __synccall function to work around the kernel's wrongful treatment of these process properties as thread-local. the old implementation of __synccall failed to be AS-safe, despite POSIX requiring setuid and setgid to be AS-safe, and was not rigorous in assuring that all threads were caught. in a worst case, threads late in the process of exiting could retain permissions after setuid reported success, in which case attacks to regain dropped permissions may have been possible under the right conditions. the new implementation of __synccall depends on the presence of /proc/self/task and will fail if it can't be opened, but is able to determine that it has caught all threads, and does not use any locks except its own. it thereby achieves AS-safety simply by blocking signals to preclude re-entry in the same thread. with this commit, all known conformance and safety issues in set*id functions should be fixed.
the main motivation for this change is to remove the assumption that the tid of the main thread is also the pid of the process. (the value returned by the set_tid_address syscall was used to fill both fields despite it semantically being the tid.) this is historically and presently true on linux and unlikely to change, but it conceivably could be false on other systems that otherwise reproduce the linux syscall api/abi. only a few parts of the code were actually still using the cached pid. in a couple places (aio and synccall) it was a minor optimization to avoid a syscall. caching could be reintroduced, but lazily as part of the public getpid function rather than at program startup, if it's deemed important for performance later. in other places (cancellation and pthread_kill) the pid was completely unnecessary; the tkill syscall can be used instead of tgkill. this is actually a rather subtle issue, since tgkill is supposedly a solution to race conditions that can affect use of tkill. however, as documented in the commit message for commit 7779dbd2663269b465951189b4f43e70839bc073, tgkill does not actually solve this race; it just limits it to happening within one process rather than between processes. we use a lock that avoids the race in pthread_kill, and the use in the cancellation signal handler is self-targeted and thus not subject to tid reuse races, so both are safe regardless of which syscall (tgkill or tkill) is used.