I know that the system call interface is implemented on a low level and hence architecture/platform dependent, not "generic" code.
Yet I cannot clearly see why the system call numbers used by 32-bit x86 Linux kernels were not kept the same on the closely related 64-bit x86_64 architecture. What is the motivation/reason behind this decision?
My first guess was that the underlying reason was to keep 32-bit applications runnable on an x86_64 system, so that a suitable offset in the system call number would tell the kernel whether user space is 32-bit or 64-bit. This is however not the case. At least it seems to me that read() being system call number 0 on x86_64 cannot be reconciled with this idea.
Another guess has been that changing the system call numbers might have a security/hardening background, something I was not able to confirm myself.
Being ignorant of the challenges of implementing the architecture-dependent parts of the code, I still wonder how changing the system call numbers, when there seems to be no need (even a 16-bit register could store far more than the current ~346 numbers needed to represent all calls), would achieve anything other than breaking compatibility (though using the system calls through a library, libc, mitigates that).
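To make the mismatch concrete, here is a minimal Python sketch that compares a handful of syscall numbers across the two ABIs. The six numbers per table are copied by hand from the kernel's i386 and x86_64 syscall tables and are only a small subset; the dictionary and function names are invented for this illustration.

```python
# Hand-copied subset of syscall numbers (illustrative, not a complete table).
I386 = {"exit": 1, "fork": 2, "read": 3, "write": 4, "open": 5, "close": 6}
X86_64 = {"read": 0, "write": 1, "open": 2, "close": 3, "fork": 57, "exit": 60}


def differing(a, b):
    """Return the sorted names whose numbers differ between two tables."""
    return sorted(name for name in a if name in b and a[name] != b[name])


def fits_in_bits(count, bits):
    """True if `count` distinct syscall numbers fit in a `bits`-wide register."""
    return count <= 2 ** bits


if __name__ == "__main__":
    # Print the two numberings side by side, ordered by the i386 number.
    for name in sorted(I386, key=I386.get):
        print(f"{name:6} i386={I386[name]:3} x86_64={X86_64[name]:3}")
    print("differ:", differing(I386, X86_64))
    print("~346 calls fit in 16 bits:", fits_in_bits(346, 16))
```

None of these six calls keeps its number, and there is no constant offset between the two columns, which is what rules out the "offset marks the ABI" guess above.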
I think you are asking the wrong question. The correct question is why keep them the same. Answer: compatibility. So if x86 and x86_64 are incompatible, then there are no forces to keep them from changing. Now all the forces from the last 20 years that wanted change will dominate (we get a chance to change them). [Note this is just opinion and not based on the inner mind of the designers of the new system.] – ctrl-alt-delor, Jan 19, 2017 at 16:28
4 Answers
As for the reasoning behind the specific numbering, which does not match any other architecture [except "x32", which is really just part of the x86_64 architecture]: in the very early days of x86_64 support in the Linux kernel, before there were any serious backwards-compatibility constraints, all of the system calls were renumbered to optimize cache-line usage.
I don't know enough about kernel development to know the specific basis for these choices, but apparently there is some logic behind the choice to renumber everything with these particular numbers rather than simply copying the list from an existing architecture and removing the unused ones. It looks like the order may be based on how commonly they are called, e.g. read/write/open/close are up front. Exit and fork may seem "fundamental", but they're each called only once per process.
There may also be something going on with keeping system calls that are commonly used together within the same cache line (these values are just integers, but there's a table in the kernel with function pointers for each one, so each group of 8 system calls occupies a 64-byte cache line of that table).
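The arithmetic behind "8 system calls per cache line" can be sketched in a few lines of Python. This is only a model: it assumes 8-byte function pointers, 64-byte cache lines, and a table that starts on a cache-line boundary; the names and the six sample numbers (taken from the x86_64 table) are chosen for illustration.

```python
POINTER_SIZE = 8      # bytes per function pointer on a 64-bit kernel
CACHE_LINE_SIZE = 64  # bytes; assumed typical x86_64 cache line size

# Small hand-copied subset of x86_64 syscall numbers.
X86_64_NUMBERS = {"read": 0, "write": 1, "open": 2, "close": 3,
                  "fork": 57, "exit": 60}


def cache_line_of(syscall_nr):
    """Index of the cache line holding this table entry.

    Assumes the table itself is aligned to a cache-line boundary.
    """
    return (syscall_nr * POINTER_SIZE) // CACHE_LINE_SIZE


def group_by_cache_line(numbers):
    """Map cache-line index -> syscall names, ordered by syscall number."""
    groups = {}
    for name, nr in sorted(numbers.items(), key=lambda kv: kv[1]):
        groups.setdefault(cache_line_of(nr), []).append(name)
    return groups


if __name__ == "__main__":
    for line, names in group_by_cache_line(X86_64_NUMBERS).items():
        print(f"cache line {line}: {', '.join(names)}")
```

Under these assumptions read, write, open and close all land in the first cache line of the table, while fork and exit sit several lines further in.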
"fork may seem 'fundamental', but [...] called only once per process." Uh, what? I understand you may expect to call exit once, but you may fork inside the parent and child of a fork() call. – cat, Jan 20, 2017 at 1:22
@cat if you view the fork as being accounted to the child process (i.e. view it as the process creation call) rather than to the parent process, then Random832's statement is correct. – icarus, Jan 20, 2017 at 2:35
@cat OK, you might call fork() two or three times, maybe a few more. But you may call read() millions or even billions of times. – Michael Hampton, Jan 20, 2017 at 3:18
Yes, that's what I meant. The number of fork calls and the number of processes over the lifetime of the system is going to be identical, ignoring details like init, clone [which can create processes or threads], etc. – Random832, Jan 20, 2017 at 13:11
See the answer to the question "Why are the system call numbers different in amd64 linux?" on Stack Overflow.
To sum it up: for the sake of compatibility, the system call list is stable and can only grow. When the x86_64 architecture appeared, the ABI (argument passing, return value) was different, so the kernel developers took the opportunity to make changes that had long been awaited.
Cool, my guess was correct. – ctrl-alt-delor, Jan 19, 2017 at 16:35
That other answer you link to is speculative: it says "the Linux guys most likely decided..." (emphasis added). I think it would help if your answer here provided some indication that it is apparently based on speculation rather than evidence. Incidentally, a more recent comment posted under the linked answer provides evidence that the true reason is not generic cleanup of cruft (as that answer speculates), but is specifically about "cacheline usage", as explained in the other answer here. – D.W., Jan 20, 2017 at 11:37
Firstly, the syscall numbers were chosen to be compatible with SysV UNIX running on the same platform, where possible.
Secondly, there is a common base set for everything other than i386, but there are minor discrepancies where additional generic syscalls were added with numbers that overlap architecture-specific syscalls that had already been added.
The x32 and x86_64 ABIs use the same syscall numbering, except where x32 requires the kernel to do extra address mapping.
The i386 syscall numbering is just a mess of multiple versions of the same thing with different parameter widths. For example, stat has at least 5 syscall numbers, allowing for 16/32/64-bit inode numbers, 16/32-bit uid & gid, 32/64/96-bit timestamps, and the addition of .st_btime. Almost all the file-related syscalls have multiple versions to cope with these parameter widenings.
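As a rough illustration of that accumulation, the sketch below lists the path-taking members of the stat family on i386 next to their x86_64 counterparts. The numbers are copied by hand from the kernel's syscall tables (arch/x86/entry/syscalls/syscall_32.tbl and syscall_64.tbl) and should be checked against the kernel version you care about; the dictionary and function names are made up for this example, and it does not attempt to say which field widths each variant introduced.

```python
# Path-taking stat variants only (lstat/fstat siblings omitted for brevity).
# Numbers are hand-copied; verify against your kernel's syscall tables.
I386_STAT_FAMILY = {
    "oldstat": 18,
    "stat": 106,
    "stat64": 195,
    "fstatat64": 300,
    "statx": 383,
}
X86_64_STAT_FAMILY = {
    "stat": 4,
    "newfstatat": 262,
    "statx": 332,
}


def only_in(a, b):
    """Names present in table `a` but absent from table `b`."""
    return sorted(set(a) - set(b))


if __name__ == "__main__":
    print("i386 variants:  ", len(I386_STAT_FAMILY))
    print("x86_64 variants:", len(X86_64_STAT_FAMILY))
    print("i386-only names:", only_in(I386_STAT_FAMILY, X86_64_STAT_FAMILY))
```

The x86_64 table started late enough to skip the oldest layouts, so it carries fewer historical variants of the same call.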
Many of the syscalls present in i386 are dropped entirely on newer platforms, replaced by library functions: getuid and geteuid use the getresuid syscall; stat, lstat and fstat use the fstatat syscall; fork & pthread_create use the clone syscall; etc.
Some architectures
In short, because somebody thought "N+1 gratuitously incompatible ways of doing it are better than N ways". For historical archs, the syscall numbers were usually chosen to match some legacy proprietary unix. But for x86_64 the kernel developers were free to choose any numbering they liked. Rather than making the simple choice and reusing an existing numbering, they made the choice to invent a new standard. Then they did it again for aarch64 and a bunch of others. This is an oft-repeated pattern in Linux kernel development.
The change was not gratuitous. There are solid technical reasons. If it weren't for backwards-compatibility requirements, similar changes would have been applied to the existing architectures as well. – Jörg W Mittag, Jan 20, 2017 at 9:54
Difference in numbering is 100% gratuitous. There is no technical advantage to any particular numbering. – R.. GitHub STOP HELPING ICE, Jan 20, 2017 at 14:37
As this other answer explains, syscalls are grouped such that syscalls which are commonly used together share the same cacheline in the syscall table. And syscall numbers are chosen such that they are simple indices into that table. Theoretically, we could use a layer of indirection to decouple the position of a syscall in the syscall table from the syscall number, but that would possibly eat up a portion of the performance gains we get from putting hot syscalls in the same cacheline. – Jörg W Mittag, Jan 20, 2017 at 14:55
@JörgWMittag: And that's obviously premature optimization and not a measurable improvement. Just look at how many cycles syscalls take and how many cache lines they evict. Saving at best one cache line from ordering of the table is not going to make a difference. – R.. GitHub STOP HELPING ICE, Jan 20, 2017 at 15:49
@R.. "I choosed the numbering in function of tpcc kernel profiling info with popular DBMS and strace output of some network and desktop application." certainly does sound like there were measurements. However, the author neither supplied numbers nor adequately explained the methodology. – user45891, Jan 20, 2017 at 16:09