Created: 2016-01-23
Some complementary material on interpreter directives and command name issues can be found in Wikipedia.
You can also read the email in which Dennis Ritchie introduced #!.
The command name for any Unix script must be stable for any complex system based on it to be stable. However, this is being compromised through practices based on misinformation. † This paper explores how scripts are actually run, how naming affects correctness and stability, and various common misconceptions in order to clarify the reasons behind standard practice - which is:
Command names should never have filename extensions.
Command name extensions have numerous issues:
Ironically all of these are problems involving interpretation by humans.
Herein, a problem with filename extensions is described in a manner perhaps more pragmatic than, yet inspired by, the well known Go To Statement Considered Harmful by Edsger W. Dijkstra (Communications of the ACM, Vol. 11, No. 3, March 1968). Dijkstra's work addresses the issue of how the use of the go to statement largely abridges the ability to parametrically describe the progress of a process, engendering an unnecessary impediment to the code's clarity and manageability. This new document details, based on practical experience under Unix-like operating systems, how filename extensions, particularly but not limited to those files implementing commands, create a secondary set of semantic tags in the interfaces between programs which are demonstrably both superfluous and treacherous.
It's not a coincidence that, in both Dijkstra's plaint and this one, computers are not at all affected by either practice - it's entirely a problem for the humans.
For purposes of this paper, command names are the filenames of all the executable files in the directories in the Unix $PATH environment variable.
By convention, almost all such directories end in bin (nominally suggestive of a tool bin, and not restricted to binaries), with sbin (system bin) and games also being common.
Historically etc or lib occasionally appeared in $PATH, but this has become increasingly rare since the 1980s.
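To make that definition concrete, here is a minimal Python sketch that walks a $PATH-style string and reports the command names containing a dot. The function names are my own, and the "contains a dot" test is a crude assumption: it will also flag versioned names such as python3.11, which are not extensions in the sense discussed here.

```python
import os
import stat


def command_names(path_value):
    """Yield (directory, name) for each executable regular file found
    in the directories of a PATH-style string."""
    for directory in path_value.split(os.pathsep):
        # Simplification: empty entries (which mean "current directory")
        # and missing directories are skipped.
        if not directory or not os.path.isdir(directory):
            continue
        try:
            names = sorted(os.listdir(directory))
        except OSError:
            continue  # unreadable directory
        for name in names:
            try:
                mode = os.stat(os.path.join(directory, name)).st_mode
            except OSError:
                continue  # dangling symlink or unreadable entry
            # A command here is a regular file with any execute bit set.
            if stat.S_ISREG(mode) and mode & 0o111:
                yield directory, name


def names_with_extensions(path_value):
    """Return command names that contain a '.' after the first character."""
    return [name for _, name in command_names(path_value) if "." in name[1:]]


if __name__ == "__main__":
    for name in names_with_extensions(os.environ.get("PATH", "")):
        print(name)
```

Run directly, it prints the suspects in your own $PATH, which is a quick way to see how widespread the practice is on a given system.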
Consider the following examples, in which files have .sh and .py extensions, ostensibly to indicate the type of the file as well as to make it easy to list all files of the same type (shell scripts). Running them based on the apparently-correct interpreter doesn't go well (the [...] means truncated for brevity):
bison's implementation of yacc.
Theory section in the Python appendix).
These scripts show some problems around trusting extensions:
Failures aren't always as obvious as immediately exiting with an error: More subtle distinctions in script language execution, or a script with sufficient error trapping to survive being run with the wrong interpreter version, could result in incorrect results and serious damage.
(The same issue can arise through command search in $PATH finding a different version of a program than expected, especially when using virtual environments, but that's outside the scope of this document.)
Several mechanisms exist to determine how a file should be executed, whether as a set of directives or as machine code. The ones relating to this discussion are:
Interpreter directives can only be changed by modifying the files' contents, whereas file extensions can be changed arbitrarily using general filesystem commands like mv. File extensions also have a disturbing tendency to get lost in some contexts, since they're part of a Unix directory entry, not part of the file itself. In contrast, interpreter directives are quite stable. With scripts, interpreter directives are typically changed in the same manner as the other contents through using a text editor. Modern editors can usually recognize scripts by their interpreter directive, although historically special handling of certain types of text files was usually done based on the file extension.
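The difference in stability is easy to demonstrate. The following Python sketch (the helper names and the file name report.sh are hypothetical) writes a small script, renames it the way mv would, and reads the interpreter directive before and after. The extension vanishes with the rename; the directive does not.

```python
import os
import tempfile


def read_directive(path):
    """Return the text after '#!' on the file's first line, or None."""
    with open(path, "rb") as handle:
        first_line = handle.readline()
    if not first_line.startswith(b"#!"):
        return None
    return first_line[2:].decode("utf-8", "replace").strip()


def demo():
    """Rename a script and return its directive before and after."""
    with tempfile.TemporaryDirectory() as workdir:
        old_path = os.path.join(workdir, "report.sh")  # hypothetical name
        new_path = os.path.join(workdir, "report")
        with open(old_path, "w") as handle:
            handle.write("#!/bin/bash\necho hello\n")
        before = read_directive(old_path)
        # Same operation mv performs within a single filesystem.
        os.rename(old_path, new_path)
        after = read_directive(new_path)
        return before, after


if __name__ == "__main__":
    print(demo())
```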
Now, so far, command name extensions might look like no more than hints to editors to use the correct editing mode, or to humans to make it easy to ls by script type. The kernel doesn't view them specially at all - they're just more bytes in the filename. But there is an insidious problem with them: using them breaks part of the mechanism by which the implementation details are hidden from the user, and from other programs written by users. It's the humans' attempt to apply the information in these command name extensions that causes problems.
Programs in Unix often start their lives as quickly written, inefficient, under-featured shell scripts. Later, they get converted to something faster, like PERL or Python. Finally, they are often rewritten in C, C++, or something else fully compiled. If the author violates encapsulation by exposing the underlying language in a spurious extension, the command name may change from name.sh, to name.pl, to name, breaking all existing coded calls to the program each time, as well as adding to the cognitive load of human users. The more effective the user base has been at script-based factoring and reuse, the more treacherous the extensions become (i.e. proficient users often build more readily on preëxisting programs, increasing the number of dependencies on the names of those programs).
To combat the problem of breaking dependencies, what usually happens is that when the name.sh script ends up being rewritten in (for example) PERL, the now-misleading old name is retained to keep from breaking other programs which refer to it. The resulting mismatch causes extra maintenance hassles, principally for users trying to maintain the extensions, who naïvely type things like ls -l *.sh without realizing some of the listed files aren't shell scripts anymore. Such semantic dissonance leads easily to more serious issues, with scripts called by the wrong interpreters in error-suppressed contexts, truncated processing due to the resulting errors, and the resulting arbitrarily disastrous problems.
The issue of using the wrong interpreter can be subtle, since a user seeing a name.py program may enter python name.py, not realizing that the program only works with python 2.5 when 2.4 is still the system default (the script itself would have a directive like #!/usr/bin/python2.5).
Most scripts suffixed with .sh on Linux are actually bash scripts, and many versions of Unix don't include bash, just the real /bin/sh (no arrays, no $(...), no <(...), etc).
Scripts also often make delicate use of interpreter directives to have the PATH used or ignored, or special options passed in, none of which is capturable in a primitive filename extension.
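As a rough illustration of how much more a directive says than an extension can, here is a small Python sketch. The EXTENSION_GUESS table and the function names are my own assumptions, standing in for the guess a human would make from the filename. The parser splits the line into an interpreter and at most one argument, which mirrors how Linux hands the rest of the line to the interpreter as a single argument; other systems may differ.

```python
import os

# Assumption: the naive guess a human makes from the extension alone.
EXTENSION_GUESS = {".sh": "sh", ".py": "python", ".pl": "perl"}


def parse_directive(first_line):
    """Split a '#!' line into (interpreter, argument).

    argument is None when absent; the result is None when the line
    is not an interpreter directive at all.
    """
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].strip().split(None, 1)
    if not parts:
        return None
    return parts[0], (parts[1] if len(parts) > 1 else None)


def effective_interpreter(first_line):
    """Name the program that will actually interpret the script."""
    parsed = parse_directive(first_line)
    if parsed is None:
        return None
    interpreter, argument = parsed
    name = os.path.basename(interpreter)
    if name == "env" and argument:
        # Simplification: take the first word after env, ignoring
        # env's own options.
        return argument.split()[0]
    return name


def extension_guess(filename):
    """Return the interpreter the extension suggests, or None."""
    return EXTENSION_GUESS.get(os.path.splitext(filename)[1])


if __name__ == "__main__":
    print(extension_guess("name.py"))
    print(effective_interpreter("#!/usr/bin/python2.5\n"))
```

For name.py the extension suggests plain python, while the directive pins python2.5; the version, the path lookup via env, and any option such as -e exist only in the directive.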
There are cases where scripts are executed as a result of special extensions, such as the model currently used by most webservers where file handling is cued by filename extensions. However, even such subsystems often have other, more sophisticated approaches allowing those same extensions to be hidden, and thus protect URIs from a variant of the script filename extension problem, namely, how to keep all links to your website from breaking when you switch from *.html files to *.cgi, *.php, or something else. Furthermore, of the extensions just listed, note that .html files aren't scripts, .php files use a webserver builtin, and that .cgi scripts themselves require interpreter directives to be executed correctly as well as the .cgi for Apache to permit the script to be run.
Rely instead on interpreter directives, or on some other paradigm that prevents the implementation from being exposed, or worse yet, lied about, within the very name of the command. The best place for that information is in the file itself, though as noted, there are some issues to deal with through #!/usr/bin/env and other tactics.
So you have this file named foo.py...
If foo is a library with a unittest activated by being run with python foo.py or even as just ./foo.py, that's okay - that's not a command that would live in $PATH. However, if it's a full program with no library aspirations, the .py is clearly wrong. I've almost never seen a foo.py in the $PATH, since such hacks usually end up littering top level Flask directories and being placed in the $PYTHONPATH (for python libraries) instead.
There's a case where, in some bin/ directory, there are both a foo.py implementing a library, and a foo implementing the options parsing and using the library. In this situation the foo is executable and the foo.py isn't, and because the .py isn't, this situation is fine (though rare).
As an example, here's a library hellolib.py and a program hi just as described above (save for the names):
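A minimal sketch of that layout, shown as one listing so it can be run as-is; the function name greeting() and the message text are assumptions for illustration. In a real tree the first part would be the non-executable hellolib.py, and the second part would be the executable, extensionless command, starting with its own #! line and importing from hellolib.

```python
import sys

# --- hellolib.py: the library (not executable, no #! line) ---


def greeting(name="world"):
    """Return the greeting text for name."""
    return "Hello, {}!".format(name)


# --- hi: the command (executable, no extension) ---
# As a separate file this part would begin with something like
#   #!/usr/bin/env python3
#   from hellolib import greeting


def main(argv):
    """Parse the (trivial) arguments and print the greeting."""
    name = argv[1] if len(argv) > 1 else "world"
    print(greeting(name))
    return 0


if __name__ == "__main__":
    # Guard needed only because both halves share one listing here.
    main(sys.argv)
```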
There's no point in being able to run ./hellolib.py or python hellolib.py, because we're obviously just going to run nosetests hellolib anyway, as per standard practice. Otherwise, we'd have to add the rather ugly, though accepted, lines below:
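A minimal sketch of those lines, as they might sit at the end of a hypothetical hellolib.py. The greeting() function and its test case are stand-ins I've assumed for illustration; only the final if block is the part at issue. The common form is a bare unittest.main(); the argv and exit=False arguments used here are optional and just keep it from reading the caller's arguments or calling sys.exit().

```python
import unittest


def greeting(name="world"):
    """Hypothetical stand-in for the library's real contents."""
    return "Hello, {}!".format(name)


class GreetingTest(unittest.TestCase):
    def test_default(self):
        self.assertEqual(greeting(), "Hello, world!")


# The extra lines: run the tests only when the file is executed
# directly, never when it is imported as a library.
if __name__ == "__main__":
    unittest.main(argv=["hellolib"], exit=False)
```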
...which is a bit nasty, since we'd have to either add execute permission on the library file as well as a #! line, or guess at which version of Python is needed to run it manually, e.g. python hellolib.py. Also, enabling execute permission makes nosetests's decision of whether it's safe to import the file (without causing side effects) much harder, so it doesn't test executable files by default, and we risk the unittest in our library being skipped.
The issue of users wanting to be able to list, for example, all Bourne shell scripts easily with ls(1) is a big motivator to some people to name them all with .sh extensions. If ls had an option to filter based on the execution method of a file, say something like ls -e '*/sh' to list only files with /sh at the end of the first part of the interpreter directive, that would help. However, whether ls should even be doing such a job would probably be hotly, justifiably contested.
Here's an example of using a new program to address this problem:
A sample script implementing the command (obviously with no extension in case someone wants to rewrite it in Python, Ruby, C, etc.). Note that this needs #!/bin/bash specifically, since classic /bin/sh doesn't support $(...) or local.
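In the spirit of that "rewrite it in Python" remark, the same idea can be sketched in Python. This is an assumption-laden illustration rather than the script described above: the function names are hypothetical, and the pattern is matched as a glob against the first word of each file's interpreter directive.

```python
#!/usr/bin/env python3
import fnmatch
import os
import sys


def interpreter_of(path):
    """Return the interpreter path from the file's #! line, or None."""
    try:
        with open(path, "rb") as handle:
            first_line = handle.readline(256)
    except OSError:
        return None  # directory, unreadable file, and so on
    if not first_line.startswith(b"#!"):
        return None
    words = first_line[2:].decode("utf-8", "replace").split()
    return words[0] if words else None


def list_by_interpreter(pattern, paths):
    """Return the paths whose interpreter matches the glob pattern."""
    matches = []
    for path in paths:
        interpreter = interpreter_of(path)
        if interpreter is not None and fnmatch.fnmatchcase(interpreter, pattern):
            matches.append(path)
    return matches


def main(argv):
    """List matching files; with no pattern, list every script."""
    pattern = argv[1] if len(argv) > 1 else "*"
    paths = argv[2:] or sorted(os.listdir("."))
    for path in list_by_interpreter(pattern, paths):
        print(path)


if __name__ == "__main__":
    main(sys.argv)
```

Installed under some extensionless name, a pattern of '*/sh' would list the Bourne shell scripts in the current directory. One limitation of this sketch: scripts using #!/usr/bin/env match on /usr/bin/env rather than on the program env goes on to run.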
Obviously we can reimplement scripts in any language we want without telling any of its other users, because it doesn't have some [expletive deleted] extension on the end, and so for everyone else it'll just keep working.
This is... a theory.
In the late 1980s (based on my experience at the time), commandname extensions were essentially absent from the Unix realm. Almost all scripting was either in Bourne shell, or in the Csh that a few screwballs (myself included) tried to make work as a scripting language. Ksh, Tcsh, and a few others were used at some sites. Interpreter directives were required for all of them except Bourne shell scripts, since sh would attempt to execute an executable script via the kernel, but if that failed it would just assume it was an sh script (they ALL were a decade before, so it made some sense), and spawn a shell to interpret it, which worked badly when the script was actually written in any of the other things.
In the 1990s, commandname extensions showed up occasionally when DOS/Windows users started poking at Linux and dragging along the DOS extension concept with them. However, DOS hides filename extensions - you can run a DOS script even if the extension is omitted when invoking it - so in theory they were hiding metadata (and, coincidentally, creating an inroad for Trojan attacks) instead of exposing the implementation language. In contrast, Unix requires the entire name of the file to run commands - including any extensions (or a string of them) since they're just more characters - the . isn't special to the kernel, just part of the name. Essentially the DOS practice is totally wrongheaded in the Unix environment. Fortunately, during this period more experienced Unix users tended to educate the DOS arrivals soon enough to keep the practice from being all that common.
In the 2000s, and increasingly in 2010 and beyond, there was a sudden explosion in commandname extensions, not from the DOS migrants, but rather from a new sub-population of programmers in languages like PHP, PERL (to some extent), Python, Ruby and others - all languages which were NOT compiled, whose libraries tend to require extensions, and whose users typically had little to no grounding in Unix fundamentals and hadn't worked in C (which produces executables without extensions most of the time). These programmers improperly overgeneralized the use of extensions from libraries to command scripts, and then wrote lots of documentation that included this aberrant practice. And now, suddenly they're everywhere, doing it wrong while thinking it's right (that's what the docs say, after all), and driving those who actually know how it works slightly insane.
So now we the insane ones are writing little webpages like this to tell the interpreted-language crowd, please, please be more sparing in your extensions. They don't belong on commands. Really. Ever. Every time you mutilate a command by putting an extension on it, some angry computing god out there kills a kitten.
Please - think of the kittens.
Cogito ergo spud (I think therefore I yam).