From the Annals of Preprocessor Hackery
Over the last few days I’ve been slowly attacking the source code for 386MAX, trying to build the entire product. One of the many problems I ran into turned out to be quite interesting.
There are several (16-bit) Windows components in 386MAX, and many have some sort of GUI. As is common, dialogs etc. are built from templates stored in resource (.rc) files. But… trying to process many of the resource files with the standard Windows 3.1 resource compiler fails:
The problem is dialog text labels composed of multiple strings. Something like this:
LTEXT "Qualitas 386MAX\nVersion " "8" "." "03",-1, 12,7,112,17
For obvious reasons, the authors wanted to automatically update the strings with the current version numbers, and used macros to build those strings. Only that doesn’t quite work in the resource compiler.
But Why Not?
In C and C++, this is not a problem. String literals are merged together, thus the following are equivalent:
"Hello" " " "World"
"Hel" "lo World"
"Hello World"
But that does not happen in the resource compiler. We may wonder why not, but the fact is that it doesn’t.
The catch is that although the resource compiler (RC.EXE) uses a preprocessor (RCPP.EXE) which is essentially identical to a C compiler preprocessor (in fact almost certainly built from the same source code), a C preprocessor does not perform string literal merging. Again we may wonder why not, but the fact is that it doesn’t.
The upshot is that if the resource compiler expects a string, it must be supplied with a single string literal, because multiple consecutive string literals will not be merged.
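To make the C side of the comparison concrete, here is a minimal sketch. The function name `literals_are_equal` is made up for illustration. In C, adjacent string literals are joined by the compiler proper, in a translation phase that comes after preprocessing, which is why output that has only been through a preprocessor still contains the separate pieces.

```c
#include <string.h>

/* Three spellings of the same string. A C compiler joins adjacent
 * string literals after preprocessing has finished, so all three
 * arrays end up with identical contents. */
static const char spelled_in_three[] = "Hello" " " "World";
static const char spelled_in_two[]   = "Hel" "lo World";
static const char spelled_in_one[]   = "Hello World";

/* Hypothetical helper: returns 1 if all three arrays hold the same text. */
int literals_are_equal(void)
{
    return strcmp(spelled_in_three, spelled_in_two) == 0 &&
           strcmp(spelled_in_two, spelled_in_one) == 0;
}
```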
A Quality Hack
For building 386MAX, Qualitas solved the problem in a manner that is as clever as it is dirty. Qualitas wrote a tool called RCPPP, described as an “RCPP postprocessor”. The way it was used was as follows: The original resource compiler preprocessor (RCPP.EXE) had to be renamed to _RCPP.EXE, and the Qualitas RCPPP.EXE needed to be copied to RCPP.EXE.
When the resource compiler (RC.EXE) was run, the Qualitas wrapper would pass its arguments to the original preprocessor; after the preprocessor was finished, the wrapper would re-open the temporary file holding the preprocessor output, rewrite it to merge string literals, and then return control to the resource compiler. Voila, problem solved.
This was a very clever but nasty solution, because it required modifying the vendor tools (a big no-no). It was likely done that way because the Microsoft resource compiler does not offer any way to only run the preprocessor.
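The rewriting step itself is not complicated. The following is only a sketch of how such a merge could be done on one line of preprocessor output; it is not the actual RCPPP code, and the function name `merge_string_literals` is invented here. It assumes backslash escapes inside literals and only joins literals separated by spaces or tabs, so a doubled quote (`""`) inside a string is left untouched.

```c
#include <stddef.h>

/* Hypothetical helper, not the real RCPPP logic: copies `in` to `out`,
 * joining string literals that are separated only by spaces or tabs.
 * Returns 0 on success, -1 if `out` is too small. */
int merge_string_literals(const char *in, char *out, size_t outsz)
{
    size_t o = 0;
    int in_string = 0;

    while (*in != '\0') {
        char c = *in;

        if (in_string && c == '\\' && in[1] != '\0') {
            /* Copy an escape pair as-is so that \" does not end the literal. */
            if (o + 2 >= outsz)
                return -1;
            out[o++] = *in++;
            out[o++] = *in++;
            continue;
        }
        if (in_string && c == '"') {
            /* Closing quote: look past blanks for another opening quote. */
            const char *p = in + 1;
            while (*p == ' ' || *p == '\t')
                p++;
            if (p > in + 1 && *p == '"') {
                in = p + 1;   /* drop closing quote, blanks, opening quote */
                continue;     /* still inside the (now merged) literal */
            }
            in_string = 0;
        } else if (!in_string && c == '"') {
            in_string = 1;
        }
        if (o + 1 >= outsz)
            return -1;
        out[o++] = *in++;
    }
    if (outsz == 0)
        return -1;
    out[o] = '\0';
    return 0;
}
```

Fed the LTEXT line shown earlier, this turns the four literals into the single literal "Qualitas 386MAX\nVersion 8.03" and leaves the rest of the line alone.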
A Less Hacky Approach
There would have been a less hacky approach available, at the cost of makefile complication. The RCPP.EXE preprocessor can of course be executed as a standalone tool, and the preprocessed output could then be further rewritten to merge string literals.
Or, one might take advantage of the fact that the resource compiler preprocessor is a C preprocessor, and just use the C compiler to do the preprocessing.
Either approach requires multiple steps (preprocess, postprocess, run resource compiler) but requires no modification of the tools. Using the C compiler to preprocess additionally avoids relying on the internals of the resource compiler.
What Is Even RCPP.EXE?
Raymond Chen says: The Resource Compiler’s preprocessor is not the same as the C preprocessor, even though they superficially resemble each other.
I would rate that claim as misleading. In reality, the resource compiler’s preprocessor is very much a C preprocessor, with minor differences. I should add that the following applies to the Windows 3.1 Resource Compiler, which may not be quite like the newer NT based resource compilers.
A quick look at RCPP.EXE reveals that not only is it very much like a C preprocessor, it is more or less identical to the first phase of a Microsoft C compiler. Which, based on the strings included in it, does a lot more than just preprocessing.
Here is a screenshot of a handful of error messages from the Microsoft Windows 3.1 Resource Compiler preprocessor (RCPP.EXE):
For comparison, here’s a screenshot of error messages from the C1.ERR file corresponding to the first phase (C1.EXE) of the Microsoft C 5.1 compiler:
The similarity is not coincidental, and it is far more than superficial (even though the strings are not identical). Also note that most of the error messages apply to a C language compiler, not just a preprocessor.
I would guess that the Windows 3.1 RCPP.EXE is built from the source code for the first pass of the Microsoft C compiler, circa version 5.1 (other versions have noticeably different error messages, while version 5.1 is a close match). The similarities go far enough that, for example, the command line of the preprocessor child process (C1.EXE/RCPP.EXE) is in both cases passed in an environment variable called MSC_CMD_FLAGS (bypassing the DOS 128 character command line length limit).
It should therefore not be surprising that RCPP.EXE and the C preprocessor behave almost identically. Consider the following:
rcpp -DRC_INVOKED -Ic:\msvc\include -E -g foo.i -f my.rc
cl -DRC_INVOKED -Ic:\msvc\include -E my.rc > foo.i
Both produce nearly identical output, the only difference being slashes and backslashes in #file directives.
As an aside, the RCPP.EXE shipped with the Windows 1.x and Windows 2.x SDKs seems to be a very close relative of the first phase of the Microsoft C 3.0 compiler (P1.EXE); RCPP.EXE is identical between the Windows 1.x and 2.x SDK versions. For Windows 3.0, the preprocessor was upgraded with the one from Microsoft C 5.1 (or something quite close), and stayed unchanged for Windows 3.1.
A Different Solution
Or… instead of massaging the preprocessor output, perhaps there is a way to avoid the problem entirely?
The stringize operator (#) of the C preprocessor can be used to turn preprocessing tokens into a single character string literal. This approach requires separate machinery because the preprocessing tokens must not be string literals; otherwise extraneous double quotes end up in the output.
Using the C (or RC) preprocessor in this manner is not exactly intuitive, and understanding how and why it works requires a fairly deep understanding of the mechanics of the preprocessor. For all intents and purposes, the C preprocessor is a completely different language from C.
Suppose we want to produce a string like “386MAX Version 8.03”. In the original resource file, it was achieved as follows:
#define VER_MAJOR_STR "8"
#define VER_MINOR_STR "03"
#define VERSION VER_MAJOR_STR "." VER_MINOR_STR
...
LTEXT "386MAX Version " VERSION, ...
It’s simple enough, except (as explained above), it doesn’t work because the resource compiler does not concatenate string literals.
The resource compiler compatible version is rather more involved, and the first attempt might look something like this:
#define VER_MAJOR 8
#define VER_MINOR 03
#define VER_MKSTR(s) #s
#define VER_STR(s) VER_MKSTR(s)
#define VER_PV_RC(m, n) VER_PRODVER_RC m.n
#define VER_PVSTR_RC VER_STR(VER_PV_RC(VER_MAJOR, VER_MINOR))
...
#define VER_PRODVER_RC 386MAX Version
...
LTEXT VER_PVSTR_RC, ...
Note that instead of VER_MAJOR_STR, we must use VER_MAJOR. Also note that if we want the version to be displayed as 8.03 rather than 8.3, VER_MINOR must be defined as 03 rather than 3.
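Because the macros are plain C preprocessor constructs, the result can be checked with any C compiler. The sketch below reuses the macros from the resource file and wraps them in two functions whose names (`version_expanded`, `version_unexpanded`) are made up for this illustration. It assumes a standard-conforming C preprocessor. The second function shows why the extra VER_STR level is there: applying # directly stringizes the argument before it has been macro-expanded.

```c
#define VER_MAJOR 8
#define VER_MINOR 03
#define VER_MKSTR(s) #s
#define VER_STR(s) VER_MKSTR(s)
#define VER_PV_RC(m, n) VER_PRODVER_RC m.n
#define VER_PVSTR_RC VER_STR(VER_PV_RC(VER_MAJOR, VER_MINOR))
#define VER_PRODVER_RC 386MAX Version

/* Two levels: the argument of VER_STR is fully expanded first, and only
 * then handed to VER_MKSTR, where # turns it into one string literal. */
const char *version_expanded(void)
{
    return VER_PVSTR_RC;
}

/* One level: # is applied to the argument exactly as written, so the
 * macro names themselves end up in the string. */
const char *version_unexpanded(void)
{
    return VER_MKSTR(VER_PV_RC(VER_MAJOR, VER_MINOR));
}
```

With a conforming preprocessor, the first function should return "386MAX Version 8.03" and the second "VER_PV_RC(VER_MAJOR, VER_MINOR)".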
Therein lies the first pitfall. If we wanted to set the version to 8.08, we might define VER_MINOR as 08. That would work nicely for the resource compiler, but not in C language arithmetic… because 08 is not a valid octal constant, and neither is 09. (If that does not make sense, you just do not know the C language well enough.) It is simple enough to define separate macros (say VER_MINOR_RC) for the preprocessor, should the need arise.
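A small sketch of the split, for a version 8.08. VER_MINOR_RC is the separate macro suggested above; VER_DOT and the function names are invented for this example. The point is that 08 is fine as a preprocessing token to be stringized, while C arithmetic needs the plain 8.

```c
#define VER_MAJOR    8
#define VER_MINOR    8    /* used in C arithmetic; 08 would not compile */
#define VER_MINOR_RC 08   /* only ever stringized, never evaluated */

#define VER_MKSTR(s) #s
#define VER_STR(s) VER_MKSTR(s)
#define VER_DOT(m, n) m.n   /* hypothetical helper: joins major.minor */

/* Numeric form of the version, e.g. for comparisons in C code. */
int version_number(void)
{
    return VER_MAJOR * 100 + VER_MINOR;
}

/* Text form of the version, keeping the leading zero. */
const char *version_text(void)
{
    return VER_STR(VER_DOT(VER_MAJOR, VER_MINOR_RC));
}

/* A leading zero makes an integer constant octal: 010 is eight. */
int leading_zero_is_octal(void)
{
    return 010;
}
```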
There are other pitfalls. Suppose we want to use the company name in the string:
#define VER_PRODVER_RC Qualitas, Inc. 386MAX Version
Now, it just so happens that the above works as users likely expect in the Windows 3.1 Resource Compiler preprocessor, but only because the Microsoft preprocessor is strange.
In more or less any other C preprocessor, the above doesn’t work. The comma causes too many arguments to be passed to the VER_STR macro when processing the VER_PVSTR_RC macro. Some compilers (e.g. Watcom, IBM) warn and throw away the comma and everything past it. Other compilers (Borland) error out and do not accept the input. Still others (gcc) behave differently again, not expanding a function-like macro with too many arguments at all.
The C90 standard (relevant for the Windows 3.1 Resource Compiler) is clear:
The number of arguments in an invocation of a function-like macro shall agree with the number of parameters in the macro definition, and there shall exist a ) preprocessing token that terminates the invocation.
(C90 Standard, section 6.8.3)
It is an error to use a comma in this context. Fortunately there is an easy workaround:
#define VER_PRODVER_RC Qualitas\x2C Inc. 386MAX Version
Instead of a comma character, we can use a hexadecimal escape sequence with the ASCII code for a comma (2Ch) to achieve the desired result.
This workaround takes advantage of the fact that to the preprocessor, \x2C is just a random sequence of four characters (backslash, x, 2, C). Only in a later phase of translation does the escape code get converted into a single character, together with classics like \n or \0.
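The same trick can be tried in C. This is a sketch under the assumption that the C compiler's preprocessor passes the stray backslash through to the stringized result, as the mainstream ones I am aware of do; the function name `product_label` is made up.

```c
#define VER_MAJOR 8
#define VER_MINOR 03
#define VER_MKSTR(s) #s
#define VER_STR(s) VER_MKSTR(s)
#define VER_PV_RC(m, n) VER_PRODVER_RC m.n
#define VER_PVSTR_RC VER_STR(VER_PV_RC(VER_MAJOR, VER_MINOR))

/* No literal comma here, so VER_STR still sees exactly one argument.
 * The \x2C only becomes a comma once the finished string literal is
 * interpreted, after preprocessing is over. */
#define VER_PRODVER_RC Qualitas\x2C Inc. 386MAX Version

const char *product_label(void)
{
    return VER_PVSTR_RC;
}
```

One caveat in C: a hexadecimal escape swallows every hex digit that follows it, so this only works cleanly when the next character is not one (here it is a space).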
Conclusion
At any rate, it is possible to use the preprocessor to produce a single string literal acceptable to the resource compiler. It is not exactly straightforward, primarily because the preprocessor is a rather different beast from the C language proper, but it is doable.
The bottom line is that it is no longer necessary to massage the preprocessor output, and it is certainly not necessary to hack the resource compiler itself to insert an extra processing stage. The unmodified resource compiler now produces the desired output, at the cost of a bit of extra baggage largely hidden away inside one header file.
25 Responses to From the Annals of Preprocessor Hackery
Another workaround would be to put the numbers in hexadecimal 🙂
But yeah, “0 means octal” is one of C’s warts.
The end result looks suspiciously like… what was it called…
ah, SYSEDIT (a program which me’s never found very useful).
No, hex won’t help, because if you do ‘#define VER_MINOR 0x08’ then you will end up with "386MAX Version 8.0x08".
The editor is more or less a regular plain text editor, but I guess it was meant for quick editing of configuration files, at least the way it was shipped.
As an aside, I have seen more than one bug report complaining that this or that C compiler does not handle a number like 0123 correctly. Little do they know…
The hex comment was a joke, sorry.
Even SYSEDIT is basically an MDI version of Notepad, isn’t it?
While the octal convention likely came from the PDP-7 — the 36-bit word
of which was commonly used to store six 6-bit characters –, with two
octal digits representing a character much like two hexadecimal digits
representing an 8-bit character (byte) now, me has no idea why “0” was
chosen as a prefix. (Fast-forward a century or two, and people will
start tripping over the fact that “x” is a valid digit in higher
bases…)
In some implementations of K&R C, octal 8 (08) and octal 9 (09) were accepted as their decimal values. ANSI cleaned that up but broke code in the process. I remember that some compilers allowed keeping the looser K&R syntax.
Yes, octal numbers made sense back in the day, so I can see why they wanted them in C.
Using a leading zero to designate octal numbers was simply a bad choice though, to anyone who knows basic arithmetic it makes no sense that 10 and 010 are different numbers.
Now I almost wonder if they meant to use O instead of zero and someone got it mixed up 🙂
I don’t think ANSI had much choice really, the K&R implementations were divergent enough that not breaking any existing code wasn’t an option. At least in this case the newly unacceptable constants would have been flagged and people could easily fix it.
You could be on to something there: at [0], it is mentioned that a
phone dictation service at Bell Labs returned “i-node” as “eye node”;
something similar may well have happened with “oh” versus “zero”.
[0] https://arstechnica.com./gadgets/2019/08/unix-at-50-it-starts-with-a-mainframe-a-gator-and-three-dedicated-researchers/3/ [1]
[1] Me’d actually swear that meread this elsewhere before, but now
mecan’t find it.
@zeurkous
It is in a paper from Dennis Ritchie, you can find it in the preserved site of Dennis Ritchie at Nokia (Nokia bought Bell Labs).
The system was that you called an extension inside Bell Labs and recorded your voice message, later a secretary transcribe it and the next day was in your desk. The secretary in this instance didn’t know the technical terms so wrote what she(he) understood.
@Fernando: that’s the zeroth place me looked, since me’d also swear it
was there, but even if it is, me’s unable to find the specific paper.
@zeurkous
I misremember it, I was sure that was from that papers, but no, it’s from an interview with Thomson, you can find it here:
https://www.tuhs.org/Archive/Documentation/OralHistory/expotape.htm
Ah yeah, me probably read it there, thanks.
If they’d used “o”/”O” as the octal prefix, we’d just be annoyed that we can’t have variables named “o123” while “a123”, “b123”, etc. is fine.
Ideally it would be consistent with the other base specifiers, i.e. 0x and 0b (currently C++ only). “0o” is a bit difficult to read, but “0c” would work.
The restrictions on variable names in line number BASIC were attributable to limits on memory. C which started on machines with huge amounts of memory should have variables that involve actual words. Variable names comprised of a letter followed by a number is just a recipe for confusion.
It always did seem odd that an alleged transcription error was the reason for having a number instead of a letter for the octal prefix. Enough other errors were corrected over time in the language specification.
According to [0], the memory limits were on the compiler, so there
likely wasn’t room for something like full syntactic separation of
{variable name,literal}s (or at least, without doing it in a clumsy
manner), which would be a more final solution [1].
[0] https://bell-labs.co./who/dmr/primevalC.html
[1] Of course, even now, n years later, that development still hasn’t
happened. “Good enough” is one thing, but as the decades go by, that
position is starting to look a *little* extreme…
As for lack of correction of the “oh” “error”: it likely bit so few
people that it wasn’t deemed worth the effort to fix. Me’ll speculate
that this was to be fixed in a direct successor to C (which never quite
appeared; just like a direct UNIX successor never quite appeared).
@Richard Wells
“C which started on machines with huge amounts of memory should have variables that involve actual words.”
Don’t forget that for many early C compilers, only the first 6 characters of an identifier were “significant”. Even the ANSI (C89) standard doesn’t require more than that, at least when referring to identifiers defined in external translation units.
There are also, of course, mathematical and scientific formulae for which it’s natural to use variable names like “O1” (e.g. where the mathematical notation might be θ1; that’s theta-subscript-1 in case of any Unicode issues). It’s generally a good idea to keep the source code close to the mathematical notation for ease of verification.
I wouldn’t call PDP-11 (where C originated) "a system with huge amount of memory". You would get 64KB for one user if you are lucky.
And six characters limitation was all too real for years after C90 standard was finalised.
It’s a bit harder to explain why C++ retained so many limitations of C, but I guess the idea was to slowly adapt the language and change it, I don’t think Bjarne expected ossification to happen so quickly.
At least early on, there were no C++ compilers, there was Cfront which generated C code. So the limitations were hard to sidestep, because everything had to go through a regular C compiler and a linker. Even much later there were C++ compilers that translated C++ into C.
Agree that the early C machines were quite resource limited, certainly not as much as a PDP-7 but C didn’t assume mainframe class resources.
Most PDP-11 models* had 128K for each user; 64KB for code and 64KB for data. That is much more than the 4K or 16K available to many of the micro line-number BASICs that used variable names with a maximum of 2 characters.
* Okay, there was the PDP-11/03 using the LSI-11 chipset that was limited to a total of 64KB which arrived in 1976, long after C was established. With mini-Unix, that meant 4 KW for I/O addressing, 12 KW for OS, and 16 KW for a user program. C compilers weren’t fitting in that space for long.
According to this classic, you might be overstating the resources available to the early C compilers. The first PDP-11 used in the time when B turned into C only had 24K bytes memory total according to the paper. The PDP-11 models available before circa 1975 had quite limited memory; the late 1970s models were certainly much bigger and could handle megabytes of RAM.
Recognizable Unix and C show up with Fourth Edition Unix which only ran on the 18-bit addressing PDP-11/45. Replacing core memory with MOS or bipolar memory did increase the memory that could be placed in a system and Unix quickly occupied all of that. Unix required the memory manager and the memory manager meant the system had at least 128K.
Are your changes to ease building 386MAX being made public?
They will be when I have something to publish. Unfortunately not yet.
Perhaps a stupid question, but:
Is it even necessary to run the rc preprocessor rather than running the c preprocessor on the rc files?
Stu: Having 0o to indicate octal numbers would be great for object oriented code. (this is a really bad joke) 🙂
Richard Wells: I know that I’m really in minority, but my opinion is that local variables should preferably have meaningless names, like either “a1” or nonsense names like “carrot” or “saxophone”. My reasoning is that it makes it easier to identify that a variable is a local temporary variable rather than something “larger”. Also there is no need to understand what such variables do unless you read the code, and if you read the code you might aswell read it good enough that you understand what the variables do. Kind of.
Also, a super hot take is that loads of code written during the last decade(s) does absolutely nothing. There are tons of glue functions that call glue functions, and functions that combines calling a glue function with adding a parameter to another parameter or something trivial that could had been done at the place that function gets called.
I remember a time when it was actually possible to read code and understand what it does without opening the same source code file in half a dozen windows scrolled to different places to follow the glue function steps… I may exaggerate a bit, but still.