Stack Checking on OS/2
A while ago I was involved in debugging a seemingly simple yet mysterious problem:
A piece of code (a fairly simple interface DLL) built with the Open Watcom compiler was failing with a bogus stack overflow error. The mystery was that this failure only happened on OS/2 Warp Connect. It didn’t happen on OS/2 2.0 or Warp Server for e-Business (WSeB) or MCP2. And it also didn’t happen on Warp Connect updated to FixPack 40.
That’s weird, right? And getting to the bottom of the faulty stack check was a bit of a journey…
The Watcom run-time library checks the stack for overflow at function entry, unless stack checking is disabled with the -s switch. For executable modules, the logic is simple enough. The run-time is in control of initialization and of creating new threads, so it’s aware of which stack is where (with a caveat explained below).
For dynamically linked libraries (DLLs) it’s a bit more involved. The run-time linked into the DLL has no control over what threads already exist in the process when it’s loaded, or what additional threads might be created at runtime.
As it turns out, OS/2 has a handy mechanism that allows each thread to discover where its stack begins and ends. The TIB (Thread Information Block) contains two fields named tib_pstack and tib_pstacklimit, which store the addresses of the stack bottom and top, respectively (NB: The stack starts near the top and grows toward the bottom address). The DosGetInfoBlocks API allows each thread to obtain the address of the TIB.
The Watcom runtime tries to catch the situation where the stack pointer (ESP register) moves below the stack bottom before it happens. The runtime actually does not call DosGetInfoBlocks; instead, it takes advantage of the fact that on process startup, OS/2 loads the FS register with selector 150Bh, and that selector maps the TIB of the current thread. I am not entirely sure where this factoid is documented, but at minimum it is stated in the OS/2 Debugging Handbook – Volume IV System Diagnostic Reference.
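The check described above can be modeled in a few lines of portable C. This is only a sketch, not the Watcom implementation: the model_tib struct is a hypothetical stand-in that borrows the two TIB field names, and stack_would_overflow is an invented helper name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two TIB fields discussed above.
   The real fields are pointers inside the OS/2 TIB; plain integers
   are used here so the sketch builds on any platform. */
typedef struct {
    uintptr_t tib_pstack;      /* stack bottom: lowest valid stack address */
    uintptr_t tib_pstacklimit; /* stack top: the stack grows down from here */
} model_tib;

/* Returns true if reserving `needed` bytes below `sp` would move the
   stack pointer below the stack bottom, i.e. the overflow is reported
   before it actually happens. */
static bool stack_would_overflow(const model_tib *tib, uintptr_t sp, size_t needed)
{
    if (sp < tib->tib_pstack)
        return true; /* already below the bottom */
    /* Compare against the remaining space instead of computing
       sp - needed, which could wrap around for very large requests. */
    return needed > sp - tib->tib_pstack;
}
```

A function prologue would call such a check with the current ESP and the size of the frame it is about to allocate.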
When I started debugging the crashes, I could see that for example on Warp Server for e-Business, the TIB contained exactly the expected values for the stack base and limit (i.e. bottom and top). On Warp Connect, it did not. In fact on Warp Connect, the stack base address looked decidedly wrong. Problem solved?
Well… no. Because although the stack base looked wrong, it was lower than expected. That should have caused the opposite problem—instead of spurious stack overflow errors, a real stack overflow would not be detected. And in fact the stack base that the Watcom runtime stored during initialization was higher than the expected stack bottom. Which explained the bogus errors, but where did that value come from?
As it turns out, the stack base was determined right when the DLL was loaded, in the DLL initialization entry point. And that was exactly the problem.
For reasons that are not entirely clear, on some versions of OS/2 the DosLoadModule API switches to a different stack when running the DLL initializers, and that is reflected in the TIB. I could not quickly find this fact documented anywhere, but it certainly happens at least on Warp Connect.
The difference is that on Warp Connect GA, the DLL initialization stack is at a higher address than the regular thread stack, at least for the failing program. On WSeB and most other OS/2 versions, that is not the case.
This led to a realization that the Watcom DLL run-time cannot reliably obtain the thread stack base during initialization. Or rather it can… but it’s the wrong stack.
The obvious solution is to not store the stack base during initialization but rather get it from the TIB whenever it’s needed. Whether that actually works in practice is currently an open question.
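The difference between the two strategies can be illustrated with a small model. It is a sketch under stated assumptions: model_tib and all function names are hypothetical, and the addresses used in the usage note are invented for illustration rather than taken from a real Warp Connect session.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the TIB stack fields. */
typedef struct {
    uintptr_t tib_pstack;      /* stack bottom */
    uintptr_t tib_pstacklimit; /* stack top */
} model_tib;

/* Strategy 1: remember the stack bottom once, at DLL initialization. */
static uintptr_t cached_stack_base;

static void dll_init_snapshot(const model_tib *tib_during_init)
{
    cached_stack_base = tib_during_init->tib_pstack;
}

static bool overflow_cached(uintptr_t sp)
{
    return sp < cached_stack_base;
}

/* Strategy 2: consult the TIB of the running thread on every check. */
static bool overflow_live(const model_tib *current_tib, uintptr_t sp)
{
    return sp < current_tib->tib_pstack;
}
```

If the initializer runs on a temporary stack at, say, 0x500000 to 0x510000 while the thread's real stack spans 0x20000 to 0x30000, the cached variant flags a perfectly healthy stack pointer such as 0x2F000 as an overflow, whereas the live variant does not.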
What Do Others Do?
While researching the problem, I thought I’d see what the IBM compilers do. Usually that is a reliable way to figure out the “right” way of doing things. The answer was rather unsatisfactory: nothing.
The IBM compiler can (and does by default) generate stack probes which ensure that the stack “guard page” is not skipped and stack memory is committed as needed. But there is no mechanism to check for stack overflows.
Out of desperation I thought I’d check if the old Microsoft 32-bit compiler shipped with OS/2 2.0 pre-releases did anything useful. But it didn’t. Although the run-time does include a __chkstk routine, which performs stack checking in Microsoft’s 16-bit compilers, the 32-bit OS/2 variant only probes the stack in __chkstk and does not make any attempt to guard against stack overflows.
Version Differences, DLL Initialization Stack
As mentioned above, there are version specific behaviors related to loading DLLs. As far as I can establish, if a DLL is loaded at load time (that is, an executable directly imports from a DLL) then the DLL initialization is run using the stack of the main executable’s first (and at that point, only) thread.
If DosLoadModule is used to load DLLs at run-time, OS/2 2.0, 2.1, and even 2.11 SMP simply use the stack of the calling thread. On Warp 3 or Warp Connect GA, as well as Warp 4 GA, DosLoadModule runs the DLL initialization on some kind of custom stack. On WSeB, IBM reverted to the old behavior used in OS/2 2.0.
Not only that, but the old behavior was also restored in Warp 3 and Warp 4 FixPacks. On Warp 4, it changed in FP7 (June 1998). I have not established exactly when the change happened on Warp 3, but it is known that Warp 3 FP40 went back to the old behavior as well.
I could not quickly find any documentation of this change and its eventual reversal. I can only speculate that IBM implemented the change (switching to a separate stack for DosLoadModule) for a reason, but it is likely that the change had undesirable side effects and IBM reverted it again after 2 or 3 years.
Version Differences, Stack Limits
There is another, unrelated change in OS/2 behavior which affects the stack base (that is, the bottom of the stack) reported in the TIB for the main thread.
Note that any secondary threads are started through DosCreateThread and their stack size is explicitly specified as a parameter to the API call. For the main thread (i.e. thread 1) that is not the case.
The initial SS:ESP is specified in the executable header, as is the stack size. By convention, the stack is located in the data segment, past any statically allocated data.
On WSeB and later, the main thread’s TIB reflects the executable header in a very logical manner: The stack top corresponds to the initial SS:ESP, and the stack base is the stack top minus stack size.
However, on older versions (OS/2 2.x as well as Warp 3 and Warp 4 GA) that is not the case. The stack top in the TIB is reported identically, but the stack base is not. The older OS/2 versions appear to completely ignore the stack size in the executable header and set the stack base to the start of the segment that the stack is in.
That is rather undesirable for stack checking: Typically there is data below the stack, which means that using the information from the TIB will not catch overwriting the data segment… which is exactly what the stack checking tries to prevent!
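A small model makes the consequence concrete. The segment_layout struct, the function names, and the numbers in the usage note are all hypothetical; only the two rules for deriving the stack base come from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout of a data segment: static data at the bottom,
   the main thread's stack at the top. */
typedef struct {
    uint32_t segment_base; /* start of the data segment */
    uint32_t stack_top;    /* initial ESP */
    uint32_t stack_size;   /* size the stack was linked with */
} segment_layout;

/* Stack base as reported by the older OS/2 versions: start of the segment. */
static uint32_t base_old(const segment_layout *s)
{
    return s->segment_base;
}

/* Stack base as reported by WSeB and later: stack top minus stack size. */
static uint32_t base_new(const segment_layout *s)
{
    return s->stack_top - s->stack_size;
}

/* Simplified overflow test against a given stack base. */
static bool below_base(uint32_t sp, uint32_t base)
{
    return sp < base;
}
```

With a segment at 0x20000, an initial ESP of 0x30000 and a 0x4000 byte stack, a stack pointer of 0x2A000 is already 0x2000 bytes deep into static data. The check against base_new catches it; the check against base_old does not.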
LX Format Vagaries
When exploring this sub-optimal behavior, a likely reason quickly came to light: Old versions of IBM’s linker (LINK386.EXE) didn’t bother setting the stack size in the executable header at all!
That is, LINK386.EXE always had a /STACK switch which allowed the user to specify the stack size at link time. This influenced the initial ESP location, but the stack size was not recorded in the executable header at all. It was always set to zero.
Older OS/2 versions therefore didn’t even bother looking at the stack size in the executable LX header and always set the stack base in the TIB to the bottom of the segment the stack was in. This was more or less guaranteed to be wrong, but it was probably the most reasonable guess OS/2 could make.
The odd behavior can be traced to the tooling change from Microsoft to IBM and switching from the LE to LX format. Microsoft planned to change the LE format and OS/2 so that even the main thread’s stack could use the guard page mechanism, which required knowing exactly how big the stack was.
But this change probably never materialized. There are old versions of the LX format specification which do not show the stack size field (a DWORD at offset 0ACh) in the LX header at all. Newer versions of the LX specification show the stack size field and make no mention that it might not be there, but also clearly state that the stack size in the LX header may be zero (in which case the stack size cannot be determined).
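For anyone who wants to inspect executables for this field, a minimal sketch of reading it might look as follows. It assumes the caller has already located the LX header within the file and that the module uses little-endian byte order; the function name and return convention are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset of the stack size DWORD within the LX header, as given above. */
#define LX_STACKSIZE_OFFSET 0xACu

/* Reads the stack size field from a buffer holding an LX header.
   Assumes little-endian byte order. Returns 0 on success, -1 if the
   buffer is too short or does not start with the "LX" signature. */
static int lx_read_stack_size(const unsigned char *hdr, size_t len, uint32_t *out)
{
    const unsigned char *p;

    if (len < LX_STACKSIZE_OFFSET + 4)
        return -1;
    if (hdr[0] != 'L' || hdr[1] != 'X')
        return -1;

    p = hdr + LX_STACKSIZE_OFFSET;
    *out = (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
    return 0;
}
```

A result of zero means the linker did not record the stack size, which matches what old LINK386 versions produced.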
Likewise the EXE386.H header in the OS/2 2.0 Toolkit (March 1992) has no stack size member in the e32_exe struct. In the OS/2 2.1 Toolkit (March 1993) the e32_stacksize member is present and the number of reserved bytes is reduced accordingly.
The behavior of LINK386 changed as well, and at some point LINK386 started recording the stack size in the executable header. Changing OS/2 itself took longer, and happened between the Warp 4 (1996) and WSeB (1999) releases. The new behavior is arguably how things should have always been, except the original LX header definition and the early 32-bit linker didn’t allow it to work.
On OS/2 Warp 4, FP6 (March 1998) introduced the exact stack size reporting in TIB that later appeared in WSeB. The loader looks at the reported stack size and uses it to calculate the exact stack base. If the executable reports zero stack size in the LX header, the loader reverts to the earlier behavior and uses the base of the segment holding the stack instead.
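The rule the newer loader appears to follow can be written down as a single function. This is a model of the observed behavior, not IBM's code; the function and parameter names are invented.

```c
#include <stdint.h>

/* Models how the newer loader seems to derive the main thread's stack
   base for the TIB. A zero stack size in the LX header means "unknown",
   so the base of the segment holding the stack is used instead. */
static uint32_t tib_stack_base(uint32_t segment_base,
                               uint32_t initial_esp,
                               uint32_t lx_stack_size)
{
    if (lx_stack_size == 0)
        return segment_base;            /* pre-FP6 behavior */
    return initial_esp - lx_stack_size; /* exact stack base */
}
```

For a segment at 0x20000 with an initial ESP of 0x30000, a recorded stack size of 0x8000 yields a base of 0x28000, while a recorded size of zero yields 0x20000.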
It should be noted that Watcom’s own wlink also didn’t set the stack size in the LX header for a long time, probably not until version 11.0c. That is unsurprising since Watcom developed OS/2 2.0 support quite early (before the stack size was defined in the LX format), and even though the stack size was present in the LX header since OS/2 2.1, it continued to be ignored by the OS until after the release of Warp 4.
Why Even Bother?
For secondary threads, running out of stack space is normally accompanied by crashes whose source may be more or less easily recognized. The problem child is the main, and often only, thread of a program.
By convention, stack is located at the very top of the data segment, growing downwards. The problem is that at the bottom of the segment, there’s static initialized and uninitialized data, and it’s all writable memory. If the stack erroneously grows downwards beyond its preset limits, it will corrupt some innocent data. Such a problem may be extremely difficult to debug, especially if the stack does not grow uncontrollably (unlike cases of infinite recursion, when the program will crash soon enough) and only causes “minor” corruption, and only occasionally.
Sure, programmers can use a big stack, but no stack is big enough to guard against all possible programming errors. That is why stack checking is useful, because it can prevent a type of error that is otherwise often very difficult (and expensive) to diagnose.
But doing it correctly on OS/2 is clearly not trivial, especially when DLLs are involved!