Solaris 2.6, 7, and 8 Crashes on Pentium 4 and Later
A blog reader recently pointed to an interesting problem which affects older Solaris releases on certain systems. The symptoms (crash/reboot) may at first glance look like the previously described problem which affected Solaris 2.5.1 and 2.6, but both the cause and the set of affected systems are different.
[Image: Solaris 8 MCE Trap]
When systems based on the Pentium 4 started appearing in the early 2000s, users of several then-recent versions of Intel editions of Solaris discovered that Solaris could not be successfully booted (or installed) on Pentium 4 processors. The affected versions were Solaris 2.6 (1997), Solaris 7 (1998) and Solaris 8 (2000). On the other hand, Solaris 2.5.1 (1996) and older continued working; Solaris 9 (2003) was never affected.
The problem manifested itself as a “BAD TRAP” panic very early in the boot, often but not always accompanied by a triple fault/reboot. There was no easy way to avoid the problem, but there was a workaround which required a little bit of typing, and which was available thanks to the very helpful Solaris kernel debugger. Because the kernel debugger was available even on the installation media, it was entirely possible to engage the workaround, install the OS, and then patch the kernel.
The cause of the problem was somewhat careless coding on the part of Solaris kernel developers, combined with Intel’s ever-changing MSR (Model-Specific Register) implementation. Solaris 2.6 was the first release to add support for Intel MCA, or Machine Check Architecture.
The Bug
Intel’s MCA (Machine Check Architecture), introduced in the Pentium Pro (P6 microarchitecture), was an attempt to give an OS a chance to do something about hardware errors which were serious but not necessarily immediately fatal: parity errors, ECC failures, bus errors, cache problems. The CPU would generate a Machine Check Exception (MCE), a very high priority exception that the OS could handle, at least recording the event even if it wasn’t safe to continue.
MCA itself was an extension of the earlier MCE (Machine Check Exception) mechanism introduced in the Intel Pentium processor. MCA was essentially a more advanced superset of MCE.
As often happens with such features, MCA could easily cause more problems than it solved. Solaris 2.6/7/8 contained code to set up MCA on the P6 family of processors. If the CPU reported the MCA and MCE feature bits in CPUID, Solaris would run a setup_mca() routine early in the kernel start-up sequence (or during processor initialization for secondary processors).
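As a rough illustration of the feature check just described, the following Python sketch decodes the two relevant bits from a CPUID leaf 1 EDX value. The bit positions (7 for MCE, 14 for MCA) follow Intel’s CPUID definition; the function names and the sample value are chosen here purely for illustration, and this is a model of the condition, not the actual Solaris code.

```python
# Bit positions in CPUID leaf 1, register EDX.
MCE_BIT = 7    # Machine Check Exception
MCA_BIT = 14   # Machine Check Architecture


def has_feature(edx: int, bit: int) -> bool:
    """Return True if the given feature bit is set in EDX."""
    return bool((edx >> bit) & 1)


def would_run_setup_mca(edx: int) -> bool:
    """Model of the condition described in the text: both MCE and MCA reported.

    Hypothetical helper; the real kernel may have checked more than this.
    """
    return has_feature(edx, MCE_BIT) and has_feature(edx, MCA_BIT)


if __name__ == "__main__":
    sample_edx = 0xBFEBFBFF  # illustrative EDX value with both bits set
    print(would_run_setup_mca(sample_edx))
```

With both bits set the model reports that the setup routine would run; clearing either bit makes it return False.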
The routine worked on P6 family CPUs as designed, but broke on the Pentium 4 (and later Intel CPUs) because Intel slightly changed the layout of the MCA MSRs. The code in the Solaris kernel was supposed to take newer CPUs into account but due to a coding error it didn’t. On the Pentium 4, it would attempt an invalid MSR write and cause a #GP fault, which would panic the system.
The problem was of course fixed in updated Solaris releases. Solaris 8 Update 5 (officially designated as Solaris 8 7/01) was able to boot on a Pentium 4, as were later Solaris 8 updates. For earlier releases, patch 108529-08 corrected the problem, but installing it on a Pentium 4 system of course required kadb trickery as described above.
Sun’s bug numbers for the problem were 4408508, “setup_mca() has extra, faulty indirection; causes panic” and 4414557, “setup_mca: MSR definitions incorrect for Pentium 4, can’t boot”.
It’s not currently known whether any official patches were available for Solaris 2.6 or 7. Again, kadb patching worked on those releases.
This bug falls into the “no amount of testing would have caught that” category. There was simply no problem on the CPUs available at the time Solaris 2.6/7/8 were released, and only the newer Pentium 4 processor exposed the bug.
Note that the exact behavior depends on the specific CPU model and Solaris version. For example, Solaris 2.6 crashes on a Pentium 4 M while Solaris 8 FCS does not. On the other hand, Solaris 2.6 does not crash on a Core i7 (well, not because of MCE MSRs) while Solaris 8 FCS does. In both cases, the crash is caused by the setup_mca() routine and the resolution is the same.
The Workaround
What to do if Solaris 8 before U5 needs to be installed on a Pentium 4 or later system, or an older Solaris version for which no patch is available needs to be moved to newer hardware?
Fortunately, the Solaris kadb debugger makes it possible to patch the kernel and avoid the crash, either on an installed system or on the installation media. It is thus possible to install the unpatched OS onto an affected system and patch it after installation. The workaround is as follows:
- On the boot prompt, enter
  b kadb -d
  This will load kadb and break into the debugger (the -d option) before the kernel starts executing.
- On the kadb[0] prompt, enter
  setup_mca/w c3
  This will patch a RET instruction at the beginning of the setup_mca routine and prevent the buggy function from running.
- Enter
  :c
  to continue execution and boot/install the OS.
These steps need to be performed on every boot until the OS is patched. A sample invocation from Solaris 8 FCS is shown here:
[Image: Solaris 8 MCE Trap Workaround]
Once the system is booted, it’s possible to either install an official Solaris patch or apply the workaround permanently to the installed kernel. To permanently patch the kernel, run the following at the command prompt (with superuser privileges):
echo 'setup_mca?w c3' | adb -w /platform/i86pc/kernel/unix
This is the equivalent of the kadb runtime patch (adb is the older userland sibling of kadb), except it modifies the installed OS kernel file on disk.
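Both forms of the patch come down to the same edit: the start of setup_mca is overwritten with opcode C3, the x86 near RET, so the function returns immediately. A minimal Python sketch of that edit on an in-memory byte buffer follows. The buffer contents and the offset are made up for illustration, and the sketch assumes that the debugger’s w command stores a 16-bit little-endian value, so that c3 lands as the bytes C3 00.

```python
import struct

RET_OPCODE = 0xC3  # x86 near return


def patch_ret(image: bytearray, offset: int) -> None:
    """Overwrite two bytes at offset with the 16-bit value 0x00C3.

    Assumption: this mirrors a 2-byte little-endian write, so the first
    byte becomes RET and the following byte becomes 0x00 (never executed).
    """
    struct.pack_into("<H", image, offset, RET_OPCODE)


if __name__ == "__main__":
    # Made-up stand-in for kernel text: four NOP bytes, then a typical
    # function prologue (push ebp; mov ebp, esp) at hypothetical offset 4.
    image = bytearray(b"\x90\x90\x90\x90\x55\x8b\xec\x83")
    setup_mca_offset = 4  # hypothetical; a debugger resolves this from the symbol table
    patch_ret(image, setup_mca_offset)
    print(image.hex())
```

Everything after the first patched byte is unreachable once the function returns at once, which is why a single opcode is enough to neutralize the routine.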
Virtual Systems
The problem may be visible in virtualized systems if they expose enough of the machine check exception MSRs—and if Solaris would crash were it to run on the host system directly; for example on AMD Opteron systems there is no problem.
VirtualBox 4.3 is one such hypervisor. On the one hand, it’s nice that such a problem can be examined in a VM... on the other hand it’s a bit inconvenient that OS bugs are exposed.
The guest OS can of course be patched as above. The alternative is not exposing the MCA/MCE CPUID bits; since the guest OS will never see any machine checks, it doesn’t need to be ready to receive them. Clearing either the MCA or MCE bit should do the trick for Solaris; at least clearing only the MCA bit is known to work.
The MCA bit is bit 14 in register EDX in CPUID leaf 1. In VirtualBox, one might for example run VBoxManage list hostcpuids to query the host’s CPUID information, clear bit 14 in the last doubleword (EDX) of leaf 1 (that’s the second leaf), and tweak the VM. Supposing the host’s EDX value is bfebfbff, the guest needs to see bfebbbff instead:
VBoxManage modifyvm MySolarisVM --cpuidset 1 000206a7 06100800 1fbae3ff bfebbbff
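The arithmetic behind that value can be checked with a short Python sketch. It clears bit 14 of the example host EDX value and assembles the same argument list as the command above; the helper names are invented for this illustration, and the EAX/EBX/ECX values are simply the ones from the example command, which will differ on other hosts.

```python
MCA_BIT = 14  # MCA feature flag in CPUID leaf 1, EDX


def clear_bit(value: int, bit: int) -> int:
    """Return value with the given bit cleared, kept to 32 bits."""
    return value & ~(1 << bit) & 0xFFFFFFFF


def cpuidset_args(vm_name, leaf, eax, ebx, ecx, edx):
    """Build the argument list for the VBoxManage invocation shown above."""
    regs = [format(r, "08x") for r in (eax, ebx, ecx, edx)]
    return ["VBoxManage", "modifyvm", vm_name, "--cpuidset", format(leaf, "x")] + regs


if __name__ == "__main__":
    host_edx = 0xBFEBFBFF                      # example host value from the text
    guest_edx = clear_bit(host_edx, MCA_BIT)
    print(format(guest_edx, "08x"))            # bfebbbff
    print(" ".join(cpuidset_args("MySolarisVM", 1,
                                 0x000206A7, 0x06100800, 0x1FBAE3FF, guest_edx)))
```

Bit 14 corresponds to 0x4000, so only the fourth hex digit from the right changes, from f to b.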
Et voilà, unpatched Solaris 8 boots again:
[Image: Solaris 8 FCS Install]
Not as fun as using a kernel debugger on an OS installation CD, but just as effective.
2 Responses to Solaris 2.6, 7, and 8 Crashes on Pentium 4 and Later
Did you get Solaris 2.6 working on an Intel Core i7 at all?
In a VM yes, on bare hardware no — because I was not going to install it anyway. I believe with some judicious kadb patching it should be possible though.