@nathanaelhuffman
Description
Here's a snippet from the cosmo_seq ringbuf that shows the issue. I've included everything from the initial power-on to set the baseline.
TOTAL VARIANT
13942 RegStateValues
534 ContinueBitstreamLoad
112 CPUPresent
3 SetState(InitialPowerOn)
1 SetState(Overheat)
1 FpgaInit
1 WaitForDone
1 Programmed
1 Startup
1 SequencerInterrupt
1 Coretype
1 PmbusAlert
NDX LINE GEN COUNT PAYLOAD
...
62 468 6 1 SetState { prev: Some(A0), next: A0PlusHP, why: InitialPowerOn, now: 0x231fa }
63 442 6 97 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Done) }, seq_raw_status: SeqRawStatusView { hw_sm: 0xe }, nic_api_status: NicApiStatusView { nic_sm: Ok(Done) }, nic_raw_status: NicRawStatusView { hw_sm: 0x6 } }
64 442 6 1 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Done) }, seq_raw_status: SeqRawStatusView { hw_sm: 0xe }, nic_api_status: NicApiStatusView { nic_sm: Ok(NicReset) }, nic_raw_status: NicRawStatusView { hw_sm: 0x6 } }
65 442 6 171 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Done) }, seq_raw_status: SeqRawStatusView { hw_sm: 0xe }, nic_api_status: NicApiStatusView { nic_sm: Ok(Done) }, nic_raw_status: NicRawStatusView { hw_sm: 0x6 } }
66 741 6 1 SequencerInterrupt { our_state: A0PlusHP, seq_state: Ok(Done), ifr: IfrView { fanfault: false, thermtrip: false, smerr_assert: false, a0mapo: false, nicmapo: false, amd_pwrok_fedge: false, amd_rstn_fedge: false, fan_central_hsc_alert: false, fan_east_hsc_alert: false, fan_west_hsc_alert: false, ibc_alert: false, m2_hsc_alert: false, nic_hsc_alert: false, v12_ddr5_abcdef_hsc_alert: false, v12_ddr5_ghijkl_hsc_alert: false, v12_mcio_a0hp_hsc_alert: false, main_hsc_alert: false, vr_v1p8_sys_to_fpga1_alert: false, vr_v3p3_sys_to_fpga1_alert: false, vr_v5p0_sys_to_fpga1_alert: false, v0p96_nic_to_fpga1_alert: false, pwr_cont1_to_fpga1_alert: true, pwr_cont2_to_fpga1_alert: false, pwr_cont3_to_fpga1_alert: false } }
67 781 6 1 PmbusAlert { now: 0x64b4d }
68 442 6 56 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Done) }, seq_raw_status: SeqRawStatusView { hw_sm: 0xe }, nic_api_status: NicApiStatusView { nic_sm: Ok(Idle) }, nic_raw_status: NicRawStatusView { hw_sm: 0x0 } }
69 468 6 1 SetState { prev: Some(A0PlusHP), next: A2, why: Overheat, now: 0x72435 }
70 442 6 1 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Disabling) }, seq_raw_status: SeqRawStatusView { hw_sm: 0xf }, nic_api_status: NicApiStatusView { nic_sm: Ok(Idle) }, nic_raw_status: NicRawStatusView { hw_sm: 0x0 } }
71 442 6 2 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Idle) }, seq_raw_status: SeqRawStatusView { hw_sm: 0x0 }, nic_api_status: NicApiStatusView { nic_sm: Ok(Idle) }, nic_raw_status: NicRawStatusView { hw_sm: 0x0 } }
The sequence goes like this:
- At NDX 64, we're up and happy in A0HP, and the nic domain is up.
- The 12V rail droops due to a BMR491 dropout (mitigated in newer code than was running on this target).
- The 12V rail droop causes some PMBus alerts, and we expect the T6's MAX5970 to hit UVLO. The PMBus alert is seen at NDX 66/67.
- At NDX 68, it is clear that the FPGA has MAPO'd the nic domain, as we're back in IDLE. I expect this to have set the nic MAPO IFR, but we see no indication of that in this trace. See https://github.com/oxidecomputer/quartz/blob/11f53ccfd9bcab5d823a8f4f51bd3b183481b999/hdl/projects/cosmo_seq/sequencer/sims/sp5_seq_sim_tb.vhd#L77 and https://github.com/oxidecomputer/quartz/blob/11f53ccfd9bcab5d823a8f4f51bd3b183481b999/hdl/projects/cosmo_seq/sequencer/sims/sp5_seq_sim_pkg.vhd#L176 for FPGA sims confirming this functionality at the FPGA.
- The NIC domain MAPO prevents us from talking to the T6 temp sensor. This is interlocked in hardware so that we can't issue I2C transactions to an unpowered device. The result is that any attempted read will be a NACK.
- We're in this state long enough to have 56 log entries, but since Hubris thinks we're still in A0HP, we continue polling the T6 temp sensor for its temperature. @sdonnan points out that this code artificially increases the temperature a little on each missed read.
- Finally, we sit here long enough that the fake T6 temperature exceeds the limit, and we move to A2 due to an "Overheat", although in this case the overheat is entirely fake.
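
To make the failure mode in the last three steps concrete, here is a minimal sketch of how inflating an estimate on every NACK eventually trips a limit. It is not the Hubris thermal code: `Estimator`, `Rail`, `Read`, `polls_until_trip`, and every number below are hypothetical, and temperatures are in tenths of a degree to keep the arithmetic in integers. `Rail` stands for what the firmware *believes* about the nic domain, which in this trace stays "powered" because no nic MAPO interrupt was observed.

```rust
/// Result of one attempted sensor read (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Read {
    /// A real reading, in tenths of a degree C.
    Value(i32),
    /// The device did not acknowledge the transaction.
    Nack,
}

/// What the firmware believes about the sensor's power domain (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Rail {
    Powered,
    Off,
}

/// Hypothetical worst-case temperature estimator for a single sensor.
struct Estimator {
    estimate: i32, // current estimate, tenths of a degree C
    step: i32,     // amount added per missed read (assumed value)
    limit: i32,    // overheat threshold (assumed value)
}

impl Estimator {
    /// Processes one poll and returns true if the estimate now exceeds the limit.
    fn poll(&mut self, rail: Rail, read: Read) -> bool {
        match (rail, read) {
            // A real reading always replaces the estimate.
            (_, Read::Value(t)) => self.estimate = t,
            // Believed powered but no answer: assume the worst and inflate.
            (Rail::Powered, Read::Nack) => self.estimate += self.step,
            // Known unpowered: a NACK is expected, so leave the estimate alone.
            (Rail::Off, Read::Nack) => {}
        }
        self.estimate > self.limit
    }
}

/// Feeds NACKs until the estimator trips; returns the number of polls it took,
/// or None if it never trips within `max_polls`.
fn polls_until_trip(est: &mut Estimator, rail: Rail, max_polls: u32) -> Option<u32> {
    for n in 1..=max_polls {
        if est.poll(rail, Read::Nack) {
            return Some(n);
        }
    }
    None
}
```

With the assumed values `estimate: 600, step: 10, limit: 800`, a sensor that is believed powered trips after a bounded number of NACKs with no real temperature change, while the same sequence with `Rail::Off` never trips. That is the behavior one would want if the nic MAPO were visible to the firmware, which is why the missing IFR indication matters here.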