On Cosmo, hubris misses NIC_MAPO leading to a fake "Overheat" #2384

@nathanaelhuffman

Description

Here's a snip from the cosmo_seq that shows the issue. I've included everything from the initial power-on to set the baseline.

 TOTAL VARIANT
 13942 RegStateValues
 534 ContinueBitstreamLoad
 112 CPUPresent
 3 SetState(InitialPowerOn)
 1 SetState(Overheat)
 1 FpgaInit
 1 WaitForDone
 1 Programmed
 1 Startup
 1 SequencerInterrupt
 1 Coretype
 1 PmbusAlert
 NDX LINE GEN COUNT PAYLOAD
...
 62 468 6 1 SetState { prev: Some(A0), next: A0PlusHP, why: InitialPowerOn, now: 0x231fa }
 63 442 6 97 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Done) }, seq_raw_status: SeqRawStatusView { hw_sm: 0xe }, nic_api_status: NicApiStatusView { nic_sm: Ok(Done) }, nic_raw_status: NicRawStatusView { hw_sm: 0x6 } }
 64 442 6 1 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Done) }, seq_raw_status: SeqRawStatusView { hw_sm: 0xe }, nic_api_status: NicApiStatusView { nic_sm: Ok(NicReset) }, nic_raw_status: NicRawStatusView { hw_sm: 0x6 } }
 65 442 6 171 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Done) }, seq_raw_status: SeqRawStatusView { hw_sm: 0xe }, nic_api_status: NicApiStatusView { nic_sm: Ok(Done) }, nic_raw_status: NicRawStatusView { hw_sm: 0x6 } }
 66 741 6 1 SequencerInterrupt { our_state: A0PlusHP, seq_state: Ok(Done), ifr: IfrView { fanfault: false, thermtrip: false, smerr_assert: false, a0mapo: false, nicmapo: false, amd_pwrok_fedge: false, amd_rstn_fedge: false, fan_central_hsc_alert: false, fan_east_hsc_alert: false, fan_west_hsc_alert: false, ibc_alert: false, m2_hsc_alert: false, nic_hsc_alert: false, v12_ddr5_abcdef_hsc_alert: false, v12_ddr5_ghijkl_hsc_alert: false, v12_mcio_a0hp_hsc_alert: false, main_hsc_alert: false, vr_v1p8_sys_to_fpga1_alert: false, vr_v3p3_sys_to_fpga1_alert: false, vr_v5p0_sys_to_fpga1_alert: false, v0p96_nic_to_fpga1_alert: false, pwr_cont1_to_fpga1_alert: true, pwr_cont2_to_fpga1_alert: false, pwr_cont3_to_fpga1_alert: false } }
 67 781 6 1 PmbusAlert { now: 0x64b4d }
 68 442 6 56 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Done) }, seq_raw_status: SeqRawStatusView { hw_sm: 0xe }, nic_api_status: NicApiStatusView { nic_sm: Ok(Idle) }, nic_raw_status: NicRawStatusView { hw_sm: 0x0 } }
 69 468 6 1 SetState { prev: Some(A0PlusHP), next: A2, why: Overheat, now: 0x72435 }
 70 442 6 1 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Disabling) }, seq_raw_status: SeqRawStatusView { hw_sm: 0xf }, nic_api_status: NicApiStatusView { nic_sm: Ok(Idle) }, nic_raw_status: NicRawStatusView { hw_sm: 0x0 } }
 71 442 6 2 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Idle) }, seq_raw_status: SeqRawStatusView { hw_sm: 0x0 }, nic_api_status: NicApiStatusView { nic_sm: Ok(Idle) }, nic_raw_status: NicRawStatusView { hw_sm: 0x0 } }

The sequence goes like this:

  1. At NDX 64, we're up and happy: in A0HP, with the NIC domain up.
  2. The 12V rail droops due to a BMR491 dropout (mitigated in newer code than was running on this target).
  3. The 12V rail droop causes some PMBus alerts, and we expect the T6's MAX5970 to hit UVLO. The PMBus alert is seen at NDX 66/67.
  4. At NDX 68, it is clear that the FPGA has MAPO'd the NIC domain, since nic_sm is back in Idle. I expect this should have set the NIC MAPO IFR bit, but we see no indication of that in this trace. See https://github.com/oxidecomputer/quartz/blob/11f53ccfd9bcab5d823a8f4f51bd3b183481b999/hdl/projects/cosmo_seq/sequencer/sims/sp5_seq_sim_tb.vhd#L77 and https://github.com/oxidecomputer/quartz/blob/11f53ccfd9bcab5d823a8f4f51bd3b183481b999/hdl/projects/cosmo_seq/sequencer/sims/sp5_seq_sim_pkg.vhd#L176 for FPGA sims confirming this functionality at the FPGA.
  5. NIC domain MAPO prevents us from talking to the T6 temp sensor. This is interlocked in hardware so that we can't issue I2C transactions to an unpowered device. As a result, any attempted read will be NACKed.
  6. We're in this state long enough to log 56 entries, but since hubris thinks we're still in A0HP, we continue polling the T6 temp sensor. @sdonnan points out that this code artificially increases the temperature a little bit on each missed read.
  7. Finally, we sit here long enough that the fake T6 temperature exceeds the limit, and we move to A2 due to an "Overheat", although in this case it's totally fake.
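
To make steps 5 through 7 concrete, here is a minimal Rust sketch of the failure mode. It is not the actual hubris thermal code: the type and function names (NicSm, Reading, read_t6, first_overheat_poll), the per-miss increment, and the limit are all hypothetical and chosen only for illustration. The sketch assumes that a NACKed read bumps the temperature estimate by a fixed step, and it shows what changes if the poller skips the sensor whenever the NIC domain is reported as Idle.

```rust
/// NIC power-domain state as reported by the sequencer FPGA
/// (`nic_sm` in the trace). Only the two states relevant here are modeled.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum NicSm {
    Idle,
    Done,
}

/// Result of one attempted I2C read of the T6 temperature sensor.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Reading {
    Celsius(f32),
    Nack,
}

/// Hypothetical stand-in for the hardware interlock: the sensor only
/// answers while the NIC domain is up, otherwise the read is NACKed.
pub fn read_t6(nic: NicSm, actual_c: f32) -> Reading {
    match nic {
        NicSm::Done => Reading::Celsius(actual_c),
        NicSm::Idle => Reading::Nack,
    }
}

/// Walks a sequence of NIC states, one poll per entry, and returns the
/// index of the first poll at which the estimate exceeds `limit_c`.
///
/// Assumption: each missed read adds `step_c` to the estimate. When
/// `skip_when_unpowered` is true, the poller does not touch the sensor
/// while the NIC domain is Idle, so no missed reads accumulate.
pub fn first_overheat_poll(
    nic_states: &[NicSm],
    actual_c: f32,
    step_c: f32,
    limit_c: f32,
    skip_when_unpowered: bool,
) -> Option<usize> {
    let mut estimate_c = actual_c;
    for (i, &nic) in nic_states.iter().enumerate() {
        if skip_when_unpowered && nic == NicSm::Idle {
            continue;
        }
        match read_t6(nic, actual_c) {
            Reading::Celsius(t) => estimate_c = t,
            Reading::Nack => estimate_c += step_c,
        }
        if estimate_c > limit_c {
            return Some(i);
        }
    }
    None
}
```

With made-up numbers (a steady 60 C sensor, a 0.5 C step, a 70 C limit), a run of Idle polls trips the limit after 21 consecutive NACKs even though the real temperature never changed. With skip_when_unpowered set, the same input never trips. The real increment and limit in hubris may differ; only the shape of the problem is illustrated here.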
