Disabled the setting reboot.host.and.alert.management.on.heartbeat.timeout by default #10111

Draft
slavkap wants to merge 1 commit into apache:main from storpool:do-not-reboot-host-on-heartbeat-timeout

Conversation

Contributor

slavkap commented Dec 16, 2024

Description

This PR disables the setting `reboot.host.and.alert.management.on.heartbeat.timeout` by default. With the setting enabled, CloudStack reboots the host when there is a storage issue, even if high availability isn't enabled.
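
To illustrate the effect of the new default, here is a minimal sketch, not the actual CloudStack code. The class `HeartbeatTimeoutPolicy` and its methods are hypothetical; only the setting name comes from this PR. It assumes the setting is read as a plain boolean property, and that a missing key falls back to `false`, the default proposed here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration only; this class does not exist in CloudStack.
public class HeartbeatTimeoutPolicy {
    // Setting name taken from this PR.
    static final String KEY = "reboot.host.and.alert.management.on.heartbeat.timeout";

    // Returns true only when the setting is explicitly enabled.
    // A missing key falls back to "false", the default proposed in this PR.
    static boolean shouldRebootOnHeartbeatTimeout(Properties props) {
        return Boolean.parseBoolean(props.getProperty(KEY, "false"));
    }

    // Parses properties-file text (key=value lines) into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Setting absent: the host is not rebooted on heartbeat timeout.
        System.out.println(shouldRebootOnHeartbeatTimeout(parse("")));            // prints false
        // Setting explicitly enabled: the previous behaviour is kept.
        System.out.println(shouldRebootOnHeartbeatTimeout(parse(KEY + "=true"))); // prints true
    }
}
```

Operators who rely on the reboot behaviour would need to set the value to `true` explicitly after upgrading.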

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

GutoVeronezi reacted with heart emoji

codecov bot commented Dec 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 15.12%. Comparing base (a2f2e87) to head (79a5f78).

Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #10111      +/-   ##
============================================
- Coverage     15.13%   15.12%   -0.01%
+ Complexity    11268    11262       -6
============================================
  Files          5408     5408
  Lines        473867   473867
  Branches      57778    57778
============================================
- Hits          71700    71684      -16
- Misses       394165   394185      +20
+ Partials       8002     7998       -4

Flag        Coverage Δ
uitests     4.30% <ø> (ø)
unittests   15.84% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Contributor

@slavkap, have you tested this with HA enabled?

Contributor

@slavkap
can you start a discussion on the dev/user mailing list?

This changes the current behaviour.
IMHO, if there are no objections, we could merge it in 4.21 (the next major release), but not in 4.20/4.19.

`reboot.host.and.alert.management.on.heartbeat.timeout` has to be
disabled. Even when high availability isn't enabled, CloudStack will
reboot the host when there is an issue with storage.

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@slavkap slavkap changed the base branch from 4.19 to main December 17, 2024 09:52
Contributor Author

slavkap commented Dec 17, 2024

@DaanHoogland, I've tested this with and without HA
@weizhouapache, sure, I'll start a discussion for this

@slavkap slavkap marked this pull request as draft December 17, 2024 09:56
@DaanHoogland DaanHoogland changed the title from "Disabled the setting do-not-reboot-host-on-heartbeat-timeout to not reboot a host on heartbeat timeout" to "Disabled the setting reboot.host.and.alert.management.on.heartbeat.timeout by default" Jan 8, 2025
Contributor

@slavkap, I changed the title. Hope you don't mind; it was a bit confusing to me.
Are you still looking into this?

Contributor Author

slavkap commented Jan 10, 2025

@DaanHoogland, I don't mind the change, thanks!
Yes, I opened a discussion on the mailing list for this.

DaanHoogland reacted with thumbs up emoji

@DaanHoogland DaanHoogland modified the milestones: 4.19.2, 4.19.3 Feb 3, 2025
Contributor

moved forward

@slavkap slavkap modified the milestones: 4.19.3, 4.21.0 Feb 3, 2025
Contributor Author

slavkap commented Feb 3, 2025

@DaanHoogland, I rebased it on main, since @weizhouapache suggested possibly merging it in a major release.

DaanHoogland reacted with thumbs up emoji


boubouX commented Mar 28, 2025

We experienced the unfortunate event of this issue, causing cascading reboots of all our hosts while the NFS server had no running VM. It was an operational nightmare that resulted in approximately 45 minutes of downtime. Changing its default value to false offers us more gain than loss. We adjusted it to our settings; thank you, Wei. This was simply catastrophic!

hanisirfan reacted with thumbs up emoji


As someone who works with VMware products, I have never seen a host reboot because a datastore was inaccessible. I believe changing the default for CloudStack to "false" is a great move.

Contributor

@blueorangutan package


@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.


Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 13621


Packaging result [SF]: ✖️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 13671


Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13677

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

Contributor

@weizhouapache weizhouapache left a comment

code lgtm

Contributor

@sureshanaparti, I think we can merge this one, pending smoke tests. But it merits a note in the release notes for the next version.

weizhouapache reacted with thumbs up emoji


[SF] Trillian test result (tid-13502)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 55426 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10111-t13502-kvm-ol8.zip
Smoke tests completed. 141 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File


@slavkap Since this is for the 4.22.1 release, could you retarget the PR to the 4.22 branch?

Reviewers

weizhouapache approved these changes

Assignees

No one assigned

Projects

Status: In Progress

Milestone

4.22.1
