Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Simplify PCI code #5426

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
bchalios wants to merge 2 commits into firecracker-microvm:main
base: main
Choose a base branch
Loading
from bchalios:pci_refactoring

Conversation

Copy link
Contributor

@bchalios bchalios commented Sep 2, 2025
edited
Loading

This is the first in a series of PRs that are meant to simplify our PCI codebase

Changes

Mainly two changes:

  • Move (de)allocation of the BAR outside the PciDevice trait
  • Encode in the PCI handling logic the fact that we only ever create 64-bit BAR regions with fixed size (for VirtIO) devices

Reason

A lot of the PCI code we pulled from Cloud Hypervisor is built in the form of a library trying to be generic and support various functionalities of the PCI specification.a

In our case, we only support certain functionality. For example, we only have two types of PCI devices, VirtIO devices and the Root Port. Only VirtIO devices use BAR regions (64 bits). We don't use 32-bit, I/O or ROM BARs.

We'd like to move towards a PCI implementation which is tailored to our use-case instead of a generic library implementation.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

Copy link

codecov bot commented Sep 2, 2025
edited
Loading

Codecov Report

❌ Patch coverage is 94.00000% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.85%. Comparing base (5dde18c) to head (f34198a).

Files with missing lines Patch % Lines
src/pci/src/configuration.rs 96.87% 2 Missing ⚠️
src/vmm/src/device_manager/pci_mngr.rs 77.77% 2 Missing ⚠️
src/vmm/src/devices/virtio/transport/pci/device.rs 92.59% 2 Missing ⚠️
Additional details and impacted files
@@ Coverage Diff @@
## main #5426 +/- ##
==========================================
+ Coverage 82.73% 82.85% +0.12% 
==========================================
 Files 263 263 
 Lines 27455 27344 -111 
==========================================
- Hits 22715 22656 -59 
+ Misses 4740 4688 -52 
Flag Coverage Δ
5.10-m5n.metal 82.97% <94.00%> (+0.14%) ⬆️
5.10-m6a.metal 82.22% <94.00%> (+0.13%) ⬆️
5.10-m6g.metal 79.65% <94.00%> (+0.12%) ⬆️
5.10-m6i.metal 82.97% <94.00%> (+0.14%) ⬆️
5.10-m7a.metal-48xl 82.21% <94.00%> (+0.13%) ⬆️
5.10-m7g.metal 79.65% <94.00%> (+0.12%) ⬆️
5.10-m7i.metal-24xl 82.94% <94.00%> (+0.13%) ⬆️
5.10-m7i.metal-48xl 82.93% <94.00%> (+0.12%) ⬆️
5.10-m8g.metal-24xl 79.64% <94.00%> (+0.12%) ⬆️
5.10-m8g.metal-48xl 79.64% <94.00%> (+0.11%) ⬆️
6.1-m5n.metal 83.01% <94.00%> (+0.13%) ⬆️
6.1-m6a.metal 82.26% <94.00%> (?)
6.1-m6g.metal 79.65% <94.00%> (+0.12%) ⬆️
6.1-m6i.metal 83.01% <94.00%> (+0.14%) ⬆️
6.1-m7a.metal-48xl 82.26% <94.00%> (+0.13%) ⬆️
6.1-m7g.metal 79.65% <94.00%> (+0.12%) ⬆️
6.1-m7i.metal-24xl 83.01% <94.00%> (+0.13%) ⬆️
6.1-m7i.metal-48xl 83.02% <94.00%> (+0.14%) ⬆️
6.1-m8g.metal-24xl 79.64% <94.00%> (+0.12%) ⬆️
6.1-m8g.metal-48xl 79.64% <94.00%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

PciDevice trait was defining (optional) methods to allocate and
deallocate BAR regions for devices. This logic is quite device specific
and when invoked we typically know what device it refers to exactly.
Currently, we only implement the allocate API for VirtIO devices (no
deallocation whatsoever).
Simplify things by dropping these APIs from the trait definition. Also,
remove the 32bit allocator argument from allocate_bars(); we always only
allocate a single 64-bit BAR for VirtIO devices.
Signed-off-by: Babis Chalios <bchalios@amazon.es>
In our case, only VirtIO PCI devices use a BAR region. Moreover, this
region has a fixed size and it's always a 64-bit region. This means we
don't handle 32-bit or IO BARs and we don't support a ROM bar. Encode
these assumptions to the logic that adds BAR regions and detects BAR
reprogramming attempts from the guest side so that we simplify the code.
Signed-off-by: Babis Chalios <bchalios@amazon.es>
@bchalios bchalios marked this pull request as ready for review September 2, 2025 12:19
@bchalios bchalios enabled auto-merge (rebase) September 2, 2025 12:19
@bchalios bchalios self-assigned this Sep 2, 2025
@bchalios bchalios added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Sep 2, 2025
let reg_idx = BAR0_REG + bar_idx;

if bar_idx >= NUM_BAR_REGS {
return Err(Error::BarInvalid(bar_idx));
error!("pci: invalid BAR index: {bar_idx}");
Copy link
Contributor

@ShadowCurse ShadowCurse Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These error logs are quite useless because this error will propagate all the way to the vm builder code and then stop the firecracker. Actually maybe we can use asserts here? This is an internal configuration code, so we should not add 'invalid' bars anyway.

Copy link
Contributor Author

@bchalios bchalios Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are in the path of /actions InstanceStart, so I think we shouldn't assert. We should return an Error to the user, right? In that case, the error message in the logs is meant to help us debug more easily what went wrong.

Copy link
Contributor

@ShadowCurse ShadowCurse Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but user does not control the outcome (except I guess by the usage of pci flag), so in this case I think it is better to panic as this is our and not user responsibility to make sure our internal logic does not break.
Also regarding the error!: I don't mind it, but we don't do it in other places + the concatenated error printed to the use will contain same or more info than this. So I would removed mainly for consistency with rest of the code base.

Copy link
Contributor Author

@bchalios bchalios Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's bad user experience to crash the process even when we are crashing it because of an internal bug. I think the correct thing to do is let the user know that something went wrong, by returning an error code. On that I feel quite strongly about.

Regarding the error message, I don't have strong feelings, you are right that the error message will contain similar info. I just think it's easier for debugging, but I'm fine dropping it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mh, I thought the discussions we had internally were that for internal logic errors, we do want to panic. That's definitely how I've been writing my code for the last few months 🤔

Copy link
Contributor Author

@bchalios bchalios Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding was that any code along the code path of an API request should not crash. :/ Maybe I got it wrong?

Copy link
Contributor

@ShadowCurse ShadowCurse Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with returning errors along whole path which start with API request is that all interactions with FC fall under this category in some way or another. The discussions we had basically narrowed down to "if it is internal only" we should assert. Also, we already have asserts in PCI code like: https://github.com/firecracker-microvm/firecracker/blob/main/src/pci/src/msix.rs#L210

Copy link
Contributor Author

@bchalios bchalios Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, but these asserts are triggered by the guest MMIO exits not along an API path. In any case, if indeed my understanding was wrong, I'm happy to turn these in assertions/unwraps

Copy link
Contributor

@ShadowCurse ShadowCurse Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.. you are right, we have asserts in "guest mmio" path, but here we have error in "internal config" path, so we do error handling totally upside down so to speak... Here, since it is only internal configuration we should assert, while in "guest mmio" path we should have error propagation or better a validity checks up front and then assert inside PCI code.

let total_len = cap_data.bytes().len() + 2;
let (cap_offset, tail_offset) = match self.last_capability {
Some((offset, len)) => (Self::next_dword(offset, len), offset + 1),
None => (FIRST_CAPABILITY_OFFSET, CAPABILITY_LIST_HEAD_OFFSET),
};
let end_offset = cap_offset
.checked_add(total_len)
.ok_or(Error::CapabilitySpaceFull(total_len))?;
.ok_or(PciError::CapabilitySpaceOverflow(total_len, cap_offset))?;
Copy link
Contributor

@ShadowCurse ShadowCurse Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about asserts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Reviewers

@ShadowCurse ShadowCurse ShadowCurse left review comments

@roypat roypat roypat left review comments

At least 2 approving reviews are required to merge this pull request.

Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

AltStyle によって変換されたページ (->オリジナル) /