-
Notifications
You must be signed in to change notification settings - Fork 9
Directnic nvidia stable 10.1 #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: nvidia_stable-10.1
Are you sure you want to change the base?
Changes from all commits
60e3535
2946d5f
572888e
ec9eaad
1ecaf27
aa06021
bbbdeb8
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1178,10 +1178,11 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset, bool aligned) | |
| } | ||
|
|
||
| /* ACS (Access Control Services) */ | ||
| void pcie_acs_init(PCIDevice *dev, uint16_t offset) | ||
| int pcie_acs_init(PCIDevice *dev, uint16_t offset, uint16_t ctrl_bits, Error **errp) | ||
| { | ||
| bool is_downstream = pci_is_express_downstream_port(dev); | ||
| uint16_t cap_bits = 0; | ||
| PCIEPort *p = PCIE_PORT(dev); | ||
|
|
||
| /* For endpoints, only multifunction devs may have an ACS capability: */ | ||
| assert(is_downstream || | ||
|
|
@@ -1202,16 +1203,33 @@ void pcie_acs_init(PCIDevice *dev, uint16_t offset) | |
| */ | ||
| cap_bits = PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | | ||
| PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT; | ||
|
|
||
| if (ctrl_bits & ~cap_bits) { | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. capability check LGTM
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for your thorough review :) |
||
| error_setg(errp, "Unsupported ACS capabilities 0x%hx were supplied. " | ||
| "Supported capabilities are 0x%hx", ctrl_bits & ~cap_bits, | ||
| cap_bits); | ||
| return -EINVAL; | ||
| } | ||
| } | ||
|
|
||
| pci_set_word(dev->config + offset + PCI_ACS_CAP, cap_bits); | ||
| pci_set_word(dev->wmask + offset + PCI_ACS_CTRL, cap_bits); | ||
|
|
||
| if (is_downstream && p->acs_caps) { | ||
| /* Block guest writes to ACS Control entirely to preserve QEMU ACS settings */ | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Mainly for my understanding: If this is only needed for GPUDirect RDMA, is it necessary to make this change for all PCI devices (as opposed to just NICs and GPUs?) I'd also like to have a better understanding of whether you think this kind of change might cause any unexpected end-user-visible behavior when doing passthrough operations, which might manifest differently than if they were doing the same operation on a different platform with mainline qemu. (If so, can you link to where this divergent behavior is documented somewhere in your end-user instructions?) Also, making the ACS read-only to the guest seems like something that might be useful to other hardware, and also might be nice to have as an explicit qemu configuration option rather than being implemented in this function - are there plans to make this a more general configurable in that future upstream PR?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
We are adding 'acs_caps' as a generic option, right? ACS is per device primary applicable RP and Downstream Ports , and so 'acs_caps'.
Admin who is launching QEMU and using ACS should know what they are doing. Like kernel which allows to use 'config_acs' kernel parameter , acs_caps serves the same purpose, for example in this case to allow p2p between pcie devices e.g. GPU and NIC.
The acs_caps is generic option for all PCIe. FWIW, here is the Testing command for GB200/GB300 GPUDirect RDMA using Nvidia GPU, CX8 and Data Direct Interface: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the example command @tdavenvidia .
Yeah, that much makes sense to me (that if you set acs-caps=0 for a port in qemu options, it will be read-only), but what I'm really asking is more like: Is the new behavior (of forcing the ACS to be read-only for all downstream ports, even if the user set acs-caps to something
That's reasonable. My only concern was that it seemed to me that this could be viewed as unexpected behavior, if the user sets acs-caps=nonzero on a downstream port in their qemu opts, and then it is silently overridden to be 0 here - but if you think that users would only ever really be touching that option if they already have themselves reviewed how it works in their qemu source, I think your code comment on line 1218 is probably sufficient. (Let me know if you think I am misinterpreting anything here - the documentation on these options is a bit sparse, and I have not experimented with them much as a qemu user before.)
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The goal here is to allow user to configure ACS settings for the PCIe device before VM kernel loads and keep preserve those ACS bits by making them read-only , that way the intended p2p works out of the box. Otherwise there is no point of having 'acs-caps' qemu parameter because linux kernel changes the ACS for PCI device anyways for IOMMU isolation. Hint: check linux kernel pci_std_enable_acs() There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That makes sense, thank you for the clarification. |
||
| pci_set_word(dev->wmask + offset + PCI_ACS_CTRL, 0); | ||
| } else { | ||
| pci_set_word(dev->wmask + offset + PCI_ACS_CTRL, cap_bits); | ||
| } | ||
|
|
||
| pci_set_word(dev->config + offset + PCI_ACS_CTRL, ctrl_bits); | ||
|
|
||
| return 0; | ||
| } | ||
|
|
||
| void pcie_acs_reset(PCIDevice *dev) | ||
| void pcie_acs_reset(PCIDevice *dev, uint16_t val) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: can we name this something more descriptive than
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "val" refers to as a PCIe ACS Register value, so IMO it is fine. |
||
| { | ||
| if (dev->exp.acs_cap) { | ||
| pci_set_word(dev->config + dev->exp.acs_cap + PCI_ACS_CTRL, 0); | ||
| pci_set_word(dev->config + dev->exp.acs_cap + PCI_ACS_CTRL, val); | ||
| } | ||
| } | ||
|
|
||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.