Conversation
📝 Walkthrough

The pull request introduces a new command-line flag, `ctrl-loss-tmo`.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (8)
cmd/connect-all.go (2)
58-59: Consider adding a shorthand flag option for consistency.

Most other flags in this command have shorthand options (e.g., `-a`, `-s`, `-q`). Let's consider adding a shorthand for `ctrl-loss-tmo` to maintain consistency with the existing pattern. Here's a suggested change:

```diff
-cmd.Flags().IntP("ctrl-loss-tmo", "", -1, "controller loss timeout period (in seconds). Timeout is disabled by default (-1)")
+cmd.Flags().IntP("ctrl-loss-tmo", "l", -1, "controller loss timeout period (in seconds). Timeout is disabled by default (-1)")
```
Line range hint 75-75: Fix undefined variable usage.

There seems to be an undefined `kato` variable being used in the `DiscoverRequest`. We should use `viper.GetInt("connect-all.kato")` for consistency. Here's the suggested fix:

```diff
-    Kato: kato,
+    Kato: viper.GetInt("connect-all.kato"),
```

pkg/nvmeclient/nvme_client.go (2)
505-507: Add parameter documentation for better maintainability.

Let's add documentation for the new `ctrlLossTMO` parameter to clarify its purpose and expected values.

```diff
 func ConnectAll(discoveryRequest *hostapi.DiscoverRequest, maxIOQueues int, kato int,
-	ctrlLossTMO int) ([]*CtrlIdentifier, error) {
+	ctrlLossTMO int) ([]*CtrlIdentifier, error) {
+	// ctrlLossTMO: Controller loss timeout in seconds. -1 disables the timeout.
```
535-539: Consider adding validation for extreme timeout values.

While the implementation is correct, let's consider adding validation for the `ctrlLossTMO` parameter to prevent potential issues with extreme values. The `ToOptions()` method already checks for `>= -1`, but we might want to add an upper bound check.

```diff
 func ConnectAllNVMEDevices(logPageEntries []*hostapi.NvmeDiscPageEntry, hostnqn string,
 	transport string, maxIOQueues int, kato int, ctrlLossTMO int,
 ) []*CtrlIdentifier {
+	// Validate ctrlLossTMO bounds
+	if ctrlLossTMO > 3600 { // Example: limit to 1 hour
+		logrus.Warnf("ctrlLossTMO value %d exceeds maximum allowed (3600), capping to 3600", ctrlLossTMO)
+		ctrlLossTMO = 3600
+	}
```

Also applies to: 552-552
pkg/clientconfig/cache.go (4)
51-53: Let's maintain consistent field naming conventions.

The field naming is inconsistent between structs: `TKey.ctrlLossTMO` uses camelCase while `Connection.CtrlLossTMO` uses PascalCase. Additionally, let's enhance the comment for better clarity.

```diff
 type TKey struct {
 	transport string
 	Ip        string
 	port      int
-	ctrlLossTMO int
+	CtrlLossTMO int // Controller loss timeout in seconds (-1 to disable)
 }

 type Connection struct {
 	// ...
-	CtrlLossTMO int // seconds
+	CtrlLossTMO int // Controller loss timeout in seconds (-1 to disable)
 }
```

Also applies to: 65-65
491-493: Let's improve code readability with a more concise initialization.

The `TKey` initialization is more readable with one field per line:

```diff
-	key := TKey{transport: newEntry.Transport, Ip: newEntry.Traddr,
-		port: newEntry.Trsvcid, Nqn: newEntry.Subsysnqn,
-		hostnqn: newEntry.Hostnqn, ctrlLossTMO: newEntry.CtrlLossTMO}
+	key := TKey{
+		transport:   newEntry.Transport,
+		Ip:          newEntry.Traddr,
+		port:        newEntry.Trsvcid,
+		Nqn:         newEntry.Subsysnqn,
+		hostnqn:     newEntry.Hostnqn,
+		ctrlLossTMO: newEntry.CtrlLossTMO,
+	}
```
507-507: Let's enhance the error message for better debugging.

The error message could be more descriptive about the mismatch:

```diff
-	err := fmt.Errorf("Entry %+v not cached, though '%s' is in cache", newEntry, conn)
+	err := fmt.Errorf("Entry %+v not found in cache entries, but connection '%s' exists in cache connections", newEntry, conn)
```
We need to update the `getEntryFromReferral` function to include the `CtrlLossTMO` field.

The `Entry` struct includes a `CtrlLossTMO` field, but when creating new entries from referrals in `getEntryFromReferral`, we're not setting this field. This could lead to inconsistent timeout behavior.

- In `pkg/clientconfig/cache.go`, we need to update `getEntryFromReferral` to include the `CtrlLossTMO` field in the returned `Entry`; see the sketch after this list.
- The timeout value should be propagated from the referral context or use a default value consistent with the rest of the codebase.
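A minimal sketch of the suggested fix. All type and field names below are illustrative stand-ins, not the repository's actual definitions:

```go
package clientconfig

// Minimal illustrative types; the real definitions live elsewhere in the repo.
type NvmeDiscPageEntry struct {
	Traddr  string
	TrsvcID int
	Subnqn  string
}

type Entry struct {
	Transport   string
	Traddr      string
	Trsvcid     int
	Subsysnqn   string
	CtrlLossTMO int
}

// getEntryFromReferral (sketch): thread a ctrlLossTMO value through the
// referral-to-entry conversion so referral-derived entries don't silently
// fall back to the zero value.
func getEntryFromReferral(referral *NvmeDiscPageEntry, ctrlLossTMO int) *Entry {
	return &Entry{
		Transport:   "tcp", // assumption: transport taken from referral context
		Traddr:      referral.Traddr,
		Trsvcid:     referral.TrsvcID,
		Subsysnqn:   referral.Subnqn,
		CtrlLossTMO: ctrlLossTMO, // propagate instead of leaving unset
	}
}
```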
🔗 Analysis chain
Line range hint 513-590: Let's verify referral handling with the new timeout field.

The referral handling code in `HandleReferrals` and related functions should be reviewed to ensure proper handling of the `CtrlLossTMO` field when processing referrals.

🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Check if CtrlLossTMO is properly handled in referral-related code
echo "Checking referral handling code for CtrlLossTMO usage..."

# Search for referral-to-entry conversion points
rg -A 5 "getEntryFromReferral|addConnectionsFromReferrals"

# Look for potential timeout-related patterns in referral handling
ast-grep --pattern 'func getEntryFromReferral($_, $_) {
  $$$
  return &Entry{
    $$$
  }
}'
```

Length of output: 2343
Script:
```shell
#!/bin/bash
# Let's check the Entry struct definition and its usage
echo "Checking Entry struct definition..."
rg -A 10 "type Entry struct" pkg/clientconfig/

# Let's also check if CtrlLossTMO is used in the Entry struct
echo -e "\nChecking CtrlLossTMO usage in Entry..."
rg "CtrlLossTMO" pkg/clientconfig/

# Let's examine the Referral struct and NvmeDiscPageEntry to understand the timeout field
echo -e "\nChecking Referral and NvmeDiscPageEntry definitions..."
rg -A 5 "type.*NvmeDiscPageEntry.*struct"
```

Length of output: 2095
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- cmd/connect-all.go (3 hunks)
- pkg/clientconfig/cache.go (4 hunks)
- pkg/clientconfig/conf_parser.go (3 hunks)
- pkg/nvmeclient/nvme_client.go (4 hunks)
- service/service.go (2 hunks)
🔇 Additional comments (9)
cmd/connect-all.go (2)
22-24: LGTM! Clean import organization.
The imports are well-organized and necessary for the new functionality.
79-83: LGTM! Let's verify the ConnectAll interface.
The implementation looks good and properly integrates the new timeout parameter. Let's verify the corresponding changes in the nvmeclient package.
✅ Verification successful
The ConnectAll interface matches the usage perfectly
The verification shows that the ConnectAll function in pkg/nvmeclient/nvme_client.go correctly accepts the ctrlLossTMO parameter, and there are no other usages of ConnectAll that need updating. The implementation is consistent across the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the ConnectAll function signature in nvmeclient package
# Check the ConnectAll function definition
ast-grep --pattern 'func ConnectAll($_, $_, $_, $_) $_ {
$$$
}'
# Look for any other usages of ConnectAll that might need updating
rg -A 2 'ConnectAll\('
Length of output: 1252
pkg/clientconfig/conf_parser.go (1)
165-176: Let's improve the ctrl-loss-tmo flag handling
While the basic parsing is implemented correctly, we should consider a few enhancements:
- Let's add validation for the maximum allowed value
- We should set a default value when none is provided
- It would be helpful to document the flag's meaning and valid range in a comment
Here's a suggested enhancement:
+const (
+ // Maximum controller loss timeout in seconds
+ // -1: disabled, 0: no timeout, >0: timeout in seconds
+ MaxCtrlLossTMO = 3600 // 1 hour
+)
case "-l", "--ctrl-loss-tmo":
i++
value := strings.TrimSpace(s[i])
ctrlLossTMO, err := strconv.ParseInt(value, 10, 32)
if err != nil {
return nil, &ParserError{
Msg: fmt.Sprintf("bad controller loss timeout value"),
Details: fmt.Sprintf("%s is not a valid int", s[i]),
Err: err,
}
}
+ if ctrlLossTMO > MaxCtrlLossTMO {
+ return nil, &ParserError{
+ Msg: fmt.Sprintf("controller loss timeout value too large"),
+ Details: fmt.Sprintf("maximum allowed value is %d", MaxCtrlLossTMO),
+ Err: nil,
+ }
+ }
e.CtrlLossTMO = int(ctrlLossTMO)
+} else {
+    e.CtrlLossTMO = -1 // Set default to disabled
+}

Let's verify the usage of CtrlLossTMO across the codebase:
service/service.go (2)
59-59: Formatting change noted
This is a minor formatting adjustment that doesn't affect functionality.
258-260: Implementation aligns with PR objectives
The addition of conn.CtrlLossTMO parameter to the ConnectAllNVMEDevices call successfully implements the ctrl-loss-tmo flag support. Let's verify the parameter usage across the codebase to ensure consistent implementation.
Let's run this script to verify the parameter usage:
✅ Verification successful
CtrlLossTMO parameter is consistently implemented across the codebase
The verification shows that the CtrlLossTMO parameter is properly integrated:
- Defined in the `Connection` struct with clear documentation (seconds)
- Correctly initialized through the configuration chain:
  - Command line via the `connect.ctrl-loss-tmo` flag
  - Properly passed through connection initialization
- Consistently used in `ConnectAllNVMEDevices` calls
The implementation is complete and consistent with no issues found.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the consistent usage of CtrlLossTMO across the codebase
# Check for CtrlLossTMO field definition in Connection struct
echo "Checking Connection struct definition..."
rg -p "type\s+Connection\s+struct" -A 20
# Check for CtrlLossTMO initialization
echo "Checking CtrlLossTMO initialization..."
rg "CtrlLossTMO\s*[:=]"
# Check for ConnectAllNVMEDevices usage to ensure all calls include the new parameter
echo "Checking ConnectAllNVMEDevices usage..."
rg -p "ConnectAllNVMEDevices\s*\(" -A 3
Length of output: 1958
pkg/nvmeclient/nvme_client.go (2)
31-32: LGTM!
The imports are properly organized.
512-513: LGTM!
The parameter is correctly passed through to the underlying function.
pkg/clientconfig/cache.go (2)
68-68: LGTM: Function signature update is clean and well-integrated.
The addition of the ctrlLossTMO parameter to newConnection aligns well with the new functionality.
70-73: LGTM: Connection initialization is properly handled.
The CtrlLossTMO field is correctly initialized along with other connection properties.
06493fa to 188de4b (Compare)
muliby-lb left a comment
@yogev-lb seems legit, but:
- does this change the default behavior, or do you only get a different behavior if you set `ctrl-loss-tmo` to something other than -1?
- how did you test this? here are the `discovery-client` tests we have in the CI, at least 279 is a must:
-> /home/muli/lightbits/src/systests/racktests/1923_etcd_mTLS_cross_server_connectivity_new_installation.py [CI: sanity]
-> /home/muli/lightbits/src/systests/racktests/2301_physical_alma_8_installation_sanity.py [CI: weekly (unstable)]
-> /home/muli/lightbits/src/systests/racktests/239_alma_8_vm_installation_test.py [CI: sanity minisanity]
-> /home/muli/lightbits/src/systests/racktests/279_duros_discovery_client.py [CI: weekly (unstable)]
-> /home/muli/lightbits/src/systests/racktests/280_monitoring_installation_test.py [CI: weekly monitoring-stack]
-> /home/muli/lightbits/src/systests/racktests/529_discovery_vm_installation_add_node.py [CI: weekly]
-> /home/muli/lightbits/src/systests/racktests/609_vm_installation_pull.py [CI: nightly (unstable)]
-> /home/muli/lightbits/src/systests/racktests/625_vm_installation_add_new_server_to_cluster.py [CI: weekly]
- @elada-lb does the PR check work already? can we try it on this PR?
btw @yogev-lb gentle reminder, this is a public repo
@muliby-lb to answer your Q:
Yea, we should be good to go with running it
@elada-lb thanks, can you point me at the documentation on how to run it?
PR check running here: https://github.com/LightBitsLabs/discovery-client/actions/runs/12006948311 (thanks @elada-lb)
@elada-lb doesn't look like it ran much... Can you take a look?
Thanks @elada-lb, I see it now. Looking at the workflow file, unless I am mistaken it ran on the master branch and not on this PR branch.
@muliby-lb @ronen-lb The CI passed for the DC according to this run: https://github.com/lightbitslabs/lbcitests/actions/runs/12006959193; waiting for approval...
@yogev-lb I ran 239 a couple of times manually and it fails both times. @elada-lb this strengthens my suspicion that the PR check (which shows 239 as passed) actually ran on master and not with @yogev-lb's branch. Please check.
I'm pretty sure that's the case; we're currently looking into this.
After merging #31 and a few other adjustments in lbcitests, you should be able to run the PR checks with supporting side branches as expected. Please try this out and let me know if there are other issues.
278ca61 to 364fe8d (Compare)
@elada-lb it failed in https://github.com/LightBitsLabs/lbcitests/actions/runs/12044433121/job/33581436912, can you please take a look?
@muliby-lb Why'd you pass
@elada-lb quite possibly I misunderstood how it's meant to work. Where do I specify which branch/PR to run the PR check on?
55aa786 to 8011d33 (Compare)
8011d33 to c5a449a (Compare)
c5a449a to 8011d33 (Compare)
8011d33 to f45b96f (Compare)
Till now we didn't support this flag in the DC configuration; it was added only on the connect command line but not as a service. We needed to update the file parser, update the cache, and pass this information to the ConnectAll command from the service. We differentiate between cache entries coming from the user and from referrals: a user may provide a non-default ctrl-loss-tmo for a server, and when a referral notifies us of a newly added server we will not have this value. When we add the referral entry we go over all the user entries, and if one of them is set we use that value and apply it to the referral entry (see the sketch below). Sadly the DS is not able to provide us this info about the new server we connected to.

issue: LBM1-35562
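A minimal sketch of the merge rule described above, with a hypothetical type and helper name (not the repository's actual API):

```go
package clientconfig

// Entry is a minimal stand-in for the real cache entry type.
type Entry struct {
	CtrlLossTMO int // -1 means "not set / timeout disabled"
}

// ctrlLossTMOForReferral (sketch): when adding a referral-derived entry,
// scan the user-provided entries and reuse the first explicitly set
// ctrl-loss-tmo, since the discovery service cannot supply this value.
func ctrlLossTMOForReferral(userEntries []*Entry) int {
	for _, e := range userEntries {
		if e.CtrlLossTMO != -1 {
			return e.CtrlLossTMO
		}
	}
	return -1 // no user value found: keep the default (disabled)
}
```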
There are still OSes out there that do not load nvme-tcp by default. This causes the DC to not function properly when it tries to connect to the io-controller. Today we only see it in the logs, the server still runs successfully, and it is tricky to locate. This change fails fast on service load and reports what, in high probability, needs to be done: load nvme-tcp.
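A minimal sketch of such a fail-fast check, assuming a Linux host; the sysfs path is one plausible way to detect the module, and the error text is illustrative:

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// verifyNVMeTCPLoaded (sketch): fail fast at service startup if the
// nvme-tcp transport does not appear to be available, instead of
// failing much later when connecting to the io-controller.
func verifyNVMeTCPLoaded() error {
	// Assumption: a loaded nvme_tcp module shows up under /sys/module.
	if _, err := os.Stat("/sys/module/nvme_tcp"); err != nil {
		return fmt.Errorf("nvme-tcp kernel module does not appear to be loaded; "+
			"try 'modprobe nvme-tcp' before starting the discovery-client: %w", err)
	}
	return nil
}

func main() {
	if err := verifyNVMeTCPLoaded(); err != nil {
		log.Fatal(err) // fail fast with an actionable message
	}
}
```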
Today when we fail on an application error we see the help msg, which is not relevant for these errors and is confusing.
If we want to run liveness probes on the discovery-client we need to run them from the container. When a service exposes an endpoint (like this one exposing http) it is good practice to expose a healthz endpoint that can be invoked by k8s or docker-compose to indicate container status.
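A minimal sketch of such an endpoint; the port and the health predicate are illustrative, not the discovery-client's actual wiring:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// serviceHealthy is a stand-in for whatever internal state the real
// service would consult; purely illustrative.
func serviceHealthy() bool { return true }

func main() {
	// /healthz: a liveness endpoint that k8s or docker-compose probes
	// can invoke to learn container status.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if serviceHealthy() {
			w.WriteHeader(http.StatusOK)
			fmt.Fprintln(w, "ok")
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	log.Fatal(http.ListenAndServe(":8080", nil)) // assumption: port
}
```

A k8s liveness probe or docker-compose healthcheck can then GET /healthz and treat any non-200 status as unhealthy.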
If the interval is 0 we will panic with: `panic: non-positive interval for NewTicker`. Set the default to 5s to prevent such issues.

Signed-off-by: alon <alon@lightbitslabs.com>
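A minimal sketch of the guard, using the 5s default mentioned above (the helper and package name are illustrative):

```go
package timeutil

import "time"

const defaultInterval = 5 * time.Second

// newSafeTicker (sketch): time.NewTicker panics with
// "non-positive interval for NewTicker" when given d <= 0,
// so clamp to the 5s default before constructing the ticker.
func newSafeTicker(d time.Duration) *time.Ticker {
	if d <= 0 {
		d = defaultInterval
	}
	return time.NewTicker(d)
}
```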
We noticed that there may be a race between writing and closing the file and shutting down the host abruptly.

Signed-off-by: alon <alon@lightbitslabs.com>
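One common way to harden such a write path is to fsync before closing; a minimal sketch, not necessarily the exact fix applied here:

```go
package fileutil

import "os"

// writeFileDurably (sketch): write data and flush it to stable storage
// before closing, so an abrupt host shutdown is less likely to leave a
// torn or empty file behind.
func writeFileDurably(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // fsync: push data to disk
		f.Close()
		return err
	}
	return f.Close()
}
```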
f45b96f to 8d61b4e (Compare)
We do not need to depend on any external repo's version. This is an independent repo that needs to control its own version.

Signed-off-by: alon <alon@lightbitslabs.com>
8d61b4e to 2f09aae (Compare)
@yogev-lb I don't see a +1 on this from anyone, myself included... I'm happy to re-review and +1 if warranted, but it's better if it comes from someone else who is familiar with the code you're changing.
```go
	Short:             "Start NVMeOF Discovery Client",
	Long:              ``,
	DisableAutoGenTag: true,
	SilenceUsage:      true,
```
for some reason the discovery-client tests don't run as part of this PR, or am I missing something? At some point 2 months ago I think you said something about a new PR check suite for this application
@yogev-lb see the README.md for how to run PR checks on this (public) repo.
@yogev-lb ran PR check here: https://github.com/lightbitslabs/lbcitests/actions/runs/13175771922



Till now we didn't support this flag in the DC configuration; it was added only on the connect command line but not as a service. We needed to update the file parser, update the cache, and pass this information to the ConnectAll command from the service.
This issue was detected when working on DMS.
We noticed in negative tests that when we kill the node of the volume we are using (the volume becomes unavailable), we keep trying to reconnect forever (until the volume becomes available again).

This is not good behavior for us because we want to fail fast (~1m) and decide whether to proceed or not. (We, as opposed to other customers, don't really care about the data: we can always retry the clone, and we don't support data resume yet, so we can't retry when it becomes available.) It surprised me that this feature is not working, since I remember that we told many customers (including @yant-lb in the LAB) that it does.
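For illustration only, a discovery-client entry that enables the timeout might look like the following; the address and NQN are made up, and the exact entry format should be checked against the file parser changes above (which accept `-l` / `--ctrl-loss-tmo`):

```
-t tcp -a 10.10.10.10 -s 8009 -q nqn.2014-08.org.nvmexpress:uuid:example-host --ctrl-loss-tmo 60
```

With `--ctrl-loss-tmo 60` the host gives up reconnecting after 60 seconds instead of retrying forever; `-1` (the default) disables the timeout.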
see:
Issue: LBM1-36497