Conversation
📝 Walkthrough

The pull request introduces a new command-line flag, `ctrl-loss-tmo`.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (8)
cmd/connect-all.go (2)
58-59: Consider adding a shorthand flag option for consistency.

Most other flags in this command have shorthand options (e.g., `-a`, `-s`, `-q`). Let's consider adding a shorthand for `ctrl-loss-tmo` to maintain consistency with the existing pattern. Here's a suggested change:

```diff
-cmd.Flags().IntP("ctrl-loss-tmo", "", -1, "controller loss timeout period (in seconds). Timeout is disabled by default (-1)")
+cmd.Flags().IntP("ctrl-loss-tmo", "l", -1, "controller loss timeout period (in seconds). Timeout is disabled by default (-1)")
```
Line range hint 75-75: Fix undefined variable usage.

There seems to be an undefined `kato` variable being used in the `DiscoverRequest`. We should use `viper.GetInt("connect-all.kato")` for consistency. Here's the suggested fix:

```diff
-    Kato: kato,
+    Kato: viper.GetInt("connect-all.kato"),
```

pkg/nvmeclient/nvme_client.go (2)
505-507: Add parameter documentation for better maintainability.

Let's add documentation for the new `ctrlLossTMO` parameter to clarify its purpose and expected values.

```diff
 func ConnectAll(discoveryRequest *hostapi.DiscoverRequest, maxIOQueues int, kato int,
-	ctrlLossTMO int) ([]*CtrlIdentifier, error) {
+	ctrlLossTMO int) ([]*CtrlIdentifier, error) {
+	// ctrlLossTMO: Controller loss timeout in seconds. -1 disables the timeout.
```
535-539: Consider adding validation for extreme timeout values.

While the implementation is correct, let's consider adding validation for the `ctrlLossTMO` parameter to prevent potential issues with extreme values. The `ToOptions()` method already checks for `>= -1`, but we might want to add an upper bound check.

```diff
 func ConnectAllNVMEDevices(logPageEntries []*hostapi.NvmeDiscPageEntry, hostnqn string,
 	transport string, maxIOQueues int, kato int, ctrlLossTMO int,
 ) []*CtrlIdentifier {
+	// Validate ctrlLossTMO bounds
+	if ctrlLossTMO > 3600 { // Example: limit to 1 hour
+		logrus.Warnf("ctrlLossTMO value %d exceeds maximum allowed (3600), capping to 3600", ctrlLossTMO)
+		ctrlLossTMO = 3600
+	}
```

Also applies to: 552-552
pkg/clientconfig/cache.go (4)
51-53: Let's maintain consistent field naming conventions.

The field naming is inconsistent between structs: `TKey.ctrlLossTMO` uses camelCase while `Connection.CtrlLossTMO` uses PascalCase. Additionally, let's enhance the comment for better clarity.

```diff
 type TKey struct {
 	transport string
 	Ip        string
 	port      int
-	ctrlLossTMO int
+	CtrlLossTMO int // Controller loss timeout in seconds (-1 to disable)
 }

 type Connection struct {
 	// ...
-	CtrlLossTMO int // seconds
+	CtrlLossTMO int // Controller loss timeout in seconds (-1 to disable)
 }
```

Also applies to: 65-65
491-493: Let's improve code readability with a more concise initialization.

The `TKey` initialization is more readable with one field per line:

```diff
-	key := TKey{transport: newEntry.Transport, Ip: newEntry.Traddr,
-		port: newEntry.Trsvcid, Nqn: newEntry.Subsysnqn,
-		hostnqn: newEntry.Hostnqn, ctrlLossTMO: newEntry.CtrlLossTMO}
+	key := TKey{
+		transport:   newEntry.Transport,
+		Ip:          newEntry.Traddr,
+		port:        newEntry.Trsvcid,
+		Nqn:         newEntry.Subsysnqn,
+		hostnqn:     newEntry.Hostnqn,
+		ctrlLossTMO: newEntry.CtrlLossTMO,
+	}
```
507-507: Let's enhance the error message for better debugging.

The error message could be more descriptive about the mismatch:

```diff
-	err := fmt.Errorf("Entry %+v not cached, though '%s' is in cache", newEntry, conn)
+	err := fmt.Errorf("Entry %+v not found in cache entries, but connection '%s' exists in cache connections", newEntry, conn)
```
We need to update the `getEntryFromReferral` function to include the `CtrlLossTMO` field.

The `Entry` struct includes a `CtrlLossTMO` field, but when creating new entries from referrals in `getEntryFromReferral`, we're not setting this field. This could lead to inconsistent timeout behavior.

- In `pkg/clientconfig/cache.go`, we need to update `getEntryFromReferral` to include the `CtrlLossTMO` field in the returned `Entry`; see the sketch after this list.
- The timeout value should be propagated from the referral context or use a default value consistent with the rest of the codebase.
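A minimal sketch of the suggested fix. All type and field names below are illustrative stand-ins, not the repository's actual definitions:

```go
package clientconfig

// Minimal illustrative types; the real definitions live elsewhere in the repo.
type NvmeDiscPageEntry struct {
	Traddr  string
	TrsvcID int
	Subnqn  string
}

type Entry struct {
	Transport   string
	Traddr      string
	Trsvcid     int
	Subsysnqn   string
	CtrlLossTMO int
}

// getEntryFromReferral (sketch): thread a ctrlLossTMO value through the
// referral-to-entry conversion so referral-derived entries don't silently
// fall back to the zero value.
func getEntryFromReferral(referral *NvmeDiscPageEntry, ctrlLossTMO int) *Entry {
	return &Entry{
		Transport:   "tcp", // assumption: transport taken from referral context
		Traddr:      referral.Traddr,
		Trsvcid:     referral.TrsvcID,
		Subsysnqn:   referral.Subnqn,
		CtrlLossTMO: ctrlLossTMO, // propagate instead of leaving unset
	}
}
```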
🔗 Analysis chain
Line range hint 513-590: Let's verify referral handling with the new timeout field.

The referral handling code in `HandleReferrals` and related functions should be reviewed to ensure proper handling of the `CtrlLossTMO` field when processing referrals.

🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Check if CtrlLossTMO is properly handled in referral-related code
echo "Checking referral handling code for CtrlLossTMO usage..."

# Search for referral-to-entry conversion points
rg -A 5 "getEntryFromReferral|addConnectionsFromReferrals"

# Look for potential timeout-related patterns in referral handling
ast-grep --pattern 'func getEntryFromReferral($_, $_) {
  $$$
  return &Entry{
    $$$
  }
}'
```

Length of output: 2343
Script:
```shell
#!/bin/bash
# Let's check the Entry struct definition and its usage
echo "Checking Entry struct definition..."
rg -A 10 "type Entry struct" pkg/clientconfig/

# Let's also check if CtrlLossTMO is used in the Entry struct
echo -e "\nChecking CtrlLossTMO usage in Entry..."
rg "CtrlLossTMO" pkg/clientconfig/

# Let's examine the Referral struct and NvmeDiscPageEntry to understand the timeout field
echo -e "\nChecking Referral and NvmeDiscPageEntry definitions..."
rg -A 5 "type.*NvmeDiscPageEntry.*struct"
```

Length of output: 2095
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- cmd/connect-all.go (3 hunks)
- pkg/clientconfig/cache.go (4 hunks)
- pkg/clientconfig/conf_parser.go (3 hunks)
- pkg/nvmeclient/nvme_client.go (4 hunks)
- service/service.go (2 hunks)
🔇 Additional comments (9)
cmd/connect-all.go (2)
22-24: LGTM! Clean import organization.
The imports are well-organized and necessary for the new functionality.
79-83: LGTM! Let's verify the ConnectAll interface.
The implementation looks good and properly integrates the new timeout parameter. Let's verify the corresponding changes in the nvmeclient package.
✅ Verification successful
The ConnectAll interface matches the usage perfectly
The verification shows that the ConnectAll function in pkg/nvmeclient/nvme_client.go correctly accepts the ctrlLossTMO parameter, and there are no other usages of ConnectAll that need updating. The implementation is consistent across the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the ConnectAll function signature in nvmeclient package
# Check the ConnectAll function definition
ast-grep --pattern 'func ConnectAll($_, $_, $_, $_) $_ {
$$$
}'
# Look for any other usages of ConnectAll that might need updating
rg -A 2 'ConnectAll\('
Length of output: 1252
pkg/clientconfig/conf_parser.go (1)
165-176: Let's improve the ctrl-loss-tmo flag handling
While the basic parsing is implemented correctly, we should consider a few enhancements:
- Let's add validation for the maximum allowed value
- We should set a default value when none is provided
- It would be helpful to document the flag's meaning and valid range in a comment
Here's a suggested enhancement:
+const (
+ // Maximum controller loss timeout in seconds
+ // -1: disabled, 0: no timeout, >0: timeout in seconds
+ MaxCtrlLossTMO = 3600 // 1 hour
+)
case "-l", "--ctrl-loss-tmo":
i++
value := strings.TrimSpace(s[i])
ctrlLossTMO, err := strconv.ParseInt(value, 10, 32)
if err != nil {
return nil, &ParserError{
Msg: fmt.Sprintf("bad controller loss timeout value"),
Details: fmt.Sprintf("%s is not a valid int", s[i]),
Err: err,
}
}
+ if ctrlLossTMO > MaxCtrlLossTMO {
+ return nil, &ParserError{
+ Msg: fmt.Sprintf("controller loss timeout value too large"),
+ Details: fmt.Sprintf("maximum allowed value is %d", MaxCtrlLossTMO),
+ Err: nil,
+ }
+ }
e.CtrlLossTMO = int(ctrlLossTMO)
+} else {
+    e.CtrlLossTMO = -1 // Set default to disabled
+}

Let's verify the usage of CtrlLossTMO across the codebase:
service/service.go (2)
59-59: Formatting change noted
This is a minor formatting adjustment that doesn't affect functionality.
258-260: Implementation aligns with PR objectives
The addition of conn.CtrlLossTMO parameter to the ConnectAllNVMEDevices call successfully implements the ctrl-loss-tmo flag support. Let's verify the parameter usage across the codebase to ensure consistent implementation.
Let's run this script to verify the parameter usage:
✅ Verification successful
CtrlLossTMO parameter is consistently implemented across the codebase
The verification shows that the CtrlLossTMO parameter is properly integrated:
- Defined in the `Connection` struct with clear documentation (seconds)
- Correctly initialized through the configuration chain:
  - Command line via the `connect.ctrl-loss-tmo` flag
  - Properly passed through connection initialization
- Consistently used in `ConnectAllNVMEDevices` calls
The implementation is complete and consistent with no issues found.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the consistent usage of CtrlLossTMO across the codebase
# Check for CtrlLossTMO field definition in Connection struct
echo "Checking Connection struct definition..."
rg -p "type\s+Connection\s+struct" -A 20
# Check for CtrlLossTMO initialization
echo "Checking CtrlLossTMO initialization..."
rg "CtrlLossTMO\s*[:=]"
# Check for ConnectAllNVMEDevices usage to ensure all calls include the new parameter
echo "Checking ConnectAllNVMEDevices usage..."
rg -p "ConnectAllNVMEDevices\s*\(" -A 3
Length of output: 1958
pkg/nvmeclient/nvme_client.go (2)
31-32: LGTM!
The imports are properly organized.
512-513: LGTM!
The parameter is correctly passed through to the underlying function.
pkg/clientconfig/cache.go (2)
68-68: LGTM: Function signature update is clean and well-integrated.
The addition of the ctrlLossTMO parameter to newConnection aligns well with the new functionality.
70-73: LGTM: Connection initialization is properly handled.
The CtrlLossTMO field is correctly initialized along with other connection properties.
06493fa to 188de4b (Compare)
muliby-lb left a comment
@yogev-lb seems legit, but:
- does this change the default behavior, or do you only get a different behavior if you set `ctrl-loss-tmo` to something other than -1?
- how did you test this? here are the `discovery-client` tests we have in the CI, at least 279 is a must:
-> /home/muli/lightbits/src/systests/racktests/1923_etcd_mTLS_cross_server_connectivity_new_installation.py [CI: sanity]
-> /home/muli/lightbits/src/systests/racktests/2301_physical_alma_8_installation_sanity.py [CI: weekly (unstable)]
-> /home/muli/lightbits/src/systests/racktests/239_alma_8_vm_installation_test.py [CI: sanity minisanity]
-> /home/muli/lightbits/src/systests/racktests/279_duros_discovery_client.py [CI: weekly (unstable)]
-> /home/muli/lightbits/src/systests/racktests/280_monitoring_installation_test.py [CI: weekly monitoring-stack]
-> /home/muli/lightbits/src/systests/racktests/529_discovery_vm_installation_add_node.py [CI: weekly]
-> /home/muli/lightbits/src/systests/racktests/609_vm_installation_pull.py [CI: nightly (unstable)]
-> /home/muli/lightbits/src/systests/racktests/625_vm_installation_add_new_server_to_cluster.py [CI: weekly]
- @elada-lb does the PR check work already? can we try it on this PR?
btw @yogev-lb gentle reminder, this is a public repo
@muliby-lb to answer your Q:
Yea, we should be good to go with running it
@elada-lb thanks, can you point me at the documentation on how to run it?
PR check running here: https://github.com/LightBitsLabs/discovery-client/actions/runs/12006948311 (thanks @elada-lb)
@elada-lb doesn't look like it ran much... Can you take a look?
Thanks @elada-lb, I see it now. Looking at the workflow file, unless I am mistaken it ran on the master branch and not on this PR branch.
@muliby-lb @ronen-lb The CI passed for the DC according to this run: https://github.com/lightbitslabs/lbcitests/actions/runs/12006959193; waiting for approval...
@yogev-lb I ran 239 a couple of times manually and it fails both times. @elada-lb this strengthens my suspicion that the PR check (which shows 239 as passed) actually ran on master and not with @yogev-lb's branch. Please check.
I'm pretty sure that's the case; we're currently looking into this.
After merging #31 and a few other adjustments in lbcitests, you should be able to run the PR checks with supporting side branches as expected. Please try this out and let me know if there are other issues.
278ca61 to 364fe8d (Compare)
@elada-lb it failed in https://github.com/LightBitsLabs/lbcitests/actions/runs/12044433121/job/33581436912, can you please take a look?
@muliby-lb Why'd you pass
@elada-lb quite possibly I misunderstood how it's meant to work. Where do I specify which branch/PR to run the PR check on?
55aa786 to 8011d33 (Compare)
8011d33 to c5a449a (Compare)
c5a449a to 8011d33 (Compare)
8011d33 to f45b96f (Compare)
Till now we didn't support this flag in the DC configuration; it was added only on the connect command line but not as a service. We needed to update the file parser, update the cache, and pass this information to the ConnectAll command from the service. We differentiate between cache entries coming from the user and from referrals: a user may provide a non-default ctrl-loss-tmo for a server, and when a referral notifies us of a newly added server we will not have this value. When we add the referral entry we go over all the user entries, and if one of them is set we use that value and apply it to the referral entry (see the sketch below). Sadly the DS is not able to provide us this info about the new server we connected to.

issue: LBM1-35562
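A minimal sketch of the merge rule described above, with a hypothetical type and helper name (not the repository's actual API):

```go
package clientconfig

// Entry is a minimal stand-in for the real cache entry type.
type Entry struct {
	CtrlLossTMO int // -1 means "not set / timeout disabled"
}

// ctrlLossTMOForReferral (sketch): when adding a referral-derived entry,
// scan the user-provided entries and reuse the first explicitly set
// ctrl-loss-tmo, since the discovery service cannot supply this value.
func ctrlLossTMOForReferral(userEntries []*Entry) int {
	for _, e := range userEntries {
		if e.CtrlLossTMO != -1 {
			return e.CtrlLossTMO
		}
	}
	return -1 // no user value found: keep the default (disabled)
}
```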
There are still OSes out there that do not load nvme-tcp by default. This causes the DC to not function properly when it tries to connect to the io-controller. Today we only see it in the logs, the server still runs successfully, and it is tricky to locate. This change fails fast on service load and reports what, in high probability, needs to be done: load nvme-tcp.
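A minimal sketch of such a fail-fast check, assuming a Linux host; the sysfs path is one plausible way to detect the module, and the error text is illustrative:

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// verifyNVMeTCPLoaded (sketch): fail fast at service startup if the
// nvme-tcp transport does not appear to be available, instead of
// failing much later when connecting to the io-controller.
func verifyNVMeTCPLoaded() error {
	// Assumption: a loaded nvme_tcp module shows up under /sys/module.
	if _, err := os.Stat("/sys/module/nvme_tcp"); err != nil {
		return fmt.Errorf("nvme-tcp kernel module does not appear to be loaded; "+
			"try 'modprobe nvme-tcp' before starting the discovery-client: %w", err)
	}
	return nil
}

func main() {
	if err := verifyNVMeTCPLoaded(); err != nil {
		log.Fatal(err) // fail fast with an actionable message
	}
}
```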
Today when we fail on an application error we see the help msg, which is not relevant for these errors and is confusing.
If we want to run liveness probes on the discovery-client we need to run them from the container. When a service exposes an endpoint (like this one exposing http) it is good practice to expose a healthz endpoint that can be invoked by k8s or docker-compose to indicate container status.
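A minimal sketch of such an endpoint; the port and the health predicate are illustrative, not the discovery-client's actual wiring:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// serviceHealthy is a stand-in for whatever internal state the real
// service would consult; purely illustrative.
func serviceHealthy() bool { return true }

func main() {
	// /healthz: a liveness endpoint that k8s or docker-compose probes
	// can invoke to learn container status.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if serviceHealthy() {
			w.WriteHeader(http.StatusOK)
			fmt.Fprintln(w, "ok")
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	log.Fatal(http.ListenAndServe(":8080", nil)) // assumption: port
}
```

A k8s liveness probe or docker-compose healthcheck can then GET /healthz and treat any non-200 status as unhealthy.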
If the interval is 0 we will panic with: `panic: non-positive interval for NewTicker`. Set the default to 5s to prevent such issues.

Signed-off-by: alon <alon@lightbitslabs.com>
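A minimal sketch of the guard, using the 5s default mentioned above (the helper and package name are illustrative):

```go
package timeutil

import "time"

const defaultInterval = 5 * time.Second

// newSafeTicker (sketch): time.NewTicker panics with
// "non-positive interval for NewTicker" when given d <= 0,
// so clamp to the 5s default before constructing the ticker.
func newSafeTicker(d time.Duration) *time.Ticker {
	if d <= 0 {
		d = defaultInterval
	}
	return time.NewTicker(d)
}
```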
We noticed that there may be a race between writing and closing the file and shutting down the host abruptly.

Signed-off-by: alon <alon@lightbitslabs.com>
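One common way to harden such a write path is to fsync before closing; a minimal sketch, not necessarily the exact fix applied here:

```go
package fileutil

import "os"

// writeFileDurably (sketch): write data and flush it to stable storage
// before closing, so an abrupt host shutdown is less likely to leave a
// torn or empty file behind.
func writeFileDurably(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // fsync: push data to disk
		f.Close()
		return err
	}
	return f.Close()
}
```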
f45b96f to 8d61b4e (Compare)
We do not need to depend on any external repo's version. This is an independent repo that needs to control its own version.

Signed-off-by: alon <alon@lightbitslabs.com>
8d61b4e to 2f09aae (Compare)
@yogev-lb I don't see a +1 on this from anyone, myself included... I'm happy to re-review and +1 if warranted, but it's better if it comes from someone else who is familiar with the code you're changing.
```go
	Short:             "Start NVMeOF Discovery Client",
	Long:              ``,
	DisableAutoGenTag: true,
	SilenceUsage:      true,
```
for some reason the discovery-client tests don't run as part of this PR, or am I missing something? At some point 2 months ago I think you said something about a new PR check suite for this application
@yogev-lb see the README.md for how to run PR checks on this (public) repo.
@yogev-lb ran PR check here: https://github.com/lightbitslabs/lbcitests/actions/runs/13175771922



Till now we didn't support this flag in the DC configuration; it was added only on the connect command line but not as a service. We needed to update the file parser, update the cache, and pass this information to the ConnectAll command from the service.
This issue was detected when working on DMS.
We noticed in negative tests that when we kill the node of the volume we are using (the volume becomes unavailable), we keep trying to reconnect forever (until the volume becomes available again).

This is not good behavior for us because we want to fail fast (~1m) and decide whether to proceed or not. (We, as opposed to other customers, don't really care about the data: we can always retry the clone, and we don't support data resume yet, so we can't retry when it becomes available.) It surprised me that this feature is not working, since I remember that we told many customers (including @yant-lb in the LAB) that it does.
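For illustration only, a discovery-client entry that enables the timeout might look like the following; the address and NQN are made up, and the exact entry format should be checked against the file parser changes above (which accept `-l` / `--ctrl-loss-tmo`):

```
-t tcp -a 10.10.10.10 -s 8009 -q nqn.2014-08.org.nvmexpress:uuid:example-host --ctrl-loss-tmo 60
```

With `--ctrl-loss-tmo 60` the host gives up reconnecting after 60 seconds instead of retrying forever; `-1` (the default) disables the timeout.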
see:
Issue: LBM1-36497