fix(ota-hil): set recovery mode before capture logs#903
fix(ota-hil): set recovery mode before capture logs#903chrisgalanis wants to merge 89 commits intomainfrom
Conversation
hil/src/commands/ota/mod.rs
Outdated
There was a problem hiding this comment.
i think we can remove comments like this one, they don't bring much value
| info!("Overlays wiped successfully, rebooting device"); | ||
| info!("Overlays wiped successfully"); | ||
|
|
||
| if !self.skip_time_sync_before_reboot { |
There was a problem hiding this comment.
why we need this I wonder ?
There was a problem hiding this comment.
So, basically the problem was with the worldcoin-key-retrival service. With pearl, teh harware clock is not working so if it NTP server is not synced then worldcoin-attest is failing and also key-retrieval. This should be fixed with new key-retrieval fix, that is why i ketp it as a flag, so we can test older commits, that don't have key-retrival fixed
There was a problem hiding this comment.
I am also not getting it sorry.
With pearl, teh harware clock is not working so if it NTP server is not synced then worldcoin-attest is failing and also key-retrieval.
That is not true. The clock will synchronize and https requests is going to succeed. And the delay is not big for that
There was a problem hiding this comment.
By default this is not correct. Because reboot is manually triggered faster than the NTP server syncrhonizes the clock....
hil/src/commands/ota/reboot.rs
Outdated
There was a problem hiding this comment.
timings like this might be weak spot and break often - but if works now i am fine
There was a problem hiding this comment.
I have tested and look reliable enough. Even though if we don't ahve logs in the future we will now the root of failure and fix it.
There was a problem hiding this comment.
Better then make it a constant
hil/src/commands/ota/reboot.rs
Outdated
There was a problem hiding this comment.
i think it's bad practice to have static in the end of the name
| } | ||
|
|
||
| if start_time.elapsed() >= timeout && !found_login_prompt { | ||
| if found_login_prompt { |
There was a problem hiding this comment.
where do we set timeout for this?
There was a problem hiding this comment.
The timeout is the basically if the login prompth is fouund or not. If login prompt is never found, then the timeout of ssh will fail later so the process will exit with error
hil/src/commands/ota/system.rs
Outdated
There was a problem hiding this comment.
better to define it in top of the function for readibility
hil/src/boot.rs
Outdated
There was a problem hiding this comment.
move to the top of the file or maybe have as separate arg to cli so you don't tune it in sources every time?
|
|
||
| use super::Ota; | ||
|
|
||
| const DELAY_CAPTURE_LOGS: u64 = 200; |
There was a problem hiding this comment.
nit: better to specify MS here as it's not clear what 200 is (seconds, ms, etc.)
|
|
||
| const MAX_ATTEMPTS: u32 = 60; // 60 attempts = 2 minutes max wait | ||
| const SLEEP_DURATION: Duration = Duration::from_secs(2); | ||
| // Timeout for individual command execution (10 seconds is generous for timedatectl/chronyc) |
There was a problem hiding this comment.
nit: (10 seconds is generous for timedatectl/chronyc) is not helpful as a comment
hil/src/boot.rs
Outdated
There was a problem hiding this comment.
You are not using the const here
There was a problem hiding this comment.
this is only blocknig comment - everything else is great job!
c044b7c to
2c89aa4
Compare
6bed7d8 to
65dacd1
Compare
This pull request introduces several improvements and enhancements to the OTA update and reboot process, focusing on reliability, observability, and configurability. Key changes include more robust time synchronization checks, improved handling of boot log capture, and better control over the reboot sequence to ensure devices do not inadvertently enter recovery mode.
OTA Update Flow Improvements:
--skip-time-sync-before-rebootflag to allow skipping the NTP time synchronization check before the first reboot, with clear logging and fallback behavior. [1] [2]Reboot and Recovery Handling:
Boot Log Capture Enhancements:
Time Synchronization Robustness:
timedatectlorchronyc, with command timeouts and more robust output parsing. If neither tool is found, a clear error is raised.Utility and Observability Improvements:
These changes collectively improve the reliability, observability, and flexibility of the OTA update and device reboot process, making it easier to diagnose issues and adapt to different environments.
OTA update and logging enhancements:
--skip-time-sync-before-rebootflag to optionally skip NTP check before the first reboot, with clear logging and fallback. [1] [2]Reboot sequence and recovery pin handling:
Boot log capture improvements:
Time synchronization checks:
timedatectlorchronycfor time sync, with timeouts and robust output parsing. Raises error if neither is found.Utility and reliability improvements: