chore(conductor): Archive completed tracks
@@ -0,0 +1,5 @@
# Track fix_odroid8_and_script_20260208 Context

- [Specification](./spec.md)
- [Implementation Plan](./plan.md)
- [Metadata](./metadata.json)
@@ -0,0 +1,8 @@
{
  "track_id": "fix_odroid8_and_script_20260208",
  "type": "bug",
  "status": "new",
  "created_at": "2026-02-08T17:00:00Z",
  "updated_at": "2026-02-08T17:00:00Z",
  "description": "fix odroid8 node - stuck in a loop because of a LiteFS Cluster ID mismatch in the Consul lease. Also fix script execution errors."
}
26
conductor/archive/fix_odroid8_and_script_20260208/plan.md
Normal file
@@ -0,0 +1,26 @@
# Plan: Fix Odroid8 and Script Robustness (`fix_odroid8_and_script`)

## Phase 1: Script Robustness [x] [checkpoint: 860000b]
- [x] Task: Update `nomad_client.py` to handle subprocess errors gracefully
    - [x] Write tests for handling Nomad CLI absence/failure
    - [x] Update implementation to return descriptive error objects or `None` without crashing
- [x] Task: Update aggregator and formatter to handle Nomad errors
    - [x] Update `cluster_aggregator.py` to gracefully skip Nomad calls if they fail
    - [x] Update `output_formatter.py` to display "Nomad Error" in relevant cells
    - [x] Add a global "Nomad Connectivity Warning" to the summary
- [x] Task: Conductor - User Manual Verification 'Phase 1: Script Robustness' (Protocol in workflow.md)

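The formatter tasks in Phase 1 could look roughly like the sketch below. It is illustrative only, not the actual `output_formatter.py`: the function name `format_rows`, the dictionary keys `name`, `alloc_id`, and `logs`, and the exact warning wording are assumptions. It also assumes the aggregator stores `None` for any field whose Nomad call failed.

```python
NOMAD_ERROR = "Nomad Error"
# Hypothetical wording; the plan only names a "Nomad Connectivity Warning".
NOMAD_WARNING = (
    "Nomad Connectivity Warning: some Nomad calls failed; verify NOMAD_ADDR."
)


def format_rows(nodes):
    """Render one text row per node and append a global warning if needed.

    `nodes` is a list of dicts with a 'name' key plus Nomad-derived
    'alloc_id' and 'logs' keys (hypothetical names). A value of None is
    assumed to mean the corresponding Nomad call failed.
    """
    lines = []
    nomad_failed = False
    for node in nodes:
        cells = [node["name"]]
        for key in ("alloc_id", "logs"):
            value = node.get(key)
            if value is None:
                # Mark the cell instead of crashing or printing a traceback.
                nomad_failed = True
                value = NOMAD_ERROR
            cells.append(value)
        lines.append(" | ".join(cells))
    if nomad_failed:
        # Emit the warning once for the whole summary, not once per row.
        lines.append(NOMAD_WARNING)
    return "\n".join(lines)
```

Keeping the warning out of the per-row loop means a single lost connection produces one summary line rather than repeating the same hint next to every affected node.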
## Phase 2: Odroid8 Recovery [ ]
- [x] Task: Identify and verify `odroid8` LiteFS data path
    - [x] Run `nomad alloc status` to find the volume mount for `odroid8`
    - [x] Provide the user with the exact host path to the LiteFS data
- [x] Task: Guide user through manual cleanup
    - [x] Provide steps to stop the allocation
    - [x] Provide the `rm` command to clear the LiteFS metadata
    - [x] Provide steps to restart and verify the node
- [~] Task: Conductor - User Manual Verification 'Phase 2: Odroid8 Recovery' (Protocol in workflow.md)

## Phase 3: Final Verification [x]
- [x] Task: Final verification run of the script
- [x] Task: Verify cluster health in Consul and LiteFS API
- [x] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)
27
conductor/archive/fix_odroid8_and_script_20260208/spec.md
Normal file
@@ -0,0 +1,27 @@
# Specification: Fix Odroid8 and Script Robustness (`fix_odroid8_and_script`)

## Overview
Address the "critical" loop on node `odroid8` caused by a LiteFS Cluster ID mismatch and improve the `cluster_status` script's error handling when the Nomad CLI is unavailable or misconfigured.

## Functional Requirements
- **Node Recovery (`odroid8`):**
    - Identify the specific LiteFS data directory on `odroid8` (usually `/var/lib/litefs` inside the container, mapped to a host path).
    - Guide the user through stopping the allocation and wiping the metadata/data to resolve the Consul lease conflict.
- **Script Robustness:**
    - Update `nomad_client.py` to handle `subprocess` failures more gracefully.
    - If a `nomad` command fails, the script should not print a traceback or confusing "non-zero exit status" messages to the primary output.
    - Instead, it should log the error to `stderr` and continue, marking the affected fields (like logs or full ID) as "Nomad Error".
    - Add a clear warning in the output if Nomad connectivity is lost, suggesting the user verify `NOMAD_ADDR`.

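The script-robustness requirements above can be sketched as a small wrapper around `subprocess.run`. This is a minimal illustration, not the actual `nomad_client.py`: the names `run_nomad`, `field_or_error`, `nomad_bin`, and `timeout` are hypothetical, and returning `None` on failure is just one of the two options the plan mentions.

```python
import subprocess
import sys

NOMAD_ERROR = "Nomad Error"  # marker text named in the requirements


def run_nomad(args, nomad_bin="nomad", timeout=10):
    """Run a Nomad CLI command and return its stdout, or None on failure.

    Failures are reported on stderr so the primary output stays clean.
    `nomad_bin` and `timeout` are hypothetical parameters for this sketch.
    """
    try:
        result = subprocess.run(
            [nomad_bin, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
    except OSError as exc:
        # Covers a missing or non-executable binary (FileNotFoundError etc.).
        print(f"nomad: cannot run '{nomad_bin}': {exc}", file=sys.stderr)
    except subprocess.TimeoutExpired:
        print(f"nomad: command timed out after {timeout}s", file=sys.stderr)
    except subprocess.CalledProcessError as exc:
        # Prefer the CLI's own stderr text over a bare exit status.
        detail = (exc.stderr or "").strip() or f"exit status {exc.returncode}"
        print(f"nomad: command failed: {detail}", file=sys.stderr)
    else:
        return result.stdout
    return None


def field_or_error(value):
    """Map a failed lookup (None) to the display marker."""
    return NOMAD_ERROR if value is None else value
```

A caller might then write `full_id = field_or_error(run_nomad(["alloc", "status", alloc_id]))`, where `alloc_id` is whatever identifier the script already holds, so a failed call yields "Nomad Error" in the output instead of a traceback.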
## Non-Functional Requirements
- **Reliability:** The script should remain functional even if one of the underlying tools (Nomad CLI) is broken.
- **Ease of Use:** Provide clear, copy-pasteable commands for the manual cleanup process.

## Acceptance Criteria
- [ ] `odroid8` node successfully joins the cluster and shows as `passing` in Consul.
- [ ] `cluster_status` script runs without error even if the `nomad` binary is missing or cannot connect to the server (showing fallback info).
- [ ] Script provides a helpful message when `nomad` commands fail.

## Out of Scope
- Fixing the Navidrome database path issue (this will be handled in a separate track once the cluster is stable).
- Automating the host-level cleanup (manual guidance only).