diff --git a/conductor/archive/diagnose_and_enhance_20260208/index.md b/conductor/archive/diagnose_and_enhance_20260208/index.md
new file mode 100644
index 0000000..978a9e6
--- /dev/null
+++ b/conductor/archive/diagnose_and_enhance_20260208/index.md
@@ -0,0 +1,5 @@
+# Track diagnose_and_enhance_20260208 Context
+
+- [Specification](./spec.md)
+- [Implementation Plan](./plan.md)
+- [Metadata](./metadata.json)
diff --git a/conductor/archive/diagnose_and_enhance_20260208/metadata.json b/conductor/archive/diagnose_and_enhance_20260208/metadata.json
new file mode 100644
index 0000000..f3255f9
--- /dev/null
+++ b/conductor/archive/diagnose_and_enhance_20260208/metadata.json
@@ -0,0 +1,8 @@
+{
+ "track_id": "diagnose_and_enhance_20260208",
+ "type": "feature",
+ "status": "new",
+ "created_at": "2026-02-08T16:00:00Z",
+ "updated_at": "2026-02-08T16:00:00Z",
+ "description": "diagnose why odroid8 shows as critical - why is the script not showing uptime or replication lag?"
+}
diff --git a/conductor/archive/diagnose_and_enhance_20260208/plan.md b/conductor/archive/diagnose_and_enhance_20260208/plan.md
new file mode 100644
index 0000000..edfa0cf
--- /dev/null
+++ b/conductor/archive/diagnose_and_enhance_20260208/plan.md
@@ -0,0 +1,30 @@
+# Plan: Cluster Diagnosis and Script Enhancement (`diagnose_and_enhance`)
+
+## Phase 1: Enhanced Diagnostics (Consul) [x] [checkpoint: a686c5b]
+- [x] Task: Update `consul_client.py` to fetch detailed health check output
+ - [x] Write tests for fetching `Output` field from Consul checks
+ - [x] Implement logic to extract and store the `Output` (error message)
+- [x] Task: Update aggregator and formatter to display Consul errors
+ - [x] Update aggregation logic to include `consul_error`
+ - [x] Update table formatter to indicate an error (maybe a flag or color)
+ - [x] Add a "Diagnostics" section to the output to print full error details
+- [x] Task: Conductor - User Manual Verification 'Phase 1: Enhanced Diagnostics (Consul)' (Protocol in workflow.md)
+
+## Phase 2: Nomad Integration and Logs [x] [checkpoint: 6d77729]
+- [x] Task: Implement `nomad_client.py` wrapper
+ - [x] Write tests for `get_allocation_logs`, `get_node_status`, and `restart_allocation` (mocking subprocess)
+ - [x] Implement `subprocess.run(["nomad", ...])` logic to fetch logs and restart allocations
+- [x] Task: Integrate Nomad logs into diagnosis
+ - [x] Update aggregator to call Nomad client for critical nodes
+ - [x] Update "Diagnostics" section to display the last 20 lines of stderr
+- [x] Task: Conductor - User Manual Verification 'Phase 2: Nomad Integration and Logs' (Protocol in workflow.md)
+
+## Phase 3: Advanced LiteFS Status [ ]
+- [x] Task: Implement `litefs_status` via `nomad alloc exec`
+ - [x] Write tests for executing remote commands via Nomad
+ - [x] Update `litefs_client.py` to fall back to `nomad alloc exec` if HTTP fails
+ - [x] Parse `litefs status` output (text/json) to extract uptime and replication lag
+- [x] Task: Final Polish and Diagnosis Run
+ - [x] Ensure all pieces work together
+ - [x] Run the script to diagnose `odroid8`
+- [~] Task: Conductor - User Manual Verification 'Phase 3: Advanced LiteFS Status' (Protocol in workflow.md)
diff --git a/conductor/archive/diagnose_and_enhance_20260208/spec.md b/conductor/archive/diagnose_and_enhance_20260208/spec.md
new file mode 100644
index 0000000..0095b27
--- /dev/null
+++ b/conductor/archive/diagnose_and_enhance_20260208/spec.md
@@ -0,0 +1,32 @@
+# Specification: Cluster Diagnosis and Script Enhancement (`diagnose_and_enhance`)
+
+## Overview
+Investigate the root cause of the "critical" status on node `odroid8` and enhance the `cluster_status` script to provide deeper diagnostic information, including Consul health check details, Nomad logs, and LiteFS status via `nomad alloc exec`.
+
+## Functional Requirements
+- **Consul Diagnostics:**
+ - Update `consul_client.py` to fetch the `Output` field from Consul health checks.
+ - Display the specific error message from Consul when a node is in a non-passing state.
+- **Nomad Integration:**
+ - Implement a `nomad_client.py` (or similar) to interact with the Nomad CLI.
+ - **Log Retrieval:** For nodes with health issues, fetch the last 20 lines of `stderr` from the relevant Nomad allocation.
+ - **Internal Status:** Implement a method to run `litefs status` inside the container using `nomad alloc exec` to retrieve uptime and replication lag when the HTTP API is unavailable.
+- **LiteFS API Investigation:**
+ - Investigate why the `/status` endpoint returns a 404 and attempt to resolve it via configuration or by identifying the correct endpoint for LiteFS 0.5.
+- **Output Formatting:**
+ - Update the CLI table to display the retrieved uptime and replication lag.
+ - Add a "Diagnostics" section at the bottom of the output when errors are detected, showing the Consul check output and Nomad logs.
+
+## Non-Functional Requirements
+- **CLI Dependency:** The script assumes the `nomad` CLI is installed and configured in the local system's PATH.
+- **Language:** Python 3.x.
+
+## Acceptance Criteria
+- [ ] Script displays the specific reason for `odroid8`'s critical status from Consul.
+- [ ] Script successfully retrieves and displays Nomad stderr logs for failing nodes.
+- [ ] Script displays LiteFS uptime and replication lag (retrieved via `alloc exec` or HTTP).
+- [ ] The root cause of `odroid8`'s failure is identified and reported.
+
+## Out of Scope
+- Automatic fixing of the identified issues (diagnosis only).
+- Direct SSH into nodes (using Nomad/Consul APIs/CLIs only).
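Review note on the spec above: the Consul Diagnostics and Log Retrieval requirements (surface the `Output` of non-passing checks, show the last 20 lines of stderr) could be sketched roughly as follows. This is a minimal illustration, not the contents of `consul_client.py`. The function names `failing_check_outputs` and `tail_lines` are hypothetical, and the sketch assumes the checks have already been fetched and parsed from Consul's health API (for example `/v1/health/node/<node>`), whose entries carry `Name`, `Status`, and `Output` fields.

```python
from __future__ import annotations

from typing import Any


def failing_check_outputs(checks: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Return name, status, and error text for every check that is not passing.

    `checks` is assumed to be the parsed JSON list returned by a Consul
    health endpoint; missing fields are tolerated rather than raising.
    """
    failures = []
    for check in checks:
        status = check.get("Status", "unknown")
        if status == "passing":
            continue
        failures.append(
            {
                "name": check.get("Name") or check.get("CheckID", "?"),
                "status": status,
                # `consul_error` mirrors the field name used in the plan.
                "consul_error": (check.get("Output") or "").strip(),
            }
        )
    return failures


def tail_lines(text: str, n: int = 20) -> str:
    """Keep only the last `n` lines of a log, e.g. captured stderr."""
    if n <= 0:
        return ""
    return "\n".join(text.splitlines()[-n:])
```

Keeping both helpers free of network and subprocess calls makes them easy to unit test with canned payloads, which matches the plan's "write tests first" tasks.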
diff --git a/conductor/archive/fix_odroid8_and_script_20260208/index.md b/conductor/archive/fix_odroid8_and_script_20260208/index.md
new file mode 100644
index 0000000..9ec7f3c
--- /dev/null
+++ b/conductor/archive/fix_odroid8_and_script_20260208/index.md
@@ -0,0 +1,5 @@
+# Track fix_odroid8_and_script_20260208 Context
+
+- [Specification](./spec.md)
+- [Implementation Plan](./plan.md)
+- [Metadata](./metadata.json)
diff --git a/conductor/archive/fix_odroid8_and_script_20260208/metadata.json b/conductor/archive/fix_odroid8_and_script_20260208/metadata.json
new file mode 100644
index 0000000..1e24775
--- /dev/null
+++ b/conductor/archive/fix_odroid8_and_script_20260208/metadata.json
@@ -0,0 +1,8 @@
+{
+ "track_id": "fix_odroid8_and_script_20260208",
+ "type": "bug",
+ "status": "new",
+ "created_at": "2026-02-08T17:00:00Z",
+ "updated_at": "2026-02-08T17:00:00Z",
+ "description": "fix odroid8 node - stuck in a loop because of a LiteFS Cluster ID mismatch in the Consul lease. Also fix script execution errors."
+}
diff --git a/conductor/archive/fix_odroid8_and_script_20260208/plan.md b/conductor/archive/fix_odroid8_and_script_20260208/plan.md
new file mode 100644
index 0000000..96565c0
--- /dev/null
+++ b/conductor/archive/fix_odroid8_and_script_20260208/plan.md
@@ -0,0 +1,26 @@
+# Plan: Fix Odroid8 and Script Robustness (`fix_odroid8_and_script`)
+
+## Phase 1: Script Robustness [x] [checkpoint: 860000b]
+- [x] Task: Update `nomad_client.py` to handle subprocess errors gracefully
+ - [x] Write tests for handling Nomad CLI absence/failure
+ - [x] Update implementation to return descriptive error objects or `None` without crashing
+- [x] Task: Update aggregator and formatter to handle Nomad errors
+ - [x] Update `cluster_aggregator.py` to gracefully skip Nomad calls if they fail
+ - [x] Update `output_formatter.py` to display "Nomad Error" in relevant cells
+ - [x] Add a global "Nomad Connectivity Warning" to the summary
+- [x] Task: Conductor - User Manual Verification 'Phase 1: Script Robustness' (Protocol in workflow.md)
+
+## Phase 2: Odroid8 Recovery [ ]
+- [x] Task: Identify and verify `odroid8` LiteFS data path
+ - [x] Run `nomad alloc status` to find the volume mount for `odroid8`
+ - [x] Provide the user with the exact host path to the LiteFS data
+- [x] Task: Guide user through manual cleanup
+ - [x] Provide steps to stop the allocation
+ - [x] Provide the `rm` command to clear the LiteFS metadata
+ - [x] Provide steps to restart and verify the node
+- [~] Task: Conductor - User Manual Verification 'Phase 2: Odroid8 Recovery' (Protocol in workflow.md)
+
+## Phase 3: Final Verification [x]
+- [x] Task: Final verification run of the script
+- [x] Task: Verify cluster health in Consul and LiteFS API
+- [x] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)
diff --git a/conductor/archive/fix_odroid8_and_script_20260208/spec.md b/conductor/archive/fix_odroid8_and_script_20260208/spec.md
new file mode 100644
index 0000000..f0aa7bb
--- /dev/null
+++ b/conductor/archive/fix_odroid8_and_script_20260208/spec.md
@@ -0,0 +1,27 @@
+# Specification: Fix Odroid8 and Script Robustness (`fix_odroid8_and_script`)
+
+## Overview
+Address the "critical" loop on node `odroid8` caused by a LiteFS Cluster ID mismatch and improve the `cluster_status` script's error handling when the Nomad CLI is unavailable or misconfigured.
+
+## Functional Requirements
+- **Node Recovery (`odroid8`):**
+ - Identify the specific LiteFS data directory on `odroid8` (usually `/var/lib/litefs` inside the container, mapped to a host path).
+ - Guide the user through stopping the allocation and wiping the metadata/data to resolve the Consul lease conflict.
+- **Script Robustness:**
+ - Update `nomad_client.py` to handle `subprocess` failures more gracefully.
+ - If a `nomad` command fails, the script should not print a traceback or confusing "non-zero exit status" messages to the primary output.
+ - Instead, it should log the error to `stderr` and continue, marking the affected fields (like logs or full ID) as "Nomad Error".
+ - Add a clear warning in the output if Nomad connectivity is lost, suggesting the user verify `NOMAD_ADDR`.
+
+## Non-Functional Requirements
+- **Reliability:** The script should remain functional even if one of the underlying tools (Nomad CLI) is broken.
+- **Ease of Use:** Provide clear, copy-pasteable commands for the manual cleanup process.
+
+## Acceptance Criteria
+- [ ] `odroid8` node successfully joins the cluster and shows as `passing` in Consul.
+- [ ] `cluster_status` script runs without error even if the `nomad` binary is missing or cannot connect to the server (showing fallback info).
+- [ ] Script provides a helpful message when `nomad` commands fail.
+
+## Out of Scope
+- Fixing the Navidrome database path issue (this will be handled in a separate track once the cluster is stable).
+- Automating the host-level cleanup (manual guidance only).
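Review note on the Script Robustness requirement above: the behaviour it describes (no traceback on the primary output, error logged to `stderr`, affected cells marked "Nomad Error") might look something like the sketch below. It is an assumption-laden illustration rather than the actual `nomad_client.py`. The names `run_cli` and `cell_value` are hypothetical, and the 10-second timeout is an arbitrary choice.

```python
from __future__ import annotations

import subprocess
import sys


def run_cli(args: list[str], timeout: float = 10.0) -> str | None:
    """Run a CLI command and return its stdout, or None on any failure.

    Failures are reported on stderr so the primary output stays clean.
    """
    try:
        result = subprocess.run(
            args, capture_output=True, text=True, check=True, timeout=timeout
        )
    except OSError as exc:
        # Covers a missing binary (FileNotFoundError) and permission problems.
        print(f"could not start {args[0]!r}: {exc}", file=sys.stderr)
    except subprocess.TimeoutExpired:
        print(f"{args[0]!r} timed out after {timeout}s", file=sys.stderr)
    except subprocess.CalledProcessError as exc:
        detail = (exc.stderr or "").strip()
        print(f"{args[0]!r} exited with {exc.returncode}: {detail}", file=sys.stderr)
    else:
        return result.stdout
    return None


def cell_value(output: str | None) -> str:
    """Map a failed call to the "Nomad Error" marker used in the table."""
    return "Nomad Error" if output is None else output.strip()
```

A caller such as the aggregator could then pass something like `["nomad", "alloc", "logs", "-stderr", alloc_id]` and, when `None` comes back, emit the connectivity warning suggesting the user verify `NOMAD_ADDR`.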
diff --git a/conductor/tracks.md b/conductor/tracks.md
index dee3a3e..0c684b3 100644
--- a/conductor/tracks.md
+++ b/conductor/tracks.md
@@ -1,14 +1,4 @@
 # Project Tracks
 
 This file tracks all major tracks for the project. Each track has its own detailed plan in its respective folder.
----
-
----
-
-- [~] **Track: diagnose why odroid8 shows as critical - why is the script not showing uptime or replication lag?**
-*Link: [./tracks/diagnose_and_enhance_20260208/](./tracks/diagnose_and_enhance_20260208/)*
-
----
-
-- [x] **Track: fix odroid8 node - stuck in a loop because of a LiteFS Cluster ID mismatch in the Consul lease. Also fix script execution errors.**
-*Link: [./tracks/fix_odroid8_and_script_20260208/](./tracks/fix_odroid8_and_script_20260208/)*
+---
\ No newline at end of file