chore(conductor): Archive completed tracks

2026-02-08 13:34:48 -08:00
parent e9c52d2847
commit ee297f8d8e
9 changed files with 142 additions and 11 deletions


@@ -0,0 +1,5 @@
# Track diagnose_and_enhance_20260208 Context
- [Specification](./spec.md)
- [Implementation Plan](./plan.md)
- [Metadata](./metadata.json)


@@ -0,0 +1,8 @@
{
"track_id": "diagnose_and_enhance_20260208",
"type": "feature",
"status": "new",
"created_at": "2026-02-08T16:00:00Z",
"updated_at": "2026-02-08T16:00:00Z",
"description": "diagnose why odroid8 shows as critical - why is the script not showing uptime or replication lag?"
}


@@ -0,0 +1,30 @@
# Plan: Cluster Diagnosis and Script Enhancement (`diagnose_and_enhance`)
## Phase 1: Enhanced Diagnostics (Consul) [x] [checkpoint: a686c5b]
- [x] Task: Update `consul_client.py` to fetch detailed health check output (see the sketch after this phase)
- [x] Write tests for fetching `Output` field from Consul checks
- [x] Implement logic to extract and store the `Output` (error message)
- [x] Task: Update aggregator and formatter to display Consul errors
- [x] Update aggregation logic to include `consul_error`
- [x] Update table formatter to indicate an error (maybe a flag or color)
- [x] Add a "Diagnostics" section to the output to print full error details
- [x] Task: Conductor - User Manual Verification 'Phase 1: Enhanced Diagnostics (Consul)' (Protocol in workflow.md)
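A minimal sketch of the Phase 1 idea above, assuming the standard Consul HTTP health endpoint and the `requests` library; the address constant, function name, and return shape are illustrative, not the actual `consul_client.py` interface.

```python
import requests

CONSUL_ADDR = "http://127.0.0.1:8500"  # assumption: default local Consul agent

def get_failing_checks(node: str) -> list:
    """Return non-passing checks for `node`, keeping the `Output` error text."""
    resp = requests.get(f"{CONSUL_ADDR}/v1/health/node/{node}", timeout=5)
    resp.raise_for_status()
    return [
        {
            "check": check.get("Name"),
            "status": check.get("Status"),
            "output": check.get("Output", ""),  # error detail surfaced later in Diagnostics
        }
        for check in resp.json()
        if check.get("Status") != "passing"
    ]

if __name__ == "__main__":
    for failing in get_failing_checks("odroid8"):
        print(f"{failing['check']} [{failing['status']}]: {failing['output']}")
```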
## Phase 2: Nomad Integration and Logs [x] [checkpoint: 6d77729]
- [x] Task: Implement `nomad_client.py` wrapper (see the sketch after this phase)
- [x] Write tests for `get_allocation_logs`, `get_node_status`, and `restart_allocation` (mocking subprocess)
- [x] Implement `subprocess.run(["nomad", ...])` logic to fetch logs and restart allocations
- [x] Task: Integrate Nomad logs into diagnosis
- [x] Update aggregator to call Nomad client for critical nodes
- [x] Update "Diagnostics" section to display the last 20 lines of stderr
- [x] Task: Conductor - User Manual Verification 'Phase 2: Nomad Integration and Logs' (Protocol in workflow.md)
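A sketch of the Phase 2 log-retrieval wrapper, using the stock `nomad alloc logs` flags; the function name and the None-on-failure convention are assumptions, not the actual `nomad_client.py` API.

```python
import subprocess
from typing import Optional

def get_allocation_stderr(alloc_id: str, lines: int = 20) -> Optional[str]:
    """Return the last `lines` lines of an allocation's stderr, or None on failure."""
    cmd = ["nomad", "alloc", "logs", "-stderr", "-tail", "-n", str(lines), alloc_id]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None  # nomad missing, unreachable, or timed out
    return result.stdout
```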
## Phase 3: Advanced LiteFS Status [ ]
- [x] Task: Implement `litefs_status` via `nomad alloc exec` (see the sketch after this phase)
- [x] Write tests for executing remote commands via Nomad
- [x] Update `litefs_client.py` to fall back to `nomad alloc exec` if HTTP fails
- [x] Parse `litefs status` output (text/json) to extract uptime and replication lag
- [x] Task: Final Polish and Diagnosis Run
- [x] Ensure all pieces work together
- [x] Run the script to diagnose `odroid8`
- [~] Task: Conductor - User Manual Verification 'Phase 3: Advanced LiteFS Status' (Protocol in workflow.md)
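The Phase 3 fallback could look roughly like the sketch below: executing the in-container status command through `nomad alloc exec` when the HTTP endpoint is unavailable. The task name, the exact `litefs` invocation, and the parsing left to the caller are assumptions taken from this plan, not verified against LiteFS 0.5.

```python
import subprocess
from typing import Optional

def litefs_status_via_exec(alloc_id: str, task: str = "litefs") -> Optional[str]:
    """Run the in-container status command and return its raw output, or None on failure."""
    # Assumption: the task is named "litefs" and exposes a `litefs status` command.
    cmd = ["nomad", "alloc", "exec", "-task", task, alloc_id, "litefs", "status"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout  # caller parses uptime and replication lag from this text
```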


@@ -0,0 +1,32 @@
# Specification: Cluster Diagnosis and Script Enhancement (`diagnose_and_enhance`)
## Overview
Investigate the root cause of the "critical" status on node `odroid8` and enhance the `cluster_status` script to provide deeper diagnostic information, including Consul health check details, Nomad logs, and LiteFS status via `nomad alloc exec`.
## Functional Requirements
- **Consul Diagnostics:**
- Update `consul_client.py` to fetch the `Output` field from Consul health checks.
- Display the specific error message from Consul when a node is in a non-passing state.
- **Nomad Integration:**
- Implement a `nomad_client.py` (or similar) to interact with the Nomad CLI.
- **Log Retrieval:** For nodes with health issues, fetch the last 20 lines of `stderr` from the relevant Nomad allocation.
- **Internal Status:** Implement a method to run `litefs status` inside the container using `nomad alloc exec` to retrieve uptime and replication lag when the HTTP API is unavailable.
- **LiteFS API Investigation:**
- Investigate why the `/status` endpoint returns a 404 and attempt to resolve it via configuration or by identifying the correct endpoint for LiteFS 0.5.
- **Output Formatting:**
- Update the CLI table to display the retrieved uptime and replication lag.
- Add a "Diagnostics" section at the bottom of the output when errors are detected, showing the Consul check output and Nomad logs.
## Non-Functional Requirements
- **CLI Dependency:** The script assumes the `nomad` CLI is installed and configured in the local system's PATH.
- **Language:** Python 3.x.
## Acceptance Criteria
- [ ] Script displays the specific reason for `odroid8`'s critical status from Consul.
- [ ] Script successfully retrieves and displays Nomad stderr logs for failing nodes.
- [ ] Script displays LiteFS uptime and replication lag (retrieved via `alloc exec` or HTTP).
- [ ] The root cause of `odroid8`'s failure is identified and reported.
## Out of Scope
- Automatic fixing of the identified issues (diagnosis only).
- Direct SSH into nodes (using Nomad/Consul APIs/CLIs only).


@@ -0,0 +1,5 @@
# Track fix_odroid8_and_script_20260208 Context
- [Specification](./spec.md)
- [Implementation Plan](./plan.md)
- [Metadata](./metadata.json)


@@ -0,0 +1,8 @@
{
"track_id": "fix_odroid8_and_script_20260208",
"type": "bug",
"status": "new",
"created_at": "2026-02-08T17:00:00Z",
"updated_at": "2026-02-08T17:00:00Z",
"description": "fix odroid8 node - stuck in a loop because of a LiteFS Cluster ID mismatch in the Consul lease. Also fix script execution errors."
}


@@ -0,0 +1,26 @@
# Plan: Fix Odroid8 and Script Robustness (`fix_odroid8_and_script`)
## Phase 1: Script Robustness [x] [checkpoint: 860000b]
- [x] Task: Update `nomad_client.py` to handle subprocess errors gracefully (see the sketch after this phase)
- [x] Write tests for handling Nomad CLI absence/failure
- [x] Update implementation to return descriptive error objects or `None` without crashing
- [x] Task: Update aggregator and formatter to handle Nomad errors
- [x] Update `cluster_aggregator.py` to gracefully skip Nomad calls if they fail
- [x] Update `output_formatter.py` to display "Nomad Error" in relevant cells
- [x] Add a global "Nomad Connectivity Warning" to the summary
- [x] Task: Conductor - User Manual Verification 'Phase 1: Script Robustness' (Protocol in workflow.md)
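A sketch of the fail-soft behavior Phase 1 aims for: failures are logged to stderr and a None result lets the formatter print "Nomad Error" instead of a traceback. The function and sentinel names are illustrative assumptions.

```python
import subprocess
import sys
from typing import Optional

NOMAD_ERROR = "Nomad Error"  # sentinel rendered in affected table cells

def run_nomad(args: list) -> Optional[str]:
    """Run a nomad CLI command; log failures to stderr and return None instead of raising."""
    try:
        result = subprocess.run(
            ["nomad", *args], capture_output=True, text=True, check=True, timeout=30
        )
    except FileNotFoundError:
        print("warning: nomad CLI not found in PATH", file=sys.stderr)
        return None
    except subprocess.TimeoutExpired:
        print("warning: nomad command timed out; verify NOMAD_ADDR", file=sys.stderr)
        return None
    except subprocess.CalledProcessError as exc:
        print(f"warning: nomad {' '.join(args)} failed: {exc.stderr.strip()}", file=sys.stderr)
        return None
    return result.stdout
```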
## Phase 2: Odroid8 Recovery [ ]
- [x] Task: Identify and verify `odroid8` LiteFS data path
- [x] Run `nomad alloc status` to find the volume mount for `odroid8`
- [x] Provide the user with the exact host path to the LiteFS data
- [x] Task: Guide user through manual cleanup
- [x] Provide steps to stop the allocation
- [x] Provide the `rm` command to clear the LiteFS metadata
- [x] Provide steps to restart and verify the node
- [~] Task: Conductor - User Manual Verification 'Phase 2: Odroid8 Recovery' (Protocol in workflow.md)
## Phase 3: Final Verification [x]
- [x] Task: Final verification run of the script
- [x] Task: Verify cluster health in Consul and LiteFS API
- [x] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)


@@ -0,0 +1,27 @@
# Specification: Fix Odroid8 and Script Robustness (`fix_odroid8_and_script`)
## Overview
Address the "critical" loop on node `odroid8` caused by a LiteFS Cluster ID mismatch and improve the `cluster_status` script's error handling when the Nomad CLI is unavailable or misconfigured.
## Functional Requirements
- **Node Recovery (`odroid8`):**
- Identify the specific LiteFS data directory on `odroid8` (usually `/var/lib/litefs` inside the container, mapped to a host path).
- Guide the user through stopping the allocation and wiping the metadata/data to resolve the Consul lease conflict.
- **Script Robustness:**
- Update `nomad_client.py` to handle `subprocess` failures more gracefully.
- If a `nomad` command fails, the script should not print a traceback or confusing "non-zero exit status" messages to the primary output.
- Instead, it should log the error to `stderr` and continue, marking the affected fields (like logs or full ID) as "Nomad Error".
- Add a clear warning in the output if Nomad connectivity is lost, suggesting the user verify `NOMAD_ADDR`.
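The global connectivity warning could be as small as a single summary line, sketched here with a hypothetical flag name.

```python
def nomad_connectivity_warning(nomad_failed: bool) -> str:
    """Return a one-line warning for the summary when any Nomad call has failed."""
    if not nomad_failed:
        return ""
    return (
        "WARNING: Nomad CLI calls failed; allocation logs and details are unavailable. "
        "Verify the nomad binary is installed and NOMAD_ADDR points at a reachable server."
    )
```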
## Non-Functional Requirements
- **Reliability:** The script should remain functional even if one of the underlying tools (Nomad CLI) is broken.
- **Ease of Use:** Provide clear, copy-pasteable commands for the manual cleanup process.
## Acceptance Criteria
- [ ] `odroid8` node successfully joins the cluster and shows as `passing` in Consul.
- [ ] `cluster_status` script runs without error even if the `nomad` binary is missing or cannot connect to the server (showing fallback info).
- [ ] Script provides a helpful message when `nomad` commands fail.
## Out of Scope
- Fixing the Navidrome database path issue (this will be handled in a separate track once the cluster is stable).
- Automating the host-level cleanup (manual guidance only).


@@ -2,13 +2,3 @@
This file tracks all major tracks for the project. Each track has its own detailed plan in its respective folder.
---
---
- [~] **Track: diagnose why odroid8 shows as critical - why is the script not showing uptime or replication lag?**
*Link: [./tracks/diagnose_and_enhance_20260208/](./tracks/diagnose_and_enhance_20260208/)*
---
- [x] **Track: fix odroid8 node - stuck in a loop because of a LiteFS Cluster ID mismatch in the Consul lease. Also fix script execution errors.**
*Link: [./tracks/fix_odroid8_and_script_20260208/](./tracks/fix_odroid8_and_script_20260208/)*