Compare commits


27 Commits

Author SHA1 Message Date
5e1c7c116e conductor(plan): Mark Phase 2 tasks as complete 2026-02-08 13:59:55 -08:00
9cd5455378 feat(entrypoint): Refactor entrypoint for leadership-aware process management and clean up register.sh 2026-02-08 13:59:37 -08:00
e9b26f6eb9 conductor(plan): Mark task 'Update nomad job' as complete 2026-02-08 13:58:42 -08:00
5cbb657c45 feat(config): Correct storage paths in Nomad job to use LiteFS and shared mounts 2026-02-08 13:58:26 -08:00
c9bf1410c9 conductor(plan): Mark task 'Update Dockerfile' as complete 2026-02-08 13:57:59 -08:00
ef91b8e9af feat(config): Update Dockerfile to use simplified LiteFS ENTRYPOINT 2026-02-08 13:57:43 -08:00
f51baf7949 conductor(plan): Mark task 'Update litefs.yml' as complete 2026-02-08 13:57:18 -08:00
396dfeb7a3 feat(config): Enable exec block in litefs.yml to trigger entrypoint.sh 2026-02-08 13:57:00 -08:00
ee297f8d8e chore(conductor): Archive completed tracks 2026-02-08 13:34:48 -08:00
e9c52d2847 docs(conductor): Synchronize docs for track 'fix_odroid8_and_script' 2026-02-08 13:31:47 -08:00
b8a12b1285 chore(conductor): Mark track 'fix_odroid8_and_script' as complete 2026-02-08 13:28:14 -08:00
6d1b473efa conductor(plan): Mark phase 'Phase 1: Script Robustness' as complete 2026-02-08 11:16:56 -08:00
860000bd04 conductor(checkpoint): Checkpoint end of Phase 1 2026-02-08 11:15:55 -08:00
22ec8a5cc0 conductor(plan): Mark phase 'Phase 2: Nomad Integration and Logs' as complete 2026-02-08 07:55:12 -08:00
6d77729a4a conductor(checkpoint): Checkpoint end of Phase 2 2026-02-08 07:54:49 -08:00
a686c5b225 conductor(checkpoint): Checkpoint end of Phase 1 2026-02-08 07:49:56 -08:00
7c0c146d0c feat(diagnose): Update Consul client to fetch health check output and display diagnostics 2026-02-08 07:44:22 -08:00
3c4c1c4d80 chore(conductor): Cleanup tracked files for archived tracks 2026-02-08 07:34:24 -08:00
f367f93768 chore(conductor): Archive track 'cluster_status_python' 2026-02-08 07:34:21 -08:00
c7e7c9fd7b docs(conductor): Synchronize docs for track 'cluster_status_python' 2026-02-08 07:33:48 -08:00
cbd109a8bc chore(conductor): Mark track 'cluster_status_python' as complete 2026-02-08 06:18:13 -08:00
c0dcb1a47d conductor(plan): Mark phase 'Phase 3: Data Processing and Formatting' as complete 2026-02-08 06:17:17 -08:00
20d99be67d conductor(checkpoint): Checkpoint end of Phase 3 2026-02-08 06:17:06 -08:00
16aad2958a conductor(plan): Mark phase 'Phase 2: Core Data Fetching' as complete 2026-02-08 06:03:45 -08:00
90ffed531f conductor(checkpoint): Checkpoint end of Phase 2 2026-02-08 06:03:38 -08:00
1749cc29ee conductor(plan): Mark phase 'Phase 1: Environment and Project Structure' as complete 2026-02-08 05:53:35 -08:00
e71d5e2ffc conductor(checkpoint): Checkpoint end of Phase 1 2026-02-08 05:53:27 -08:00
44 changed files with 1386 additions and 151 deletions

.gitignore vendored Normal file

@@ -0,0 +1,3 @@
.venv/
__pycache__/
*.pyc

Dockerfile

@@ -12,13 +12,14 @@ RUN apk add --no-cache fuse3 ca-certificates bash curl
COPY --from=litefs /usr/local/bin/litefs /usr/local/bin/litefs
# Copy scripts
COPY register.sh /usr/local/bin/register.sh
COPY entrypoint.sh /usr/local/bin/entrypoint.sh
RUN chmod +x /usr/local/bin/register.sh /usr/local/bin/entrypoint.sh
RUN chmod +x /usr/local/bin/entrypoint.sh
# Copy LiteFS configuration
COPY litefs.yml /etc/litefs.yml
# LiteFS becomes the supervisor.
# It will mount the FUSE fs and then execute our entrypoint script.
ENTRYPOINT ["litefs", "mount", "--", "/usr/local/bin/entrypoint.sh"]
# It will mount the FUSE fs and then execute the command defined in litefs.yml's exec section.
ENTRYPOINT ["litefs", "mount"]


@@ -0,0 +1,5 @@
# Track cluster_status_python_20260208 Context
- [Specification](./spec.md)
- [Implementation Plan](./plan.md)
- [Metadata](./metadata.json)


@@ -0,0 +1,8 @@
{
  "track_id": "cluster_status_python_20260208",
  "type": "feature",
  "status": "new",
  "created_at": "2026-02-08T15:00:00Z",
  "updated_at": "2026-02-08T15:00:00Z",
  "description": "create a script that runs on my local system (i don't run consul locally) that: - check consul services are registered correctly - displays the expected state (who is primary, what replicas exist) - show basic litefs status info for each node"
}


@@ -0,0 +1,31 @@
# Plan: Cluster Status Script (`cluster_status_python`)
## Phase 1: Environment and Project Structure [x] [checkpoint: e71d5e2]
- [x] Task: Initialize Python project structure (venv, requirements.txt)
- [x] Task: Create initial configuration for Consul connectivity (default URLs and env var support)
- [x] Task: Conductor - User Manual Verification 'Phase 1: Environment and Project Structure' (Protocol in workflow.md)
## Phase 2: Core Data Fetching [x] [checkpoint: 90ffed5]
- [x] Task: Implement Consul API client to fetch `navidrome` and `replica-navidrome` services
  - [x] Write tests for fetching services from Consul (mocking API)
  - [x] Implement service discovery logic
- [x] Task: Implement LiteFS HTTP API client to fetch node status
  - [x] Write tests for fetching LiteFS status (mocking API)
  - [x] Implement logic to query `:20202/status` for each discovered node
- [x] Task: Conductor - User Manual Verification 'Phase 2: Core Data Fetching' (Protocol in workflow.md)
## Phase 3: Data Processing and Formatting [x] [checkpoint: 20d99be]
- [x] Task: Implement data aggregation logic
  - [x] Write tests for aggregating Consul and LiteFS data into a single cluster state object
  - [x] Implement logic to calculate overall cluster health and role assignment
- [x] Task: Implement CLI output formatting (Table and Color)
  - [x] Write tests for table formatting and color-coding logic
  - [x] Implement `tabulate` based output with a health summary
- [x] Task: Conductor - User Manual Verification 'Phase 3: Data Processing and Formatting' (Protocol in workflow.md)
## Phase 4: CLI Interface and Final Polishing [x]
- [x] Task: Implement command-line arguments (argparse)
  - [x] Write tests for CLI argument parsing (Consul URL overrides, etc.)
  - [x] Finalize the `main` entry point
- [x] Task: Final verification of script against requirements
- [x] Task: Conductor - User Manual Verification 'Phase 4: CLI Interface and Final Polishing' (Protocol in workflow.md)
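The table-and-color output described in Phase 3 can be sketched roughly as follows. The real script uses `tabulate`; this stdlib-only version, the `format_node_table` name, and the node dict fields are illustrative assumptions, not the actual implementation:

```python
# Sketch of the Phase 3 formatting step: turn aggregated node records
# into a fixed-width text table (stdlib only; the plan uses `tabulate`).
def format_node_table(nodes):
    headers = ["Node", "Role", "Health", "Uptime", "Lag"]
    rows = [[n["name"], n["role"], n["health"],
             n.get("uptime", "n/a"), n.get("lag", "n/a")] for n in nodes]
    # Column width = widest cell in that column (header included).
    widths = [max(len(str(c)) for c in col) for col in zip(headers, *rows)]
    fmt = "  ".join(f"{{:<{w}}}" for w in widths)
    lines = [fmt.format(*headers), fmt.format(*("-" * w for w in widths))]
    lines += [fmt.format(*map(str, r)) for r in rows]
    return "\n".join(lines)

nodes = [
    {"name": "odroid5", "role": "primary", "health": "passing", "uptime": "3h"},
    {"name": "odroid8", "role": "replica", "health": "critical"},
]
print(format_node_table(nodes))
```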


@@ -0,0 +1,40 @@
# Specification: Cluster Status Script (`cluster_status_python`)
## Overview
Create a Python-based CLI script to be run on a local system (outside the cluster) to monitor the health and status of the Navidrome LiteFS/Consul cluster. This tool will bridge the gap for local monitoring without needing a local Consul instance.
## Functional Requirements
- **Consul Connectivity:**
- Connect to a remote Consul instance.
- Default to a hardcoded URL with support for overrides via command-line arguments (e.g., `--consul-url`) or environment variables (`CONSUL_HTTP_ADDR`).
- Assume no Consul authentication token is required.
- **Service Discovery:**
- Query Consul for the `navidrome` (Primary) and `replica-navidrome` (Replica) services.
- Verify that services are registered correctly and health checks are passing.
- **Status Reporting:**
- Display a text-based table summarizing the state of all nodes in the cluster.
- Color-coded output for quick health assessment.
- Include a summary section at the top indicating overall cluster health.
- **Node-Level Details:**
- Role identification (Primary vs. Replica).
- Uptime of the LiteFS process.
- Advertise URL for each node.
- Replication Lag (for Replicas).
- Write-forwarding proxy target (for Replicas).
## Non-Functional Requirements
- **Language:** Python 3.x.
- **Dependencies:** Use standard libraries or common packages like `requests` for API calls and `tabulate` for table formatting.
- **Portability:** Must run on Linux (user's OS) without requiring local Consul or Nomad binaries.
## Acceptance Criteria
- [ ] Script successfully retrieves service list from remote Consul.
- [ ] Script correctly identifies the current Primary node based on Consul tags/service names.
- [ ] Script queries the LiteFS HTTP API (`:20202/status`) on each node to gather internal metrics.
- [ ] Output is formatted as a clear, readable text table.
- [ ] Overrides for Consul URL are functional.
## Out of Scope
- Direct interaction with Nomad API (Consul is the source of truth for this script).
- Database-level inspection (SQL queries).
- Remote log tailing.
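The connectivity precedence above (CLI flag over `CONSUL_HTTP_ADDR` over a hardcoded default) can be sketched as below; the default URL is a placeholder, not the project's actual hardcoded value, and `get_consul_url` is an assumed helper name:

```python
# Sketch of Consul URL resolution: explicit CLI override wins, then the
# CONSUL_HTTP_ADDR environment variable, then a hardcoded fallback.
import os

DEFAULT_CONSUL_URL = "http://consul.example:8500"  # placeholder default

def get_consul_url(cli_override=None):
    return cli_override or os.environ.get("CONSUL_HTTP_ADDR") or DEFAULT_CONSUL_URL
```

Service health would then be fetched from `{url}/v1/health/service/navidrome` (Consul's standard health endpoint) with a library such as `requests`.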


@@ -1,4 +1,4 @@
# Track fix_routing_20260207 Context
# Track diagnose_and_enhance_20260208 Context
- [Specification](./spec.md)
- [Implementation Plan](./plan.md)


@@ -0,0 +1,8 @@
{
  "track_id": "diagnose_and_enhance_20260208",
  "type": "feature",
  "status": "new",
  "created_at": "2026-02-08T16:00:00Z",
  "updated_at": "2026-02-08T16:00:00Z",
  "description": "diagnose why odroid8 shows as critical - why is the script not showing uptime or replication lag?"
}


@@ -0,0 +1,30 @@
# Plan: Cluster Diagnosis and Script Enhancement (`diagnose_and_enhance`)
## Phase 1: Enhanced Diagnostics (Consul) [x] [checkpoint: a686c5b]
- [x] Task: Update `consul_client.py` to fetch detailed health check output
  - [x] Write tests for fetching `Output` field from Consul checks
  - [x] Implement logic to extract and store the `Output` (error message)
- [x] Task: Update aggregator and formatter to display Consul errors
  - [x] Update aggregation logic to include `consul_error`
  - [x] Update table formatter to indicate an error (maybe a flag or color)
  - [x] Add a "Diagnostics" section to the output to print full error details
- [x] Task: Conductor - User Manual Verification 'Phase 1: Enhanced Diagnostics (Consul)' (Protocol in workflow.md)
## Phase 2: Nomad Integration and Logs [x] [checkpoint: 6d77729]
- [x] Task: Implement `nomad_client.py` wrapper
  - [x] Write tests for `get_allocation_logs`, `get_node_status`, and `restart_allocation` (mocking subprocess)
  - [x] Implement `subprocess.run(["nomad", ...])` logic to fetch logs and restart allocations
- [x] Task: Integrate Nomad logs into diagnosis
  - [x] Update aggregator to call Nomad client for critical nodes
  - [x] Update "Diagnostics" section to display the last 20 lines of stderr
- [x] Task: Conductor - User Manual Verification 'Phase 2: Nomad Integration and Logs' (Protocol in workflow.md)
## Phase 3: Advanced LiteFS Status [ ]
- [x] Task: Implement `litefs_status` via `nomad alloc exec`
  - [x] Write tests for executing remote commands via Nomad
  - [x] Update `litefs_client.py` to fallback to `nomad alloc exec` if HTTP fails
  - [x] Parse `litefs status` output (text/json) to extract uptime and replication lag
- [x] Task: Final Polish and Diagnosis Run
  - [x] Ensure all pieces work together
  - [x] Run the script to diagnose `odroid8`
- [~] Task: Conductor - User Manual Verification 'Phase 3: Advanced LiteFS Status' (Protocol in workflow.md)


@@ -0,0 +1,32 @@
# Specification: Cluster Diagnosis and Script Enhancement (`diagnose_and_enhance`)
## Overview
Investigate the root cause of the "critical" status on node `odroid8` and enhance the `cluster_status` script to provide deeper diagnostic information, including Consul health check details, Nomad logs, and LiteFS status via `nomad alloc exec`.
## Functional Requirements
- **Consul Diagnostics:**
- Update `consul_client.py` to fetch the `Output` field from Consul health checks.
- Display the specific error message from Consul when a node is in a non-passing state.
- **Nomad Integration:**
- Implement a `nomad_client.py` (or similar) to interact with the Nomad CLI.
- **Log Retrieval:** For nodes with health issues, fetch the last 20 lines of `stderr` from the relevant Nomad allocation.
- **Internal Status:** Implement a method to run `litefs status` inside the container using `nomad alloc exec` to retrieve uptime and replication lag when the HTTP API is unavailable.
- **LiteFS API Investigation:**
- Investigate why the `/status` endpoint returns a 404 and attempt to resolve it via configuration or by identifying the correct endpoint for LiteFS 0.5.
- **Output Formatting:**
- Update the CLI table to display the retrieved uptime and replication lag.
- Add a "Diagnostics" section at the bottom of the output when errors are detected, showing the Consul check output and Nomad logs.
## Non-Functional Requirements
- **CLI Dependency:** The script assumes the `nomad` CLI is installed and configured in the local system's PATH.
- **Language:** Python 3.x.
## Acceptance Criteria
- [ ] Script displays the specific reason for `odroid8`'s critical status from Consul.
- [ ] Script successfully retrieves and displays Nomad stderr logs for failing nodes.
- [ ] Script displays LiteFS uptime and replication lag (retrieved via `alloc exec` or HTTP).
- [ ] The root cause of `odroid8`'s failure is identified and reported.
## Out of Scope
- Automatic fixing of the identified issues (diagnosis only).
- Direct SSH into nodes (using Nomad/Consul APIs/CLIs only).
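The `nomad alloc exec` fallback described above can be sketched as follows. The exact command shape, the helper name, and the "key: value" parsing are assumptions; the real `litefs status` output format for LiteFS 0.5 is not confirmed here:

```python
# Sketch: run `litefs status` inside a Nomad allocation when the LiteFS
# HTTP API is unavailable, returning parsed fields or an error record.
import subprocess

def litefs_status_via_exec(alloc_id, nomad_bin="nomad"):
    cmd = [nomad_bin, "alloc", "exec", alloc_id, "litefs", "status"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             timeout=10, check=True).stdout
    except (OSError, subprocess.SubprocessError) as exc:
        # Missing binary, timeout, or non-zero exit: report, don't crash.
        return {"error": str(exc)}
    # Assume a simple "key: value" text layout for illustration.
    fields = {}
    for line in out.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip().lower()] = value.strip()
    return fields
```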


@@ -0,0 +1,5 @@
# Track fix_odroid8_and_script_20260208 Context
- [Specification](./spec.md)
- [Implementation Plan](./plan.md)
- [Metadata](./metadata.json)


@@ -0,0 +1,8 @@
{
  "track_id": "fix_odroid8_and_script_20260208",
  "type": "bug",
  "status": "new",
  "created_at": "2026-02-08T17:00:00Z",
  "updated_at": "2026-02-08T17:00:00Z",
  "description": "fix odroid8 node - stuck in a loop because of a LiteFS Cluster ID mismatch in the Consul lease. Also fix script execution errors."
}


@@ -0,0 +1,26 @@
# Plan: Fix Odroid8 and Script Robustness (`fix_odroid8_and_script`)
## Phase 1: Script Robustness [x] [checkpoint: 860000b]
- [x] Task: Update `nomad_client.py` to handle subprocess errors gracefully
  - [x] Write tests for handling Nomad CLI absence/failure
  - [x] Update implementation to return descriptive error objects or `None` without crashing
- [x] Task: Update aggregator and formatter to handle Nomad errors
  - [x] Update `cluster_aggregator.py` to gracefully skip Nomad calls if they fail
  - [x] Update `output_formatter.py` to display "Nomad Error" in relevant cells
  - [x] Add a global "Nomad Connectivity Warning" to the summary
- [x] Task: Conductor - User Manual Verification 'Phase 1: Script Robustness' (Protocol in workflow.md)
## Phase 2: Odroid8 Recovery [ ]
- [x] Task: Identify and verify `odroid8` LiteFS data path
  - [x] Run `nomad alloc status` to find the volume mount for `odroid8`
  - [x] Provide the user with the exact host path to the LiteFS data
- [x] Task: Guide user through manual cleanup
  - [x] Provide steps to stop the allocation
  - [x] Provide the `rm` command to clear the LiteFS metadata
  - [x] Provide steps to restart and verify the node
- [~] Task: Conductor - User Manual Verification 'Phase 2: Odroid8 Recovery' (Protocol in workflow.md)
## Phase 3: Final Verification [x]
- [x] Task: Final verification run of the script
- [x] Task: Verify cluster health in Consul and LiteFS API
- [x] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)


@@ -0,0 +1,27 @@
# Specification: Fix Odroid8 and Script Robustness (`fix_odroid8_and_script`)
## Overview
Address the "critical" loop on node `odroid8` caused by a LiteFS Cluster ID mismatch and improve the `cluster_status` script's error handling when the Nomad CLI is unavailable or misconfigured.
## Functional Requirements
- **Node Recovery (`odroid8`):**
- Identify the specific LiteFS data directory on `odroid8` (usually `/var/lib/litefs` inside the container, mapped to a host path).
- Guide the user through stopping the allocation and wiping the metadata/data to resolve the Consul lease conflict.
- **Script Robustness:**
- Update `nomad_client.py` to handle `subprocess` failures more gracefully.
- If a `nomad` command fails, the script should not print a traceback or confusing "non-zero exit status" messages to the primary output.
- Instead, it should log the error to `stderr` and continue, marking the affected fields (like logs or full ID) as "Nomad Error".
- Add a clear warning in the output if Nomad connectivity is lost, suggesting the user verify `NOMAD_ADDR`.
## Non-Functional Requirements
- **Reliability:** The script should remain functional even if one of the underlying tools (Nomad CLI) is broken.
- **Ease of Use:** Provide clear, copy-pasteable commands for the manual cleanup process.
## Acceptance Criteria
- [ ] `odroid8` node successfully joins the cluster and shows as `passing` in Consul.
- [ ] `cluster_status` script runs without error even if the `nomad` binary is missing or cannot connect to the server (showing fallback info).
- [ ] Script provides a helpful message when `nomad` commands fail.
## Out of Scope
- Fixing the Navidrome database path issue (this will be handled in a separate track once the cluster is stable).
- Automating the host-level cleanup (manual guidance only).
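A minimal sketch of the robustness requirement above: a wrapper that runs a `nomad` command, never raises, and routes failures to `stderr` so the table output stays clean. The function name, messages, and timeout are assumptions, not the project's actual code:

```python
# Sketch: run a nomad CLI command defensively. Returns stdout on success,
# None on any failure, with a human-readable warning printed to stderr.
import subprocess
import sys

def run_nomad(args):
    try:
        proc = subprocess.run(["nomad", *args], capture_output=True,
                              text=True, timeout=15)
    except OSError as exc:  # binary missing, not executable, etc.
        print(f"warning: nomad CLI unavailable ({exc}); "
              "check NOMAD_ADDR and PATH", file=sys.stderr)
        return None
    except subprocess.TimeoutExpired:
        print(f"warning: nomad {' '.join(args)} timed out", file=sys.stderr)
        return None
    if proc.returncode != 0:
        print(f"warning: nomad {' '.join(args)} failed: "
              f"{proc.stderr.strip()}", file=sys.stderr)
        return None
    return proc.stdout
```

Callers can then treat `None` as "mark this cell as Nomad Error and move on" instead of letting a traceback reach the user.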


@@ -16,3 +16,6 @@
## Networking
- **Traefik:** Acts as the cluster's ingress controller, routing traffic based on Consul tags.
- **LiteFS Proxy:** Handles transparent write-forwarding to the cluster leader.
## Monitoring & Tooling
- **Python (Cluster Status Script):** A local CLI tool for monitoring Consul service registration, LiteFS replication status, and diagnosing Nomad allocation logs.


@@ -1,3 +1,4 @@
# Project Tracks
This file tracks all major tracks for the project. Each track has its own detailed plan in its respective folder.
---


@@ -0,0 +1,30 @@
# Plan: Cluster Diagnosis and Script Enhancement (`diagnose_and_enhance`)
## Phase 1: Enhanced Diagnostics (Consul) [x] [checkpoint: a686c5b]
- [x] Task: Update `consul_client.py` to fetch detailed health check output
  - [x] Write tests for fetching `Output` field from Consul checks
  - [x] Implement logic to extract and store the `Output` (error message)
- [x] Task: Update aggregator and formatter to display Consul errors
  - [x] Update aggregation logic to include `consul_error`
  - [x] Update table formatter to indicate an error (maybe a flag or color)
  - [x] Add a "Diagnostics" section to the output to print full error details
- [x] Task: Conductor - User Manual Verification 'Phase 1: Enhanced Diagnostics (Consul)' (Protocol in workflow.md)
## Phase 2: Nomad Integration and Logs [x] [checkpoint: 6d77729]
- [x] Task: Implement `nomad_client.py` wrapper
  - [x] Write tests for `get_allocation_logs`, `get_node_status`, and `restart_allocation` (mocking subprocess)
  - [x] Implement `subprocess.run(["nomad", ...])` logic to fetch logs and restart allocations
- [x] Task: Integrate Nomad logs into diagnosis
  - [x] Update aggregator to call Nomad client for critical nodes
  - [x] Update "Diagnostics" section to display the last 20 lines of stderr
- [x] Task: Conductor - User Manual Verification 'Phase 2: Nomad Integration and Logs' (Protocol in workflow.md)
## Phase 3: Advanced LiteFS Status [ ]
- [ ] Task: Implement `litefs_status` via `nomad alloc exec`
  - [ ] Write tests for executing remote commands via Nomad
  - [ ] Update `litefs_client.py` to fallback to `nomad alloc exec` if HTTP fails
  - [ ] Parse `litefs status` output (text/json) to extract uptime and replication lag
- [ ] Task: Final Polish and Diagnosis Run
  - [ ] Ensure all pieces work together
  - [ ] Run the script to diagnose `odroid8`
- [ ] Task: Conductor - User Manual Verification 'Phase 3: Advanced LiteFS Status' (Protocol in workflow.md)


@@ -0,0 +1,22 @@
# Plan: Fix LiteFS Configuration and Process Management (`fix_litefs_config`)
## Phase 1: Configuration and Image Structure [ ]
- [x] Task: Update `litefs.yml` to include the `exec` block (396dfeb)
- [x] Task: Update `Dockerfile` to use LiteFS as the supervisor (`ENTRYPOINT ["litefs", "mount"]`) (ef91b8e)
- [x] Task: Update `navidrome-litefs-v2.nomad` with corrected storage paths (`ND_DATAFOLDER`, `ND_CACHEFOLDER`, `ND_BACKUP_PATH`) (5cbb657)
- [ ] Task: Conductor - User Manual Verification 'Phase 1: Configuration and Image Structure' (Protocol in workflow.md)
## Phase 2: Entrypoint and Registration Logic [x] [checkpoint: 9cd5455]
- [x] Task: Refactor `entrypoint.sh` to handle leadership-aware process management (9cd5455)
  - [x] Integrate Consul registration logic (from `register.sh`)
  - [x] Implement loop to start/stop Navidrome based on `/data/.primary` existence
  - [x] Ensure proper signal handling for Navidrome shutdown
- [x] Task: Clean up redundant scripts (e.g., `register.sh` if fully integrated) (9cd5455)
- [ ] Task: Conductor - User Manual Verification 'Phase 2: Entrypoint and Registration Logic' (Protocol in workflow.md)
## Phase 3: Deployment and Failover Verification [ ]
- [ ] Task: Build and push the updated Docker image via Gitea Actions (if possible) or manual trigger
- [ ] Task: Deploy the updated Nomad job
- [ ] Task: Verify cluster health and process distribution using `cluster_status` script
- [ ] Task: Perform a manual failover (stop primary allocation) and verify Navidrome migrates correctly
- [ ] Task: Conductor - User Manual Verification 'Phase 3: Deployment and Failover Verification' (Protocol in workflow.md)


@@ -0,0 +1,26 @@
# Plan: Fix Odroid8 and Script Robustness (`fix_odroid8_and_script`)
## Phase 1: Script Robustness [x] [checkpoint: 860000b]
- [x] Task: Update `nomad_client.py` to handle subprocess errors gracefully
  - [x] Write tests for handling Nomad CLI absence/failure
  - [x] Update implementation to return descriptive error objects or `None` without crashing
- [x] Task: Update aggregator and formatter to handle Nomad errors
  - [x] Update `cluster_aggregator.py` to gracefully skip Nomad calls if they fail
  - [x] Update `output_formatter.py` to display "Nomad Error" in relevant cells
  - [x] Add a global "Nomad Connectivity Warning" to the summary
- [x] Task: Conductor - User Manual Verification 'Phase 1: Script Robustness' (Protocol in workflow.md)
## Phase 2: Odroid8 Recovery [ ]
- [x] Task: Identify and verify `odroid8` LiteFS data path
  - [x] Run `nomad alloc status` to find the volume mount for `odroid8`
  - [x] Provide the user with the exact host path to the LiteFS data
- [x] Task: Guide user through manual cleanup
  - [x] Provide steps to stop the allocation
  - [x] Provide the `rm` command to clear the LiteFS metadata
  - [x] Provide steps to restart and verify the node
- [~] Task: Conductor - User Manual Verification 'Phase 2: Odroid8 Recovery' (Protocol in workflow.md)
## Phase 3: Final Verification [x]
- [x] Task: Final verification run of the script
- [x] Task: Verify cluster health in Consul and LiteFS API
- [x] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)


@@ -1,8 +0,0 @@
{
  "track_id": "fix_routing_20260207",
  "type": "bug",
  "status": "new",
  "created_at": "2026-02-07T17:36:00Z",
  "updated_at": "2026-02-07T17:36:00Z",
  "description": "fix routing - use litefs to register the navidrome service with consul. the service should point to the master and avoid the litefs proxy (it breaks navidrome)"
}


@@ -1,25 +0,0 @@
# Implementation Plan: Direct Primary Routing for Navidrome-LiteFS
This plan outlines the steps to reconfigure the Navidrome-LiteFS cluster to bypass the LiteFS write-forwarding proxy and use direct primary node routing for improved reliability and performance.
## Phase 1: Infrastructure Configuration Update [checkpoint: 5a57902]
In this phase, we will modify the Nomad job and LiteFS configuration to support direct port access and primary-aware health checks.
- [x] Task: Update `navidrome-litefs-v2.nomad` to point service directly to Navidrome port
  - [x] Modify `service` block to use port 4533 instead of dynamic mapped port.
  - [x] Replace HTTP health check with a script check running `litefs is-primary`.
- [x] Task: Update `litefs.yml` to ensure consistent internal API binding (if needed)
- [x] Task: Conductor - User Manual Verification 'Infrastructure Configuration Update' (Protocol in workflow.md)
## Phase 2: Deployment and Validation
In this phase, we will deploy the changes and verify that the cluster correctly handles primary election and routing.
- [x] Task: Deploy updated Nomad job
  - [x] Execute `nomad job run navidrome-litefs-v2.nomad`.
- [x] Task: Verify Consul health status
  - [x] Confirm that only the LiteFS primary node is marked as `passing`.
  - [x] Confirm that replica nodes are marked as `critical`.
- [x] Task: Verify Ingress Routing
  - [x] Confirm Traefik correctly routes traffic only to the primary node.
  - [x] Verify that Navidrome is accessible and functional.
- [x] Task: Conductor - User Manual Verification 'Deployment and Validation' (Protocol in workflow.md)


@@ -1,26 +0,0 @@
# Specification: Direct Primary Routing for Navidrome-LiteFS
## Overview
This track aims to fix routing issues caused by the LiteFS proxy. We will reconfigure the Nomad service registration to point directly to the Navidrome process (port 4533) on the primary node, bypassing the LiteFS write-forwarding proxy (port 8080). To ensure Traefik only routes traffic to the node capable of writes, we will implement a "Primary-only" health check.
## Functional Requirements
- **Direct Port Mapping:** Update the Nomad `service` block to use the host port `4533` directly instead of the LiteFS proxy port.
- **Primary-Aware Health Check:** Replace the standard HTTP health check with a script check.
- **Check Logic:** The script will execute `litefs is-primary`.
- If the node is the primary, the command exits with `0` (Passing).
- If the node is a replica, the command exits with a non-zero code (Critical).
- **Service Tags:** Retain all existing Traefik tags so ingress routing continues to work.
## Non-Functional Requirements
- **Failover Reliability:** In the event of a leader election, the old primary must become unhealthy and the new primary must become healthy in Consul, allowing Traefik to update its backends automatically.
- **Minimal Latency:** Bypassing the proxy eliminates the extra network hop for reads and potential compatibility issues with Navidrome's connection handling.
## Acceptance Criteria
- [ ] Consul reports the service as `passing` only on the node currently holding the LiteFS primary lease.
- [ ] Consul reports the service as `critical` on all replica nodes.
- [ ] Traefik correctly routes traffic to the primary node.
- [ ] Navidrome is accessible and functions correctly without the LiteFS proxy intermediary.
## Out of Scope
- Modifying Navidrome internal logic.
- Implementing an external health-check responder.
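The primary-aware registration can be sketched as a Consul agent-API payload wrapping `litefs is-primary` as a script check, so only the current primary passes. The service ID, interval, and timeout below are examples; the JSON shape follows Consul's service registration API:

```python
# Sketch: build the Consul registration payload with a script check.
# `litefs is-primary` exits 0 on the primary (passing) and non-zero on
# replicas (critical), which is exactly what Traefik needs to see.
import json

def primary_check_payload(node_ip, port=4533):
    return json.dumps({
        "ID": f"navidrome-{node_ip}",
        "Name": "navidrome",
        "Address": node_ip,
        "Port": port,
        "Check": {
            "Args": ["litefs", "is-primary"],
            "Interval": "10s",
            "Timeout": "2s",
        },
    })

# PUT this body to {consul}/v1/agent/service/register.
```

Note that script checks only run if they are enabled on the Consul agent (e.g. `enable_local_script_checks`), which is worth verifying before relying on this approach.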

entrypoint.sh

@@ -1,8 +1,78 @@
#!/bin/bash
set -e
# Start the registration loop in the background, redirecting output to stderr so we see it in Nomad logs
/usr/local/bin/register.sh >&2 &
# Configuration from environment
SERVICE_NAME="navidrome"
PORT=4533
CONSUL_HTTP_ADDR="${CONSUL_URL:-http://localhost:8500}"
NODE_IP="${ADVERTISE_IP}"
# Start Navidrome
echo "Starting Navidrome..."
/app/navidrome
# Tags for the Primary service (Traefik enabled)
PRIMARY_TAGS='["navidrome","web","traefik.enable=true","urlprefix-/navidrome","tools","traefik.http.routers.navidromelan.rule=Host(`navidrome.service.dc1.consul`)","traefik.http.routers.navidromewan.rule=Host(`m.fbleagh.duckdns.org`)","traefik.http.routers.navidromewan.middlewares=dex@consulcatalog","traefik.http.routers.navidromewan.tls=true"]'
NAVIDROME_PID=""
SERVICE_ID="navidrome-${NODE_IP}-${SERVICE_NAME}"
cleanup() {
  echo "Caught signal, shutting down..."
  if [ -n "$NAVIDROME_PID" ]; then
    echo "Stopping Navidrome (PID: $NAVIDROME_PID)..."
    kill -TERM "$NAVIDROME_PID"
    wait "$NAVIDROME_PID" || true
  fi
  echo "Deregistering service ${SERVICE_ID} from Consul..."
  curl -s -X PUT "${CONSUL_HTTP_ADDR}/v1/agent/service/deregister/${SERVICE_ID}" || true
  exit 0
}
trap cleanup SIGTERM SIGINT
echo "Starting leadership-aware entrypoint..."
echo "Node IP: $NODE_IP"
echo "Consul: $CONSUL_HTTP_ADDR"
while true; do
  # In LiteFS 0.5, .primary file exists ONLY on replicas.
  if [ ! -f /data/.primary ]; then
    # PRIMARY STATE
    if [ -z "$NAVIDROME_PID" ] || ! kill -0 "$NAVIDROME_PID" 2>/dev/null; then
      echo "Node is Primary. Initializing Navidrome..."
      # Register in Consul
      echo "Registering as primary in Consul..."
      curl -s -X PUT -d "{
        \"ID\": \"${SERVICE_ID}\",
        \"Name\": \"${SERVICE_NAME}\",
        \"Tags\": ${PRIMARY_TAGS},
        \"Address\": \"${NODE_IP}\",
        \"Port\": ${PORT},
        \"Check\": {
          \"HTTP\": \"http://${NODE_IP}:${PORT}/app\",
          \"Interval\": \"10s\",
          \"Timeout\": \"2s\"
        }
      }" "${CONSUL_HTTP_ADDR}/v1/agent/service/register"
      # Start Navidrome
      /app/navidrome &
      NAVIDROME_PID=$!
      echo "Navidrome started with PID $NAVIDROME_PID"
    fi
  else
    # REPLICA STATE
    if [ -n "$NAVIDROME_PID" ] && kill -0 "$NAVIDROME_PID" 2>/dev/null; then
      echo "Node transitioned to Replica. Stopping Navidrome..."
      kill -TERM "$NAVIDROME_PID"
      wait "$NAVIDROME_PID" || true
      NAVIDROME_PID=""
      echo "Deregistering primary service from Consul..."
      curl -s -X PUT "${CONSUL_HTTP_ADDR}/v1/agent/service/deregister/${SERVICE_ID}" || true
    fi
    # We don't register anything for replicas in this version to keep it simple.
    # But we stay alive so LiteFS keeps running.
  fi
  sleep 5
done
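The loop above keys off `/data/.primary`, which in LiteFS 0.5 exists only on replicas, so "file absent" means "this node is the primary". That inverted rule is easy to get wrong, so a tiny illustrative check (hypothetical helper, not part of the shell script):

```python
# Sketch of the leadership test used by the entrypoint: the mount point
# is primary exactly when the .primary marker file is ABSENT.
from pathlib import Path

def is_primary(mount="/data"):
    return not (Path(mount) / ".primary").exists()
```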

litefs.yml

@@ -32,6 +32,5 @@ proxy:
    - "*.svg"
# Commands to run only on the primary node.
# This serves as a primary-only health check responder.
# exec:
#   - cmd: "nc -lk -p 8082 -e echo -e 'HTTP/1.1 200 OK\r\nContent-Length: 7\r\n\r\nPrimary'"
exec:
  - cmd: "/usr/local/bin/entrypoint.sh"

navidrome-litefs-v2.nomad

@@ -8,7 +8,7 @@ job "navidrome-litefs" {
  }
  group "navidrome" {
    count = 2
    count = 4
    update {
      max_parallel = 1
@@ -57,11 +57,12 @@ job "navidrome-litefs" {
      PORT = "8080" # Internal proxy port (unused but kept)
      # Navidrome Config
      ND_DATAFOLDER = "/local/data"
      ND_CACHEFOLDER = "/shared_data/cache"
      ND_CONFIGFILE = "/local/data/navidrome.toml"
      ND_DATAFOLDER = "/data"
      ND_CACHEFOLDER = "/shared_data/cache"
      ND_BACKUP_PATH = "/shared_data/backup"
      ND_CONFIGFILE = "/data/navidrome.toml"
      # Database is on the LiteFS FUSE mount
      # Database is on the LiteFS FUSE mount. Optimized for SQLite.
      ND_DBPATH = "/data/navidrome.db?_busy_timeout=30000&_journal_mode=WAL&_foreign_keys=on&synchronous=NORMAL"
      ND_SCANSCHEDULE = "0"
register.sh

@@ -1,73 +0,0 @@
#!/bin/bash
set -x
# Configuration
SERVICE_NAME="navidrome"
REPLICA_SERVICE_NAME="replica-navidrome"
PORT=4533
CONSUL_HTTP_ADDR="${CONSUL_URL:-http://localhost:8500}"
NODE_IP="${ADVERTISE_IP}"
CHECK_INTERVAL="10s"
# Tags for the Primary service (Traefik enabled)
PRIMARY_TAGS='["navidrome","web","traefik.enable=true","urlprefix-/navidrome","tools","traefik.http.routers.navidromelan.rule=Host(`navidrome.service.dc1.consul`)","traefik.http.routers.navidromewan.rule=Host(`m.fbleagh.duckdns.org`)","traefik.http.routers.navidromewan.middlewares=dex@consulcatalog","traefik.http.routers.navidromewan.tls=true"]'
# Tags for the Replica service
REPLICA_TAGS='["navidrome-replica"]'
register_service() {
local name=$1
local tags=$2
local id="navidrome-${NODE_IP}-${name}"
echo "Registering as ${name} with ID ${id} at ${NODE_IP}:${PORT}..."
curl -v -X PUT -d "{
\"ID\": \"${id}\",
\"Name\": \"${name}\",
\"Tags\": ${tags},
\"Address\": \"${NODE_IP}\",
\"Port\": ${PORT},
\"Check\": {
\"HTTP\": \"http://${NODE_IP}:${PORT}/app\",
\"Interval\": \"${CHECK_INTERVAL}\",
\"Timeout\": \"2s\"
}
}" "${CONSUL_HTTP_ADDR}/v1/agent/service/register"
}
deregister_service() {
local name=$1
local id="navidrome-${NODE_IP}-${name}"
echo "Deregistering ${name} with ID ${id}..."
curl -v -X PUT "${CONSUL_HTTP_ADDR}/v1/agent/service/deregister/${id}"
}
echo "Starting Consul registration loop..."
LAST_STATE="unknown"
while true; do
# In LiteFS 0.5, .primary file exists ONLY on replicas.
# We check /data/.primary because /data is our mount point.
if [ ! -f /data/.primary ]; then
CURRENT_STATE="primary"
else
CURRENT_STATE="replica"
fi
echo "Current node state: ${CURRENT_STATE}"
if [ "$CURRENT_STATE" != "$LAST_STATE" ]; then
echo "State change detected: ${LAST_STATE} -> ${CURRENT_STATE}"
if [ "$CURRENT_STATE" == "primary" ]; then
deregister_service "$REPLICA_SERVICE_NAME"
register_service "$SERVICE_NAME" "$PRIMARY_TAGS"
else
deregister_service "$SERVICE_NAME"
register_service "$REPLICA_SERVICE_NAME" "$REPLICA_TAGS"
fi
LAST_STATE="$CURRENT_STATE"
fi
sleep 15
done
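The leadership test at the heart of this loop (in LiteFS 0.5 the `.primary` file exists only on replicas) can be sketched as a small Python helper — a hypothetical utility for the status tooling, not part of the shipped script:

```python
from pathlib import Path

def litefs_state(mount: str = "/data") -> str:
    """Return 'primary' or 'replica' for a LiteFS 0.5 mount.

    LiteFS writes a .primary file (containing the primary's hostname)
    only on replica nodes, so its absence means this node is the leader.
    """
    return "replica" if (Path(mount) / ".primary").exists() else "primary"
```

For example, `litefs_state("/data")` reproduces the `[ ! -f /data/.primary ]` check above.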

@@ -0,0 +1,11 @@
.PHONY: setup test run

setup:
	python3 -m venv .venv
	. .venv/bin/activate && pip install -r requirements.txt

test:
	. .venv/bin/activate && pytest -v --cov=.

run:
	. .venv/bin/activate && PYTHONPATH=. python3 cli.py

scripts/cluster_status/cli.py Executable file
@@ -0,0 +1,53 @@
#!/usr/bin/env python3
import argparse
import sys
import config
import cluster_aggregator
import output_formatter
import nomad_client
def parse_args():
parser = argparse.ArgumentParser(description="Monitor Navidrome LiteFS/Consul cluster status.")
parser.add_argument("--consul-url", help="Override Consul API URL (default from env or hardcoded)")
parser.add_argument("--no-color", action="store_true", help="Disable colorized output")
parser.add_argument("--restart", help="Restart the allocation on the specified node")
return parser.parse_args()
def main():
args = parse_args()
# Resolve Consul URL
consul_url = config.get_consul_url(args.consul_url)
# Handle restart if requested
if args.restart:
print(f"Attempting to restart allocation on node: {args.restart}...")
alloc_id = nomad_client.get_allocation_id(args.restart, "navidrome-litefs")
if alloc_id:
if nomad_client.restart_allocation(alloc_id):
print(f"Successfully sent restart signal to allocation {alloc_id}")
else:
print(f"Failed to restart allocation {alloc_id}")
else:
print(f"Could not find allocation for node {args.restart}")
print("-" * 30)
try:
# Fetch and aggregate data
cluster_data = cluster_aggregator.get_cluster_status(consul_url)
# Format and print output
print(output_formatter.format_summary(cluster_data, use_color=not args.no_color))
print("\n" + output_formatter.format_node_table(cluster_data["nodes"], use_color=not args.no_color))
# Diagnostics
diagnostics = output_formatter.format_diagnostics(cluster_data["nodes"], use_color=not args.no_color)
if diagnostics:
print(diagnostics)
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
main()

@@ -0,0 +1,74 @@
import consul_client
import litefs_client
import nomad_client
def get_cluster_status(consul_url, job_id="navidrome-litefs"):
"""
Aggregates cluster data from Consul, LiteFS, and Nomad.
"""
consul_nodes = consul_client.get_cluster_services(consul_url)
aggregated_nodes = []
is_healthy = True
primary_count = 0
# Check Nomad connectivity
node_map = nomad_client.get_node_map()
nomad_available = bool(node_map)
for node in consul_nodes:
# Fetch allocation ID first to enable nomad exec fallback
alloc_id = nomad_client.get_allocation_id(node["node"], job_id)
litefs_status = litefs_client.get_node_status(node["address"], alloc_id=alloc_id)
# Merge data
node_data = {
**node,
"litefs_primary": litefs_status.get("is_primary", False),
"uptime": litefs_status.get("uptime", "N/A"),
"advertise_url": litefs_status.get("advertise_url", ""),
"replication_lag": litefs_status.get("replication_lag", "N/A"),
"litefs_error": litefs_status.get("error", None),
"nomad_logs": None,
"alloc_id": alloc_id
}
if node["status"] != "passing":
is_healthy = False
# Fetch Nomad logs for critical nodes
if alloc_id:
node_data["nomad_logs"] = nomad_client.get_allocation_logs(alloc_id)
if node_data["litefs_primary"]:
primary_count += 1
# Check for active databases
node_dbs = litefs_status.get("dbs", {})
if node_dbs:
node_data["active_dbs"] = list(node_dbs.keys())
else:
node_data["active_dbs"] = []
aggregated_nodes.append(node_data)
# Final health check
health = "Healthy"
if not is_healthy:
health = "Unhealthy"
elif primary_count == 0:
health = "No Primary Detected"
elif primary_count > 1:
health = "Split Brain Detected (Multiple Primaries)"
# Global warning if no DBs found on any node
all_dbs = [db for n in aggregated_nodes for db in n.get("active_dbs", [])]
if not all_dbs:
health = f"{health} (WARNING: No LiteFS Databases Found)"
return {
"health": health,
"nodes": aggregated_nodes,
"primary_count": primary_count,
"nomad_available": nomad_available
}

@@ -0,0 +1,15 @@
import os
DEFAULT_CONSUL_URL = "http://consul.service.dc1.consul:8500"
def get_consul_url(url_arg=None):
"""
Resolves the Consul URL in the following order:
1. CLI Argument (url_arg)
2. Environment Variable (CONSUL_HTTP_ADDR)
3. Default (http://consul.service.dc1.consul:8500)
"""
if url_arg:
return url_arg
return os.environ.get("CONSUL_HTTP_ADDR", DEFAULT_CONSUL_URL)
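The resolution order documented above can be exercised with a standalone sketch (`resolve` is a local stand-in for `get_consul_url`):

```python
import os

DEFAULT = "http://consul.service.dc1.consul:8500"

def resolve(url_arg=None):
    # Mirrors get_consul_url: CLI argument, then env var, then default
    return url_arg or os.environ.get("CONSUL_HTTP_ADDR", DEFAULT)

os.environ.pop("CONSUL_HTTP_ADDR", None)
first = resolve()                        # falls through to the default
os.environ["CONSUL_HTTP_ADDR"] = "http://10.0.0.1:8500"
second = resolve()                       # env var beats the default
third = resolve("http://cli:8500")       # CLI argument beats the env var
```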

@@ -0,0 +1,56 @@
import requests
def get_cluster_services(consul_url):
"""
Queries Consul health API for navidrome and replica-navidrome services.
Returns a list of dictionaries with node info.
"""
services = []
# Define roles to fetch
role_map = {
"navidrome": "primary",
"replica-navidrome": "replica"
}
for service_name, role in role_map.items():
url = f"{consul_url}/v1/health/service/{service_name}"
try:
response = requests.get(url, timeout=5)
response.raise_for_status()
data = response.json()
for item in data:
node_name = item["Node"]["Node"]
address = item["Node"]["Address"]
port = item["Service"]["Port"]
# Determine overall status from checks and extract output
checks = item.get("Checks", [])
status = "passing"
check_output = ""
for check in checks:
if check["Status"] != "passing":
status = check["Status"]
check_output = check.get("Output", "")
break
else:
# All checks passing: keep the first check's output for context
if checks:
check_output = checks[0].get("Output", "")
services.append({
"node": node_name,
"address": address,
"port": port,
"role": role,
"status": status,
"service_id": item["Service"]["ID"],
"check_output": check_output
})
except Exception as e:
# Skip this service on fetch failure; the error is reported below
print(f"Error fetching {service_name}: {e}")
return services

@@ -0,0 +1,92 @@
import requests
import nomad_client
import re
def parse_litefs_status(output):
"""
Parses the output of 'litefs status'.
"""
status = {}
# Extract Primary
primary_match = re.search(r"Primary:\s+(true|false)", output, re.IGNORECASE)
if primary_match:
status["is_primary"] = primary_match.group(1).lower() == "true"
# Extract Uptime
uptime_match = re.search(r"Uptime:\s+(\S+)", output)
if uptime_match:
status["uptime"] = uptime_match.group(1)
# Extract Replication Lag
lag_match = re.search(r"Replication Lag:\s+(\S+)", output)
if lag_match:
status["replication_lag"] = lag_match.group(1)
return status
def get_node_status(node_address, port=20202, alloc_id=None):
"""
Queries the LiteFS HTTP API on a specific node for its status.
Tries /status first, then /debug/vars, then falls back to nomad alloc exec.
"""
# 1. Try /status
url = f"http://{node_address}:{port}/status"
try:
response = requests.get(url, timeout=3)
if response.status_code == 200:
data = response.json()
status = {
"is_primary": data.get("primary", False),
"uptime": data.get("uptime", 0),
"advertise_url": data.get("advertiseURL", ""),
"dbs": data.get("dbs", {})
}
if "replicationLag" in data:
status["replication_lag"] = data["replicationLag"]
if "primaryURL" in data:
status["primary_url"] = data["primaryURL"]
return status
except Exception:
pass
# 2. Try /debug/vars
url = f"http://{node_address}:{port}/debug/vars"
try:
response = requests.get(url, timeout=3)
if response.status_code == 200:
data = response.json()
store = data.get("store", {})
status = {
"is_primary": store.get("isPrimary", False),
"uptime": "N/A",
"advertise_url": f"http://{node_address}:{port}",
"dbs": store.get("dbs", {})
}
if "replicationLag" in store:
status["replication_lag"] = store["replicationLag"]
return status
except Exception:
pass
# 3. Fallback to nomad alloc exec
if alloc_id:
try:
output = nomad_client.exec_command(alloc_id, ["litefs", "status"])
if output and "Error" not in output:
parsed_status = parse_litefs_status(output)
if parsed_status:
if "is_primary" not in parsed_status:
parsed_status["is_primary"] = False
if "uptime" not in parsed_status:
parsed_status["uptime"] = "N/A"
parsed_status["advertise_url"] = f"nomad://{alloc_id}"
return parsed_status
except Exception:
pass
return {
"error": "All status retrieval methods failed",
"is_primary": False,
"uptime": "N/A"
}
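`parse_litefs_status` above relies on three regexes; a standalone sketch of the same extraction against a sample dump (output shape assumed, matching the mock used in the tests later in this change):

```python
import re

# Sample 'litefs status' text output (assumed format)
sample = """\
Status:
  Primary: true
  Uptime: 1h5m10s
  Replication Lag: 0s
"""
status = {}
m = re.search(r"Primary:\s+(true|false)", sample, re.IGNORECASE)
if m:
    status["is_primary"] = m.group(1).lower() == "true"
m = re.search(r"Uptime:\s+(\S+)", sample)
if m:
    status["uptime"] = m.group(1)
m = re.search(r"Replication Lag:\s+(\S+)", sample)
if m:
    status["replication_lag"] = m.group(1)
# status == {'is_primary': True, 'uptime': '1h5m10s', 'replication_lag': '0s'}
```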

@@ -0,0 +1,123 @@
import subprocess
import re
import sys
def get_node_map():
"""
Returns a mapping of Node ID to Node Name.
"""
try:
result = subprocess.run(
["nomad", "node", "status"],
capture_output=True, text=True, check=True
)
lines = result.stdout.splitlines()
node_map = {}
for line in lines:
if line.strip() and not line.startswith("ID") and not line.startswith("=="):
parts = re.split(r"\s+", line.strip())
if len(parts) >= 4:
node_map[parts[0]] = parts[3]
return node_map
except FileNotFoundError:
print("Warning: 'nomad' binary not found in PATH.", file=sys.stderr)
return {}
except subprocess.CalledProcessError as e:
print(f"Warning: Failed to query Nomad nodes: {e}", file=sys.stderr)
return {}
except Exception as e:
print(f"Error getting node map: {e}", file=sys.stderr)
return {}
def get_allocation_id(node_name, job_id):
"""
Finds the FULL allocation ID for a specific node and job.
"""
node_map = get_node_map()
try:
result = subprocess.run(
["nomad", "job", "status", job_id],
capture_output=True, text=True, check=True
)
lines = result.stdout.splitlines()
start_parsing = False
for line in lines:
if "Allocations" in line:
start_parsing = True
continue
if start_parsing and line.strip() and not line.startswith("ID") and not line.startswith("=="):
parts = re.split(r"\s+", line.strip())
if len(parts) >= 2:
alloc_id = parts[0]
node_id = parts[1]
resolved_name = node_map.get(node_id, "")
if node_id == node_name or resolved_name == node_name:
# Now get the FULL ID using nomad alloc status
res_alloc = subprocess.run(
["nomad", "alloc", "status", alloc_id],
capture_output=True, text=True, check=True
)
for l in res_alloc.stdout.splitlines():
if l.startswith("ID"):
return l.split("=")[1].strip()
return alloc_id
except FileNotFoundError:
return None  # 'nomad' binary missing; get_node_map has already printed a warning
except Exception as e:
print(f"Error getting allocation ID: {e}", file=sys.stderr)
return None
def get_allocation_logs(alloc_id, tail=20):
"""
Fetches the last N lines of stderr for an allocation.
"""
try:
# Try with task name first, then without
try:
result = subprocess.run(
["nomad", "alloc", "logs", "-stderr", "-task", "navidrome", "-n", str(tail), alloc_id],
capture_output=True, text=True, check=True
)
return result.stdout
except subprocess.CalledProcessError:
result = subprocess.run(
["nomad", "alloc", "logs", "-stderr", "-n", str(tail), alloc_id],
capture_output=True, text=True, check=True
)
return result.stdout
except Exception as e:
# Don't print stack trace, just the error
return f"Nomad Error: {str(e)}"
def exec_command(alloc_id, command, task="navidrome"):
"""
Executes a command inside a specific allocation and task.
"""
try:
args = ["nomad", "alloc", "exec", "-task", task, alloc_id] + command
result = subprocess.run(
args,
capture_output=True, text=True, check=True
)
return result.stdout
except Exception as e:
# Don't print stack trace, just return error string
return f"Nomad Error: {str(e)}"
def restart_allocation(alloc_id):
"""
Restarts a specific allocation.
"""
try:
subprocess.run(
["nomad", "alloc", "restart", alloc_id],
capture_output=True, text=True, check=True
)
return True
except Exception as e:
print(f"Error restarting allocation: {e}", file=sys.stderr)
return False

@@ -0,0 +1,115 @@
from tabulate import tabulate
# ANSI Color Codes
GREEN = "\033[92m"
RED = "\033[91m"
CYAN = "\033[96m"
YELLOW = "\033[93m"
RESET = "\033[0m"
BOLD = "\033[1m"
def colorize(text, color, use_color=True):
if not use_color:
return text
return f"{color}{text}{RESET}"
def format_summary(cluster_data, use_color=True):
"""
Formats the cluster health summary.
"""
health = cluster_data["health"]
color = GREEN if health == "Healthy" else RED
if health == "Split Brain Detected (Multiple Primaries)":
color = YELLOW
summary = [
f"{BOLD}Cluster Health:{RESET} {colorize(health, color, use_color)}",
f"{BOLD}Total Nodes:{RESET} {len(cluster_data['nodes'])}",
f"{BOLD}Primaries:{RESET} {cluster_data['primary_count']}",
]
if not cluster_data.get("nomad_available", True):
summary.append(colorize("WARNING: Nomad CLI unavailable or connectivity failed. Logs and uptime may be missing.", RED, use_color))
summary.append("-" * 30)
return "\n".join(summary)
def format_node_table(nodes, use_color=True):
"""
Formats the node list as a table.
"""
headers = ["Node", "Role", "Consul Status", "LiteFS Role", "Uptime", "Lag", "DBs", "LiteFS Info"]
table_data = []
for node in nodes:
# Consul status color
status = node["status"]
status_color = GREEN if status == "passing" else RED
colored_status = colorize(status, status_color, use_color)
# Role color
role = node["role"]
role_color = CYAN if role == "primary" else RESET
colored_role = colorize(role, role_color, use_color)
# LiteFS role color & consistency check
litefs_primary = node["litefs_primary"]
litefs_role = "primary" if litefs_primary else "replica"
# Highlight discrepancy if consul and litefs disagree
litefs_role_color = RESET
if (role == "primary" and not litefs_primary) or (role == "replica" and litefs_primary):
litefs_role_color = YELLOW
litefs_role = f"!! {litefs_role} !!"
elif litefs_primary:
litefs_role_color = CYAN
colored_litefs_role = colorize(litefs_role, litefs_role_color, use_color)
# Error info
info = ""
if node.get("litefs_error"):
info = colorize("LiteFS API Error", RED, use_color)
else:
info = node.get("advertise_url", "")
table_data.append([
colorize(node["node"], BOLD, use_color),
colored_role,
colored_status,
colored_litefs_role,
node.get("uptime", "N/A"),
node.get("replication_lag", "N/A"),
", ".join(node.get("active_dbs", [])),
info
])
return tabulate(table_data, headers=headers, tablefmt="simple")
def format_diagnostics(nodes, use_color=True):
"""
Formats detailed diagnostic information for nodes with errors.
"""
error_nodes = [n for n in nodes if n["status"] != "passing" or n.get("litefs_error")]
if not error_nodes:
return ""
output = ["", colorize("DIAGNOSTICS", BOLD, use_color), "=" * 20]
for node in error_nodes:
output.append(f"\n{BOLD}Node:{RESET} {colorize(node['node'], RED, use_color)}")
if node["status"] != "passing":
output.append(f" {BOLD}Consul Check Status:{RESET} {colorize(node['status'], RED, use_color)}")
if node.get("check_output"):
output.append(f" {BOLD}Consul Check Output:{RESET}\n {node['check_output'].strip()}")
if node.get("nomad_logs"):
output.append(f" {BOLD}Nomad Stderr Logs (last 20 lines):{RESET}\n{node['nomad_logs']}")
if node.get("litefs_error"):
output.append(f" {BOLD}LiteFS API Error:{RESET} {colorize(node['litefs_error'], RED, use_color)}")
return "\n".join(output)

@@ -0,0 +1,4 @@
requests
tabulate
pytest
pytest-cov

@@ -0,0 +1,58 @@
import pytest
from unittest.mock import patch
import cluster_aggregator
@patch("consul_client.get_cluster_services")
@patch("litefs_client.get_node_status")
@patch("nomad_client.get_allocation_id")
@patch("nomad_client.get_allocation_logs")
@patch("nomad_client.get_node_map")
def test_aggregate_cluster_status(mock_node_map, mock_nomad_logs, mock_nomad_id, mock_litefs, mock_consul):
"""Test aggregating Consul and LiteFS data."""
mock_node_map.return_value = {"id": "name"}
# Mock Consul data
mock_consul.return_value = [
{"node": "node1", "address": "1.1.1.1", "role": "primary", "status": "passing"},
{"node": "node2", "address": "1.1.1.2", "role": "replica", "status": "passing"}
]
# Mock LiteFS data
def litefs_side_effect(addr, **kwargs):
if addr == "1.1.1.1":
return {"is_primary": True, "uptime": 100, "advertise_url": "url1", "dbs": {"db1": {}}}
return {"is_primary": False, "uptime": 50, "advertise_url": "url2", "replication_lag": 10, "dbs": {"db1": {}}}
mock_litefs.side_effect = litefs_side_effect
mock_nomad_id.return_value = None
cluster_data = cluster_aggregator.get_cluster_status("http://consul:8500")
assert cluster_data["health"] == "Healthy"
assert len(cluster_data["nodes"]) == 2
node1 = next(n for n in cluster_data["nodes"] if n["node"] == "node1")
assert node1["litefs_primary"] is True
assert node1["role"] == "primary"
node2 = next(n for n in cluster_data["nodes"] if n["node"] == "node2")
assert node2["litefs_primary"] is False
assert node2["replication_lag"] == 10
@patch("consul_client.get_cluster_services")
@patch("litefs_client.get_node_status")
@patch("nomad_client.get_allocation_id")
@patch("nomad_client.get_allocation_logs")
@patch("nomad_client.get_node_map")
def test_aggregate_cluster_status_unhealthy(mock_node_map, mock_nomad_logs, mock_nomad_id, mock_litefs, mock_consul):
"""Test health calculation when nodes are critical."""
mock_node_map.return_value = {}
mock_consul.return_value = [
{"node": "node1", "address": "1.1.1.1", "role": "primary", "status": "critical"}
]
mock_litefs.return_value = {"is_primary": True, "uptime": 100, "dbs": {"db1": {}}}
mock_nomad_id.return_value = "alloc1"
mock_nomad_logs.return_value = "error logs"
cluster_data = cluster_aggregator.get_cluster_status("http://consul:8500")
assert cluster_data["health"] == "Unhealthy"
assert cluster_data["nodes"][0]["nomad_logs"] == "error logs"

@@ -0,0 +1,27 @@
import os
import pytest
import config
def test_default_consul_url():
"""Test that the default Consul URL is returned when no env var is set."""
# Ensure env var is not set
if "CONSUL_HTTP_ADDR" in os.environ:
del os.environ["CONSUL_HTTP_ADDR"]
assert config.get_consul_url() == "http://consul.service.dc1.consul:8500"
def test_env_var_consul_url():
"""Test that the environment variable overrides the default."""
os.environ["CONSUL_HTTP_ADDR"] = "http://10.0.0.1:8500"
try:
assert config.get_consul_url() == "http://10.0.0.1:8500"
finally:
del os.environ["CONSUL_HTTP_ADDR"]
def test_cli_arg_consul_url():
"""Test that the CLI argument overrides everything."""
os.environ["CONSUL_HTTP_ADDR"] = "http://10.0.0.1:8500"
try:
assert config.get_consul_url("http://cli-override:8500") == "http://cli-override:8500"
finally:
del os.environ["CONSUL_HTTP_ADDR"]

@@ -0,0 +1,108 @@
import pytest
from unittest.mock import patch, MagicMock
import consul_client
@patch("requests.get")
def test_get_cluster_services(mock_get):
"""Test fetching healthy services from Consul."""
# Mock responses for navidrome and replica-navidrome
mock_navidrome = [
{
"Node": {"Node": "node1", "Address": "192.168.1.101"},
"Service": {"Service": "navidrome", "Port": 4533, "ID": "navidrome-1"},
"Checks": [{"Status": "passing"}]
}
]
mock_replicas = [
{
"Node": {"Node": "node2", "Address": "192.168.1.102"},
"Service": {"Service": "replica-navidrome", "Port": 4533, "ID": "replica-1"},
"Checks": [{"Status": "passing"}]
},
{
"Node": {"Node": "node3", "Address": "192.168.1.103"},
"Service": {"Service": "replica-navidrome", "Port": 4533, "ID": "replica-2"},
"Checks": [{"Status": "critical"}] # One failing check
}
]
def side_effect(url, params=None, timeout=None):
if "health/service/navidrome" in url:
m = MagicMock()
m.json.return_value = mock_navidrome
m.raise_for_status.return_value = None
return m
elif "health/service/replica-navidrome" in url:
m = MagicMock()
m.json.return_value = mock_replicas
m.raise_for_status.return_value = None
return m
return MagicMock()
mock_get.side_effect = side_effect
consul_url = "http://consul:8500"
services = consul_client.get_cluster_services(consul_url)
# Should find 3 nodes total (node1 primary, node2 healthy replica, node3 critical replica)
assert len(services) == 3
# Check node1 (primary)
node1 = next(s for s in services if s["node"] == "node1")
assert node1["role"] == "primary"
assert node1["status"] == "passing"
assert node1["address"] == "192.168.1.101"
# Check node2 (healthy replica)
node2 = next(s for s in services if s["node"] == "node2")
assert node2["role"] == "replica"
assert node2["status"] == "passing"
# Check node3 (critical replica)
node3 = next(s for s in services if s["node"] == "node3")
assert node3["role"] == "replica"
assert node3["status"] == "critical"
@patch("requests.get")
def test_get_cluster_services_with_errors(mock_get):
"""Test fetching services with detailed health check output."""
mock_navidrome = [
{
"Node": {"Node": "node1", "Address": "192.168.1.101"},
"Service": {"Service": "navidrome", "Port": 4533, "ID": "navidrome-1"},
"Checks": [
{"Status": "passing", "Output": "HTTP GET http://192.168.1.101:4533/app: 200 OK"}
]
}
]
mock_replicas = [
{
"Node": {"Node": "node3", "Address": "192.168.1.103"},
"Service": {"Service": "replica-navidrome", "Port": 4533, "ID": "replica-2"},
"Checks": [
{"Status": "critical", "Output": "HTTP GET http://192.168.1.103:4533/app: 500 Internal Server Error"}
]
}
]
def side_effect(url, params=None, timeout=None):
if "health/service/navidrome" in url:
m = MagicMock()
m.json.return_value = mock_navidrome
m.raise_for_status.return_value = None
return m
elif "health/service/replica-navidrome" in url:
m = MagicMock()
m.json.return_value = mock_replicas
m.raise_for_status.return_value = None
return m
return MagicMock()
mock_get.side_effect = side_effect
services = consul_client.get_cluster_services("http://consul:8500")
node3 = next(s for s in services if s["node"] == "node3")
assert node3["status"] == "critical"
assert "500 Internal Server Error" in node3["check_output"]

@@ -0,0 +1,61 @@
import pytest
import output_formatter
def test_format_cluster_summary():
"""Test the summary string generation."""
cluster_data = {
"health": "Healthy",
"primary_count": 1,
"nodes": [],
"nomad_available": False
}
summary = output_formatter.format_summary(cluster_data)
assert "Healthy" in summary
assert "Primaries" in summary
assert "WARNING: Nomad CLI unavailable" in summary
def test_format_node_table():
"""Test the table generation."""
nodes = [
{
"node": "node1",
"role": "primary",
"status": "passing",
"uptime": 100,
"replication_lag": "N/A",
"litefs_primary": True
}
]
table = output_formatter.format_node_table(nodes, use_color=False)
assert "node1" in table
assert "primary" in table
assert "passing" in table
def test_format_diagnostics():
"""Test the diagnostics section generation."""
nodes = [
{
"node": "node3",
"status": "critical",
"check_output": "500 Internal Error",
"litefs_error": "Connection Timeout"
}
]
diagnostics = output_formatter.format_diagnostics(nodes, use_color=False)
assert "DIAGNOSTICS" in diagnostics
assert "node3" in diagnostics
assert "500 Internal Error" in diagnostics
assert "Connection Timeout" in diagnostics
def test_format_diagnostics_empty():
"""Test that diagnostics section is empty when no errors exist."""
nodes = [
{
"node": "node1",
"status": "passing",
"litefs_error": None
}
]
diagnostics = output_formatter.format_diagnostics(nodes, use_color=False)
assert diagnostics == ""

@@ -0,0 +1,84 @@
import pytest
from unittest.mock import patch, MagicMock
import litefs_client
@patch("requests.get")
def test_get_node_status_primary(mock_get):
"""Test fetching LiteFS status for a primary node via /status."""
mock_status = {
"clusterID": "cid1",
"primary": True,
"uptime": 3600,
"advertiseURL": "http://192.168.1.101:20202"
}
m = MagicMock()
m.status_code = 200
m.json.return_value = mock_status
mock_get.return_value = m
status = litefs_client.get_node_status("192.168.1.101")
assert status["is_primary"] is True
assert status["uptime"] == 3600
assert status["advertise_url"] == "http://192.168.1.101:20202"
@patch("requests.get")
def test_get_node_status_fallback(mock_get):
"""Test fetching LiteFS status via /debug/vars fallback."""
def side_effect(url, **kwargs):
m = MagicMock()
if "/status" in url:
m.status_code = 404
return m
elif "/debug/vars" in url:
m.status_code = 200
m.json.return_value = {
"store": {"isPrimary": True}
}
return m
return m
mock_get.side_effect = side_effect
status = litefs_client.get_node_status("192.168.1.101")
assert status["is_primary"] is True
assert status["uptime"] == "N/A"
assert status["advertise_url"] == "http://192.168.1.101:20202"
@patch("requests.get")
def test_get_node_status_error(mock_get):
"""Test fetching LiteFS status with a connection error."""
mock_get.side_effect = Exception("Connection failed")
status = litefs_client.get_node_status("192.168.1.101")
assert "error" in status
assert status["is_primary"] is False
@patch("nomad_client.exec_command")
def test_get_node_status_nomad_exec(mock_exec):
"""Test fetching LiteFS status via nomad alloc exec."""
# Mock LiteFS status output (text format)
mock_status_output = """
Config:
Path: /etc/litefs.yml
...
Status:
Primary: true
Uptime: 1h5m10s
Replication Lag: 0s
"""
mock_exec.return_value = mock_status_output
# We need to mock requests.get to fail first
with patch("requests.get") as mock_get:
mock_get.side_effect = Exception("HTTP failed")
status = litefs_client.get_node_status("1.1.1.1", alloc_id="abc12345")
assert status["is_primary"] is True
assert status["uptime"] == "1h5m10s"
# Since it's primary, lag might not be shown or be 0
assert status["replication_lag"] == "0s"

@@ -0,0 +1,19 @@
import pytest
from unittest.mock import patch, MagicMock
import cli
import sys
def test_arg_parsing_default():
"""Test that default arguments are parsed correctly."""
with patch.object(sys, 'argv', ['cli.py']):
args = cli.parse_args()
assert args.consul_url is None # Should use config default later
assert args.no_color is False
def test_arg_parsing_custom():
"""Test that custom arguments are parsed correctly."""
with patch.object(sys, 'argv', ['cli.py', '--consul-url', 'http://custom:8500', '--no-color', '--restart', 'node1']):
args = cli.parse_args()
assert args.consul_url == 'http://custom:8500'
assert args.no_color is True
assert args.restart == 'node1'

@@ -0,0 +1,91 @@
import pytest
from unittest.mock import patch, MagicMock
import nomad_client
import subprocess
@patch("subprocess.run")
@patch("nomad_client.get_node_map")
def test_get_allocation_id(mock_node_map, mock_run):
"""Test getting allocation ID for a node."""
mock_node_map.return_value = {"node_id1": "node1"}
# Mock 'nomad job status navidrome-litefs' output
mock_job_status = MagicMock()
mock_job_status.stdout = """
Allocations
ID Node ID Task Group Version Desired Status Created Modified
abc12345 node_id1 navidrome 1 run running 1h ago 1h ago
"""
# Mock 'nomad alloc status abc12345' output
mock_alloc_status = MagicMock()
mock_alloc_status.stdout = "ID = abc12345-full-id"
mock_run.side_effect = [mock_job_status, mock_alloc_status]
alloc_id = nomad_client.get_allocation_id("node1", "navidrome-litefs")
assert alloc_id == "abc12345-full-id"
@patch("subprocess.run")
def test_get_logs(mock_run):
"""Test fetching logs for an allocation."""
mock_stderr = "Error: database is locked\nSome other error"
m = MagicMock()
m.stdout = mock_stderr
m.returncode = 0
mock_run.return_value = m
logs = nomad_client.get_allocation_logs("abc12345", tail=20)
assert "database is locked" in logs
# It should have tried with -task navidrome first
mock_run.assert_any_call(
["nomad", "alloc", "logs", "-stderr", "-task", "navidrome", "-n", "20", "abc12345"],
capture_output=True, text=True, check=True
)
@patch("subprocess.run")
def test_restart_allocation(mock_run):
"""Test restarting an allocation."""
m = MagicMock()
m.returncode = 0
mock_run.return_value = m
success = nomad_client.restart_allocation("abc12345")
assert success is True
mock_run.assert_called_with(
["nomad", "alloc", "restart", "abc12345"],
capture_output=True, text=True, check=True
)
@patch("subprocess.run")
def test_exec_command(mock_run):
"""Test executing a command in an allocation."""
m = MagicMock()
m.stdout = "Command output"
m.returncode = 0
mock_run.return_value = m
output = nomad_client.exec_command("abc12345", ["ls", "/data"])
assert output == "Command output"
mock_run.assert_called_with(
["nomad", "alloc", "exec", "-task", "navidrome", "abc12345", "ls", "/data"],
capture_output=True, text=True, check=True
)
@patch("subprocess.run")
def test_exec_command_failure(mock_run):
"""Test executing a command handles failure gracefully."""
mock_run.side_effect = subprocess.CalledProcessError(1, "nomad", stderr="Nomad error")
output = nomad_client.exec_command("abc12345", ["ls", "/data"])
assert "Nomad Error" in output
assert "Nomad error" not in output # The exception str might not contain stderr directly depending on python version
@patch("subprocess.run")
def test_get_node_map_failure(mock_run):
"""Test get_node_map handles failure."""
mock_run.side_effect = FileNotFoundError("No such file")
# It should not raise
node_map = nomad_client.get_node_map()
assert node_map == {}