Compare commits: `29c9c697ee...main` (69 commits)
| SHA1 |
|---|
| 045fc6e82b |
| e56fb94fdc |
| decb9f5860 |
| 76398dec99 |
| 2746b4a550 |
| 8c58004d1c |
| ad49e12368 |
| 1c693aade4 |
| 21e9c3d72d |
| c5a3cbfeb8 |
| 655a9b2571 |
| 079498caba |
| 353683e2bf |
| 9b6159a40c |
| ed8e7608f1 |
| 3c5968690c |
| 139016f121 |
| d97730174d |
| 27b10a39b8 |
| 51b8fce10b |
| f7fe258480 |
| 5a557145ac |
| 01fa06e7dc |
| f8a2a587d5 |
| bfe7ade47c |
| 8e109e6fb5 |
| 640a76bbd1 |
| 4fc5fc3d9d |
| 82db5794dd |
| 7041b53fd3 |
| 5e1c7c116e |
| 9cd5455378 |
| e9b26f6eb9 |
| 5cbb657c45 |
| c9bf1410c9 |
| ef91b8e9af |
| f51baf7949 |
| 396dfeb7a3 |
| ee297f8d8e |
| e9c52d2847 |
| b8a12b1285 |
| 6d1b473efa |
| 860000bd04 |
| 22ec8a5cc0 |
| 6d77729a4a |
| a686c5b225 |
| 7c0c146d0c |
| 3c4c1c4d80 |
| f367f93768 |
| c7e7c9fd7b |
| cbd109a8bc |
| c0dcb1a47d |
| 20d99be67d |
| 16aad2958a |
| 90ffed531f |
| 1749cc29ee |
| e71d5e2ffc |
| 3eaa84872c |
| 0af6a3270e |
| 264d518e6e |
| dec669e89b |
| 99d9e00fc9 |
| 41874f0ace |
| a10a50a6ea |
| cd6a6efb80 |
| 20547c6c25 |
| 8568f4e9b9 |
| 5a5790205b |
| 9b2b09b02a |
`.gitignore` (vendored, new file, 3 lines)

```diff
@@ -0,0 +1,3 @@
+.venv/
+__pycache__/
+*.pyc
```
`Dockerfile` (17 lines changed)

```diff
@@ -4,17 +4,22 @@ FROM flyio/litefs:0.5 AS litefs
 # Stage 2: Final image
 FROM ghcr.io/navidrome/navidrome:latest
 
-# Install FUSE and CA certificates (needed for LiteFS)
+# Install dependencies
 USER root
-RUN apk add --no-cache fuse3 ca-certificates
+RUN apk add --no-cache fuse3 ca-certificates bash curl jq
 
 # Copy LiteFS binary
 COPY --from=litefs /usr/local/bin/litefs /usr/local/bin/litefs
 
+# Copy scripts
+COPY entrypoint.sh /usr/local/bin/entrypoint.sh
+RUN chmod +x /usr/local/bin/entrypoint.sh
+
 # Copy LiteFS configuration
 COPY litefs.yml /etc/litefs.yml
 
-# We'll use environment variables for most LiteFS settings,
-# but the baked-in config provides the structure.
-# LiteFS will mount the FUSE fs and then execute Navidrome.
-ENTRYPOINT ["litefs", "mount", "--", "/app/navidrome"]
+# LiteFS becomes the supervisor.
+
+# It will mount the FUSE fs and then execute the command defined in litefs.yml's exec section.
+
+ENTRYPOINT ["litefs", "mount"]
```
`check-fw.nomad` (new file, 22 lines)

```diff
@@ -0,0 +1,22 @@
+job "check-firewall" {
+  datacenters = ["dc1"]
+  type        = "batch"
+
+  group "check" {
+    count = 1
+    constraint {
+      attribute = "${attr.unique.hostname}"
+      value     = "odroid7"
+    }
+
+    task "check" {
+      driver = "docker"
+      config {
+        image        = "busybox"
+        network_mode = "host"
+        command      = "sh"
+        args         = ["-c", "echo 'UFW is not installed in busybox, checking port 20202 from outside'"]
+      }
+    }
+  }
+}
```
```diff
@@ -3,11 +3,11 @@ job "cleanup-litefs-all" {
   type = "batch"
 
   group "cleanup" {
-    count = 2
+    count = 4
     constraint {
      attribute = "${attr.unique.hostname}"
      operator  = "regexp"
-      value     = "odroid7|odroid8"
+      value     = "odroid6|odroid7|odroid8|opti1"
    }
 
    task "clean" {
```
```diff
@@ -0,0 +1,5 @@
+# Track cluster_status_python_20260208 Context
+
+- [Specification](./spec.md)
+- [Implementation Plan](./plan.md)
+- [Metadata](./metadata.json)
```
```diff
@@ -0,0 +1,8 @@
+{
+  "track_id": "cluster_status_python_20260208",
+  "type": "feature",
+  "status": "new",
+  "created_at": "2026-02-08T15:00:00Z",
+  "updated_at": "2026-02-08T15:00:00Z",
+  "description": "create a script that runs on my local system (i don't run consul locally) that: - check consul services are registered correctly - diplays the expected state (who is primary, what replicas exist) - show basic litefs status info for each node"
+}
```
`conductor/archive/cluster_status_python_20260208/plan.md` (new file, 31 lines)

```diff
@@ -0,0 +1,31 @@
+# Plan: Cluster Status Script (`cluster_status_python`)
+
+## Phase 1: Environment and Project Structure [x] [checkpoint: e71d5e2]
+- [x] Task: Initialize Python project structure (venv, requirements.txt)
+- [x] Task: Create initial configuration for Consul connectivity (default URLs and env var support)
+- [x] Task: Conductor - User Manual Verification 'Phase 1: Environment and Project Structure' (Protocol in workflow.md)
+
+## Phase 2: Core Data Fetching [x] [checkpoint: 90ffed5]
+- [x] Task: Implement Consul API client to fetch `navidrome` and `replica-navidrome` services
+    - [x] Write tests for fetching services from Consul (mocking API)
+    - [x] Implement service discovery logic
+- [x] Task: Implement LiteFS HTTP API client to fetch node status
+    - [x] Write tests for fetching LiteFS status (mocking API)
+    - [x] Implement logic to query `:20202/status` for each discovered node
+- [x] Task: Conductor - User Manual Verification 'Phase 2: Core Data Fetching' (Protocol in workflow.md)
+
+## Phase 3: Data Processing and Formatting [x] [checkpoint: 20d99be]
+- [x] Task: Implement data aggregation logic
+    - [x] Write tests for aggregating Consul and LiteFS data into a single cluster state object
+    - [x] Implement logic to calculate overall cluster health and role assignment
+- [x] Task: Implement CLI output formatting (Table and Color)
+    - [x] Write tests for table formatting and color-coding logic
+    - [x] Implement `tabulate` based output with a health summary
+- [x] Task: Conductor - User Manual Verification 'Phase 3: Data Processing and Formatting' (Protocol in workflow.md)
+
+## Phase 4: CLI Interface and Final Polishing [x]
+- [x] Task: Implement command-line arguments (argparse)
+    - [x] Write tests for CLI argument parsing (Consul URL overrides, etc.)
+    - [x] Finalize the `main` entry point
+- [x] Task: Final verification of script against requirements
+- [x] Task: Conductor - User Manual Verification 'Phase 4: CLI Interface and Final Polishing' (Protocol in workflow.md)
```
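Phase 3's color-coding task amounts to wrapping table cells in ANSI escape codes before handing rows to the layout routine. A minimal sketch of that piece (the names here are illustrative, not the script's actual API; the real formatter delegates layout to `tabulate`):

```python
# Map each Consul-style health status to an ANSI color; anything else
# (e.g. an unknown status string) is passed through uncolored.
ANSI = {"passing": "\033[32m", "warning": "\033[33m", "critical": "\033[31m"}
RESET = "\033[0m"


def colorize(status):
    """Wrap a health status in its ANSI color (green/yellow/red)."""
    color = ANSI.get(status, "")
    return f"{color}{status}{RESET}" if color else status
```

Cells built this way can be fed straight into `tabulate`, which leaves the escape sequences intact.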
`conductor/archive/cluster_status_python_20260208/spec.md` (new file, 40 lines)

```diff
@@ -0,0 +1,40 @@
+# Specification: Cluster Status Script (`cluster_status_python`)
+
+## Overview
+Create a Python-based CLI script to be run on a local system (outside the cluster) to monitor the health and status of the Navidrome LiteFS/Consul cluster. This tool will bridge the gap for local monitoring without needing a local Consul instance.
+
+## Functional Requirements
+- **Consul Connectivity:**
+    - Connect to a remote Consul instance.
+    - Default to a hardcoded URL with support for overrides via command-line arguments (e.g., `--consul-url`) or environment variables (`CONSUL_HTTP_ADDR`).
+    - Assume no Consul authentication token is required.
+- **Service Discovery:**
+    - Query Consul for the `navidrome` (Primary) and `replica-navidrome` (Replica) services.
+    - Verify that services are registered correctly and health checks are passing.
+- **Status Reporting:**
+    - Display a text-based table summarizing the state of all nodes in the cluster.
+    - Color-coded output for quick health assessment.
+    - Include a summary section at the top indicating overall cluster health.
+- **Node-Level Details:**
+    - Role identification (Primary vs. Replica).
+    - Uptime of the LiteFS process.
+    - Advertise URL for each node.
+    - Replication Lag (for Replicas).
+    - Write-forwarding proxy target (for Replicas).
+
+## Non-Functional Requirements
+- **Language:** Python 3.x.
+- **Dependencies:** Use standard libraries or common packages like `requests` for API calls and `tabulate` for table formatting.
+- **Portability:** Must run on Linux (user's OS) without requiring local Consul or Nomad binaries.
+
+## Acceptance Criteria
+- [ ] Script successfully retrieves service list from remote Consul.
+- [ ] Script correctly identifies the current Primary node based on Consul tags/service names.
+- [ ] Script queries the LiteFS HTTP API (`:20202/status`) on each node to gather internal metrics.
+- [ ] Output is formatted as a clear, readable text table.
+- [ ] Overrides for Consul URL are functional.
+
+## Out of Scope
+- Direct interaction with Nomad API (Consul is the source of truth for this script).
+- Database-level inspection (SQL queries).
+- Remote log tailing.
```
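The service-discovery requirement above maps onto Consul's `/v1/health/service/<name>` endpoint. A minimal sketch of the reduction step, assuming the response has already been fetched (e.g. with `requests.get(f"{consul_url}/v1/health/service/navidrome").json()`); function and field names are illustrative, not the script's actual API:

```python
def summarize_service(entries, role):
    """Reduce Consul /v1/health/service/<name> JSON to per-node rows."""
    rows = []
    for entry in entries:
        statuses = {check["Status"] for check in entry.get("Checks", [])}
        rows.append({
            "node": entry["Node"]["Node"],
            # A service may register its own address; fall back to the node's.
            "address": entry["Service"].get("Address") or entry["Node"]["Address"],
            "role": role,
            # Consul's aggregation rule: any critical wins, then warning.
            "status": ("critical" if "critical" in statuses
                       else "warning" if "warning" in statuses
                       else "passing"),
        })
    return rows
```

Rows for `navidrome` (role "Primary") and `replica-navidrome` (role "Replica") can then be concatenated into the single cluster-state table the spec calls for.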
`conductor/archive/diagnose_and_enhance_20260208/index.md` (new file, 5 lines)

```diff
@@ -0,0 +1,5 @@
+# Track diagnose_and_enhance_20260208 Context
+
+- [Specification](./spec.md)
+- [Implementation Plan](./plan.md)
+- [Metadata](./metadata.json)
```
```diff
@@ -0,0 +1,8 @@
+{
+  "track_id": "diagnose_and_enhance_20260208",
+  "type": "feature",
+  "status": "new",
+  "created_at": "2026-02-08T16:00:00Z",
+  "updated_at": "2026-02-08T16:00:00Z",
+  "description": "diagnose why odroid8 shows as critical - why is the script not showing uptime or replication lag?"
+}
```
`conductor/archive/diagnose_and_enhance_20260208/plan.md` (new file, 30 lines)

```diff
@@ -0,0 +1,30 @@
+# Plan: Cluster Diagnosis and Script Enhancement (`diagnose_and_enhance`)
+
+## Phase 1: Enhanced Diagnostics (Consul) [x] [checkpoint: a686c5b]
+- [x] Task: Update `consul_client.py` to fetch detailed health check output
+    - [x] Write tests for fetching `Output` field from Consul checks
+    - [x] Implement logic to extract and store the `Output` (error message)
+- [x] Task: Update aggregator and formatter to display Consul errors
+    - [x] Update aggregation logic to include `consul_error`
+    - [x] Update table formatter to indicate an error (maybe a flag or color)
+    - [x] Add a "Diagnostics" section to the output to print full error details
+- [x] Task: Conductor - User Manual Verification 'Phase 1: Enhanced Diagnostics (Consul)' (Protocol in workflow.md)
+
+## Phase 2: Nomad Integration and Logs [x] [checkpoint: 6d77729]
+- [x] Task: Implement `nomad_client.py` wrapper
+    - [x] Write tests for `get_allocation_logs`, `get_node_status`, and `restart_allocation` (mocking subprocess)
+    - [x] Implement `subprocess.run(["nomad", ...])` logic to fetch logs and restart allocations
+- [x] Task: Integrate Nomad logs into diagnosis
+    - [x] Update aggregator to call Nomad client for critical nodes
+    - [x] Update "Diagnostics" section to display the last 20 lines of stderr
+- [x] Task: Conductor - User Manual Verification 'Phase 2: Nomad Integration and Logs' (Protocol in workflow.md)
+
+## Phase 3: Advanced LiteFS Status [ ]
+- [x] Task: Implement `litefs_status` via `nomad alloc exec`
+    - [x] Write tests for executing remote commands via Nomad
+    - [x] Update `litefs_client.py` to fallback to `nomad alloc exec` if HTTP fails
+    - [x] Parse `litefs status` output (text/json) to extract uptime and replication lag
+- [x] Task: Final Polish and Diagnosis Run
+    - [x] Ensure all pieces work together
+    - [x] Run the script to diagnose `odroid8`
+- [~] Task: Conductor - User Manual Verification 'Phase 3: Advanced LiteFS Status' (Protocol in workflow.md)
```
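Phase 2's log-retrieval task can be sketched as a thin wrapper around the `nomad alloc logs -stderr` CLI. This is illustrative only: the function name and the injectable `runner` parameter are assumptions (mirroring the plan's "mocking subprocess" tests), not the actual `nomad_client.py` interface:

```python
import subprocess


def tail_stderr(alloc_id, n=20, runner=subprocess.run):
    """Fetch the last n stderr lines for a Nomad allocation, or None on failure.

    `runner` exists so tests can inject a fake instead of shelling out
    to a real nomad binary.
    """
    cmd = ["nomad", "alloc", "logs", "-stderr", alloc_id]
    try:
        result = runner(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # Missing binary or non-zero exit: caller renders "Nomad Error".
        return None
    return result.stdout.splitlines()[-n:]
```

The aggregator would call this only for nodes already marked critical, keeping the happy path free of subprocess overhead.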
`conductor/archive/diagnose_and_enhance_20260208/spec.md` (new file, 32 lines)

```diff
@@ -0,0 +1,32 @@
+# Specification: Cluster Diagnosis and Script Enhancement (`diagnose_and_enhance`)
+
+## Overview
+Investigate the root cause of the "critical" status on node `odroid8` and enhance the `cluster_status` script to provide deeper diagnostic information, including Consul health check details, Nomad logs, and LiteFS status via `nomad alloc exec`.
+
+## Functional Requirements
+- **Consul Diagnostics:**
+    - Update `consul_client.py` to fetch the `Output` field from Consul health checks.
+    - Display the specific error message from Consul when a node is in a non-passing state.
+- **Nomad Integration:**
+    - Implement a `nomad_client.py` (or similar) to interact with the Nomad CLI.
+    - **Log Retrieval:** For nodes with health issues, fetch the last 20 lines of `stderr` from the relevant Nomad allocation.
+    - **Internal Status:** Implement a method to run `litefs status` inside the container using `nomad alloc exec` to retrieve uptime and replication lag when the HTTP API is unavailable.
+- **LiteFS API Investigation:**
+    - Investigate why the `/status` endpoint returns a 404 and attempt to resolve it via configuration or by identifying the correct endpoint for LiteFS 0.5.
+- **Output Formatting:**
+    - Update the CLI table to display the retrieved uptime and replication lag.
+    - Add a "Diagnostics" section at the bottom of the output when errors are detected, showing the Consul check output and Nomad logs.
+
+## Non-Functional Requirements
+- **CLI Dependency:** The script assumes the `nomad` CLI is installed and configured in the local system's PATH.
+- **Language:** Python 3.x.
+
+## Acceptance Criteria
+- [ ] Script displays the specific reason for `odroid8`'s critical status from Consul.
+- [ ] Script successfully retrieves and displays Nomad stderr logs for failing nodes.
+- [ ] Script displays LiteFS uptime and replication lag (retrieved via `alloc exec` or HTTP).
+- [ ] The root cause of `odroid8`'s failure is identified and reported.
+
+## Out of Scope
+- Automatic fixing of the identified issues (diagnosis only).
+- Direct SSH into nodes (using Nomad/Consul APIs/CLIs only).
```
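The Consul-diagnostics requirement hinges on the `Output` field Consul attaches to each check in its `/v1/health/service/<name>` responses. A hedged sketch of that extraction step (names are illustrative; the real `consul_client.py` interface may differ):

```python
def collect_check_errors(entries):
    """Map node name -> list of 'check: output' strings for non-passing checks."""
    errors = {}
    for entry in entries:
        for check in entry.get("Checks", []):
            if check.get("Status") != "passing":
                errors.setdefault(entry["Node"]["Node"], []).append(
                    f"{check.get('Name', '?')}: {check.get('Output', '').strip()}"
                )
    return errors
```

The resulting mapping feeds the "Diagnostics" section at the bottom of the table output, so the summary rows stay compact while full error text remains available.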
`conductor/archive/fix_litefs_config_20260208/index.md` (new file, 5 lines)

```diff
@@ -0,0 +1,5 @@
+# Track fix_litefs_config_20260208 Context
+
+- [Specification](./spec.md)
+- [Implementation Plan](./plan.md)
+- [Metadata](./metadata.json)
```
```diff
@@ -0,0 +1,8 @@
+{
+  "track_id": "fix_litefs_config_20260208",
+  "type": "feature",
+  "status": "new",
+  "created_at": "2026-02-08T18:00:00Z",
+  "updated_at": "2026-02-08T18:00:00Z",
+  "description": "Fix LiteFS configuration to use 'exec' for Navidrome and ensure it only runs on the Primary node. Also fix DB path configuration."
+}
```
`conductor/archive/fix_litefs_config_20260208/plan.md` (new file, 22 lines)

```diff
@@ -0,0 +1,22 @@
+# Plan: Fix LiteFS Configuration and Process Management (`fix_litefs_config`)
+
+## Phase 1: Configuration and Image Structure [ ]
+- [x] Task: Update `litefs.yml` to include the `exec` block (396dfeb)
+- [x] Task: Update `Dockerfile` to use LiteFS as the supervisor (`ENTRYPOINT ["litefs", "mount"]`) (ef91b8e)
+- [x] Task: Update `navidrome-litefs-v2.nomad` with corrected storage paths (`ND_DATAFOLDER`, `ND_CACHEFOLDER`, `ND_BACKUP_PATH`) (5cbb657)
+- [ ] Task: Conductor - User Manual Verification 'Phase 1: Configuration and Image Structure' (Protocol in workflow.md)
+
+## Phase 2: Entrypoint and Registration Logic [x] [checkpoint: 9cd5455]
+- [x] Task: Refactor `entrypoint.sh` to handle leadership-aware process management (9cd5455)
+    - [x] Integrate Consul registration logic (from `register.sh`)
+    - [x] Implement loop to start/stop Navidrome based on `/data/.primary` existence
+    - [x] Ensure proper signal handling for Navidrome shutdown
+- [x] Task: Clean up redundant scripts (e.g., `register.sh` if fully integrated) (9cd5455)
+- [ ] Task: Conductor - User Manual Verification 'Phase 2: Entrypoint and Registration Logic' (Protocol in workflow.md)
+
+## Phase 3: Deployment and Failover Verification [ ]
+- [~] Task: Build and push the updated Docker image via Gitea Actions (if possible) or manual trigger
+- [~] Task: Deploy the updated Nomad job
+- [ ] Task: Verify cluster health and process distribution using `cluster_status` script
+- [ ] Task: Perform a manual failover (stop primary allocation) and verify Navidrome migrates correctly
+- [ ] Task: Conductor - User Manual Verification 'Phase 3: Deployment and Failover Verification' (Protocol in workflow.md)
```
`conductor/archive/fix_litefs_config_20260208/spec.md` (new file, 38 lines)

```diff
@@ -0,0 +1,38 @@
+# Specification: Fix LiteFS Configuration and Process Management (`fix_litefs_config`)
+
+## Overview
+Reconfigure the Navidrome/LiteFS process management to ensure Navidrome and its Consul service registration only occur on the Primary node. This will be achieved by leveraging the LiteFS `exec` block and updating the `entrypoint.sh` logic. Additionally, correct the Navidrome database and storage paths to properly utilize the LiteFS replicated mount.
+
+## Functional Requirements
+- **LiteFS Configuration (`litefs.yml`):**
+    - Enable the `exec` block to trigger `/usr/local/bin/entrypoint.sh`.
+    - This allows LiteFS to manage the lifecycle of the application.
+- **Entrypoint Logic (`entrypoint.sh`):**
+    - Implement a supervision loop that monitors leadership via the `/data/.primary` file.
+    - **On Primary:**
+        - Register the node as the `navidrome` (primary) service in Consul.
+        - Start the Navidrome process.
+    - **On Replica:**
+        - Ensure Navidrome is NOT running.
+        - Deregister the `navidrome` primary service if previously registered.
+        - (Optional) Register as a replica service or simply wait.
+    - **On Transition:** Handle graceful shutdown of Navidrome if the node loses leadership.
+- **Storage and Path Configuration (`navidrome-litefs-v2.nomad`):**
+    - Set `ND_DATAFOLDER` to `/data` (the LiteFS FUSE mount).
+    - Set `ND_CACHEFOLDER` to `/shared_data/cache` (shared persistent storage).
+    - Set `ND_BACKUP_PATH` to `/shared_data/backup` (shared persistent storage).
+- **Dockerfile Updates:**
+    - Update `ENTRYPOINT` to `["litefs", "mount"]` to allow LiteFS to act as the supervisor.
+
+## Non-Functional Requirements
+- **Robustness:** Use a simple bash loop for process management to avoid extra dependencies.
+- **Signal Handling:** Ensure signals (SIGTERM) are correctly forwarded to Navidrome for graceful shutdown.
+
+## Acceptance Criteria
+- [ ] Navidrome process runs ONLY on the Primary node.
+- [ ] Consul service `navidrome` correctly points to the current Primary.
+- [ ] Navidrome database (`navidrome.db`) is confirmed to be on the `/data` mount.
+- [ ] Cluster failover correctly stops Navidrome on the old primary and starts it on the new one.
+
+## Out of Scope
+- Implementation of complex init systems like `tini` (bash loop selected by user).
```
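The graceful-shutdown requirement ("On Transition") is implemented in bash with `trap` and `kill`, but the terminate-then-escalate pattern itself is language-agnostic. A Python sketch for illustration only (the function name and timeout are assumptions, not the actual `entrypoint.sh` behavior):

```python
import subprocess


def stop_app(proc, timeout=10):
    """Forward SIGTERM to the app and wait; escalate to SIGKILL on timeout."""
    if proc.poll() is None:          # still running?
        proc.terminate()             # SIGTERM: let the app flush and exit
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()              # SIGKILL as a last resort
            proc.wait()
    return proc.returncode
```

The supervision loop would call this whenever the node loses leadership, before deregistering from Consul.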
```diff
@@ -0,0 +1,5 @@
+# Track fix_odroid8_and_script_20260208 Context
+
+- [Specification](./spec.md)
+- [Implementation Plan](./plan.md)
+- [Metadata](./metadata.json)
```
```diff
@@ -0,0 +1,8 @@
+{
+  "track_id": "fix_odroid8_and_script_20260208",
+  "type": "bug",
+  "status": "new",
+  "created_at": "2026-02-08T17:00:00Z",
+  "updated_at": "2026-02-08T17:00:00Z",
+  "description": "fix odroid8 node - stuck in a loop because of a LiteFS Cluster ID mismatch in the Consul lease. Also fix script execution errors."
+}
```
`conductor/archive/fix_odroid8_and_script_20260208/plan.md` (new file, 26 lines)

```diff
@@ -0,0 +1,26 @@
+# Plan: Fix Odroid8 and Script Robustness (`fix_odroid8_and_script`)
+
+## Phase 1: Script Robustness [x] [checkpoint: 860000b]
+- [x] Task: Update `nomad_client.py` to handle subprocess errors gracefully
+    - [x] Write tests for handling Nomad CLI absence/failure
+    - [x] Update implementation to return descriptive error objects or `None` without crashing
+- [x] Task: Update aggregator and formatter to handle Nomad errors
+    - [x] Update `cluster_aggregator.py` to gracefully skip Nomad calls if they fail
+    - [x] Update `output_formatter.py` to display "Nomad Error" in relevant cells
+    - [x] Add a global "Nomad Connectivity Warning" to the summary
+- [x] Task: Conductor - User Manual Verification 'Phase 1: Script Robustness' (Protocol in workflow.md)
+
+## Phase 2: Odroid8 Recovery [ ]
+- [x] Task: Identify and verify `odroid8` LiteFS data path
+    - [x] Run `nomad alloc status` to find the volume mount for `odroid8`
+    - [x] Provide the user with the exact host path to the LiteFS data
+- [x] Task: Guide user through manual cleanup
+    - [x] Provide steps to stop the allocation
+    - [x] Provide the `rm` command to clear the LiteFS metadata
+    - [x] Provide steps to restart and verify the node
+- [~] Task: Conductor - User Manual Verification 'Phase 2: Odroid8 Recovery' (Protocol in workflow.md)
+
+## Phase 3: Final Verification [x]
+- [x] Task: Final verification run of the script
+- [x] Task: Verify cluster health in Consul and LiteFS API
+- [x] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)
```
`conductor/archive/fix_odroid8_and_script_20260208/spec.md` (new file, 27 lines)

```diff
@@ -0,0 +1,27 @@
+# Specification: Fix Odroid8 and Script Robustness (`fix_odroid8_and_script`)
+
+## Overview
+Address the "critical" loop on node `odroid8` caused by a LiteFS Cluster ID mismatch and improve the `cluster_status` script's error handling when the Nomad CLI is unavailable or misconfigured.
+
+## Functional Requirements
+- **Node Recovery (`odroid8`):**
+    - Identify the specific LiteFS data directory on `odroid8` (usually `/var/lib/litefs` inside the container, mapped to a host path).
+    - Guide the user through stopping the allocation and wiping the metadata/data to resolve the Consul lease conflict.
+- **Script Robustness:**
+    - Update `nomad_client.py` to handle `subprocess` failures more gracefully.
+    - If a `nomad` command fails, the script should not print a traceback or confusing "non-zero exit status" messages to the primary output.
+    - Instead, it should log the error to `stderr` and continue, marking the affected fields (like logs or full ID) as "Nomad Error".
+    - Add a clear warning in the output if Nomad connectivity is lost, suggesting the user verify `NOMAD_ADDR`.
+
+## Non-Functional Requirements
+- **Reliability:** The script should remain functional even if one of the underlying tools (Nomad CLI) is broken.
+- **Ease of Use:** Provide clear, copy-pasteable commands for the manual cleanup process.
+
+## Acceptance Criteria
+- [ ] `odroid8` node successfully joins the cluster and shows as `passing` in Consul.
+- [ ] `cluster_status` script runs without error even if the `nomad` binary is missing or cannot connect to the server (showing fallback info).
+- [ ] Script provides a helpful message when `nomad` commands fail.
+
+## Out of Scope
+- Fixing the Navidrome database path issue (this will be handled in a separate track once the cluster is stable).
+- Automating the host-level cleanup (manual guidance only).
```
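The robustness requirement can be sketched as one guarded wrapper around every CLI call: errors go to stderr, the caller gets `None` and renders "Nomad Error". The function name and the generic `binary` parameter are illustrative; the real `nomad_client.py` may be shaped differently:

```python
import shutil
import subprocess
import sys


def safe_run(binary, args):
    """Run a CLI command; return its stdout, or None on any failure.

    Diagnostics go to stderr so the primary table output stays clean,
    e.g. safe_run("nomad", ["alloc", "status", alloc_id]).
    """
    if shutil.which(binary) is None:
        print(f"warning: {binary} not found in PATH (check install/NOMAD_ADDR)",
              file=sys.stderr)
        return None
    try:
        result = subprocess.run([binary, *args], capture_output=True,
                                text=True, check=True)
    except subprocess.CalledProcessError as exc:
        print(f"warning: {binary} exited {exc.returncode}", file=sys.stderr)
        return None
    return result.stdout
```

Because every failure mode collapses to `None`, the aggregator needs only one branch per field instead of try/except scaffolding at each call site.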
`conductor/archive/fix_routing_20260207/index.md` (new file, 5 lines)

```diff
@@ -0,0 +1,5 @@
+# Track fix_routing_20260207 Context
+
+- [Specification](./spec.md)
+- [Implementation Plan](./plan.md)
+- [Metadata](./metadata.json)
```
`conductor/archive/fix_routing_20260207/metadata.json` (new file, 8 lines)

```diff
@@ -0,0 +1,8 @@
+{
+  "track_id": "fix_routing_20260207",
+  "type": "bug",
+  "status": "new",
+  "created_at": "2026-02-07T17:36:00Z",
+  "updated_at": "2026-02-07T17:36:00Z",
+  "description": "fix routing - use litefs to register the navidrome service with consul. the serivce should point to the master and avoid the litefs proxy (it breaks navidrome)"
+}
```
`conductor/archive/fix_routing_20260207/plan.md` (new file, 25 lines)

```diff
@@ -0,0 +1,25 @@
+# Implementation Plan: Direct Primary Routing for Navidrome-LiteFS
+
+This plan outlines the steps to reconfigure the Navidrome-LiteFS cluster to bypass the LiteFS write-forwarding proxy and use direct primary node routing for improved reliability and performance.
+
+## Phase 1: Infrastructure Configuration Update [checkpoint: 5a57902]
+In this phase, we will modify the Nomad job and LiteFS configuration to support direct port access and primary-aware health checks.
+
+- [x] Task: Update `navidrome-litefs-v2.nomad` to point service directly to Navidrome port
+    - [x] Modify `service` block to use port 4533 instead of dynamic mapped port.
+    - [x] Replace HTTP health check with a script check running `litefs is-primary`.
+- [x] Task: Update `litefs.yml` to ensure consistent internal API binding (if needed)
+- [x] Task: Conductor - User Manual Verification 'Infrastructure Configuration Update' (Protocol in workflow.md)
+
+## Phase 2: Deployment and Validation
+In this phase, we will deploy the changes and verify that the cluster correctly handles primary election and routing.
+
+- [x] Task: Deploy updated Nomad job
+    - [x] Execute `nomad job run navidrome-litefs-v2.nomad`.
+- [x] Task: Verify Consul health status
+    - [x] Confirm that only the LiteFS primary node is marked as `passing`.
+    - [x] Confirm that replica nodes are marked as `critical`.
+- [x] Task: Verify Ingress Routing
+    - [x] Confirm Traefik correctly routes traffic only to the primary node.
+    - [x] Verify that Navidrome is accessible and functional.
+- [x] Task: Conductor - User Manual Verification 'Deployment and Validation' (Protocol in workflow.md)
```
`conductor/archive/fix_routing_20260207/spec.md` (new file, 26 lines)

```diff
@@ -0,0 +1,26 @@
+# Specification: Direct Primary Routing for Navidrome-LiteFS
+
+## Overview
+This track aims to fix routing issues caused by the LiteFS proxy. We will reconfigure the Nomad service registration to point directly to the Navidrome process (port 4533) on the primary node, bypassing the LiteFS write-forwarding proxy (port 8080). To ensure Traefik only routes traffic to the node capable of writes, we will implement a "Primary-only" health check.
+
+## Functional Requirements
+- **Direct Port Mapping:** Update the Nomad `service` block to use the host port `4533` directly instead of the LiteFS proxy port.
+- **Primary-Aware Health Check:** Replace the standard HTTP health check with a script check.
+    - **Check Logic:** The script will execute `litefs is-primary`.
+    - If the node is the primary, the command exits with `0` (Passing).
+    - If the node is a replica, the command exits with a non-zero code (Critical).
+- **Service Tags:** Retain all existing Traefik tags so ingress routing continues to work.
+
+## Non-Functional Requirements
+- **Failover Reliability:** In the event of a leader election, the old primary must become unhealthy and the new primary must become healthy in Consul, allowing Traefik to update its backends automatically.
+- **Minimal Latency:** Bypassing the proxy eliminates the extra network hop for reads and potential compatibility issues with Navidrome's connection handling.
+
+## Acceptance Criteria
+- [ ] Consul reports the service as `passing` only on the node currently holding the LiteFS primary lease.
+- [ ] Consul reports the service as `critical` on all replica nodes.
+- [ ] Traefik correctly routes traffic to the primary node.
+- [ ] Navidrome is accessible and functions correctly without the LiteFS proxy intermediary.
+
+## Out of Scope
+- Modifying Navidrome internal logic.
+- Implementing an external health-check responder.
```
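The real change lives in the Nomad job's `service`/`check` stanza, but the same primary-aware check can be expressed as a Consul agent registration payload, which makes the mechanics explicit. A sketch for illustration only; the check interval/timeout values and the helper name are assumptions:

```python
def primary_only_registration(node_address, port=4533):
    """Payload for Consul's PUT /v1/agent/service/register with a script
    check that passes only while `litefs is-primary` exits 0.

    Illustrative: the cluster actually defines this in the Nomad job,
    not via direct agent registration.
    """
    return {
        "Name": "navidrome",
        "Address": node_address,
        "Port": port,  # Navidrome directly, not the LiteFS proxy (8080)
        "Check": {
            "Args": ["litefs", "is-primary"],  # exit 0 => passing
            "Interval": "10s",
            "Timeout": "5s",
        },
    }
```

With this shape, only the node holding the LiteFS lease is `passing`, so Traefik's Consul catalog backend converges on the primary automatically after a failover.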
```diff
@@ -0,0 +1,5 @@
+# Track implement_ttl_heartbeat_20260208 Context
+
+- [Specification](./spec.md)
+- [Implementation Plan](./plan.md)
+- [Metadata](./metadata.json)
```
```diff
@@ -0,0 +1,8 @@
+{
+  "track_id": "implement_ttl_heartbeat_20260208",
+  "type": "enhancement",
+  "status": "new",
+  "created_at": "2026-02-08T19:00:00Z",
+  "updated_at": "2026-02-08T19:00:00Z",
+  "description": "Implement TTL Heartbeat architecture for robust Consul service registration and cleaner failure handling."
+}
```
`conductor/archive/implement_ttl_heartbeat_20260208/plan.md` (new file, 22 lines)

```diff
@@ -0,0 +1,22 @@
+# Plan: Implement TTL Heartbeat Service Registration (`implement_ttl_heartbeat`)
+
+## Phase 1: Container Environment Preparation [x] [checkpoint: 51b8fce]
+- [x] Task: Update `Dockerfile` to install `curl` and `jq` (f7fe258)
+- [x] Task: Verify `litefs.yml` points to `entrypoint.sh` (should already be correct) (verified)
+- [x] Task: Conductor - User Manual Verification 'Phase 1: Container Environment Preparation' (Protocol in workflow.md)
+
+## Phase 2: Script Implementation [x] [checkpoint: 139016f]
+- [x] Task: Refactor `entrypoint.sh` with the TTL Heartbeat logic (d977301)
+    - [x] Implement `register_service` with TTL check definition
+    - [x] Implement `pass_ttl` loop
+    - [x] Implement robust `stop_app` and signal trapping
+    - [x] Ensure correct Primary/Replica detection logic (LiteFS 0.5: Primary = No `.primary` file)
+- [x] Task: Conductor - User Manual Verification 'Phase 2: Script Implementation' (Protocol in workflow.md)
+
+## Phase 3: Deployment and Verification [ ]
+- [~] Task: Commit changes and push to Gitea to trigger build
+- [ ] Task: Monitor Gitea build completion
+- [ ] Task: Deploy updated Nomad job (forcing update if necessary)
+- [ ] Task: Verify "Clean" state in Consul (only one primary registered)
+- [ ] Task: Verify Failover/Stop behavior (immediate deregistration vs TTL expiry)
+- [ ] Task: Conductor - User Manual Verification 'Phase 3: Deployment and Verification' (Protocol in workflow.md)
```
32
conductor/archive/implement_ttl_heartbeat_20260208/spec.md
Normal file
@@ -0,0 +1,32 @@
# Specification: Implement TTL Heartbeat Service Registration (`implement_ttl_heartbeat`)

## Overview
Replace the current "register and forget" Consul registration logic with a robust "TTL Heartbeat" pattern. This ensures that only the active Primary node is registered in Consul, and service entries are automatically removed (deregistered) if the node crashes, failover occurs, or Nomad stops the allocation.

## Functional Requirements
- **Supervisor Script (`entrypoint.sh`):**
  - Refactor to implement the "Self-Registration" pattern with TTL checks.
  - **Leadership Detection:** Monitor `/data/.primary` (LiteFS 0.5).
    - **Primary:** Absence of file. Start Navidrome, register service with TTL.
    - **Replica:** Presence of file. Stop Navidrome, deregister service.
  - **Heartbeat:** Periodically (e.g., every 5-10s) PUT to Consul to pass the TTL check while Primary.
  - **Signal Handling:** Trap `SIGTERM`/`SIGINT` to gracefully stop Navidrome and deregister immediately.
- **Docker Image:**
  - Ensure `curl` and `jq` are installed (prerequisites for the script).
- **Nomad Configuration:**
  - Ensure `NOMAD_IP_http` and `NOMAD_PORT_http` are accessible to the task (standard, but worth verifying).

## Non-Functional Requirements
- **Resilience:** The script must handle Consul unavailability gracefully (retries) without crashing the application loop.
- **Cleanliness:** No "ghost" services. Replicas must not appear in the service catalog.

## Acceptance Criteria
- [ ] Navidrome runs ONLY on the Primary.
- [ ] Only ONE `navidrome` service is registered in Consul (pointing to the Primary).
- [ ] Stopping the Primary allocation results in immediate deregistration (via trap).
- [ ] Hard killing the Primary allocation results in deregistration after TTL expires (approx 15s).
- [ ] Replicas do not register any service.

## Implementation Details
- **Script Name:** We will stick with `entrypoint.sh` for consistency with the `litefs.yml` configuration, refactoring its content.
- **Service ID:** Use `navidrome-${NOMAD_ALLOC_ID}` to ensure uniqueness and traceability.
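The leadership-detection and registration rules in this spec can be modeled in a few lines of Python (the real `entrypoint.sh` is a shell script; this is only an illustrative sketch, and payload fields beyond the basic TTL check, such as `DeregisterCriticalServiceAfter`, are assumptions):

```python
import os


def is_primary(data_dir="/data"):
    # LiteFS 0.5 semantics: the node is Primary exactly when the
    # .primary file is ABSENT from the LiteFS mount directory.
    return not os.path.exists(os.path.join(data_dir, ".primary"))


def service_payload(alloc_id, ip, port, ttl="15s"):
    # Body for Consul's PUT /v1/agent/service/register, with a TTL check
    # so Consul marks the service critical if heartbeats stop (hard-kill case).
    sid = f"navidrome-{alloc_id}"
    return {
        "ID": sid,
        "Name": "navidrome",
        "Address": ip,
        "Port": port,
        "Check": {
            "CheckID": f"ttl-{sid}",
            "TTL": ttl,  # approx 15s per the acceptance criteria
            "DeregisterCriticalServiceAfter": "1m",  # assumed value
        },
    }
```

While Primary, the heartbeat loop would PUT to `/v1/agent/check/pass/ttl-<service-id>` every 5-10 seconds; the `SIGTERM` trap deregisters immediately instead of waiting for TTL expiry.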
@@ -0,0 +1,5 @@
# Track update_monitor_discovery_20260208 Context

- [Specification](./spec.md)
- [Implementation Plan](./plan.md)
- [Metadata](./metadata.json)
@@ -0,0 +1,8 @@
{
  "track_id": "update_monitor_discovery_20260208",
  "type": "enhancement",
  "status": "new",
  "created_at": "2026-02-08T20:00:00Z",
  "updated_at": "2026-02-08T20:00:00Z",
  "description": "Update cluster monitoring script to discover nodes via Nomad instead of Consul, ensuring all replicas are visible."
}
25
conductor/archive/update_monitor_discovery_20260208/plan.md
Normal file
@@ -0,0 +1,25 @@
# Plan: Update Monitor Discovery Logic (`update_monitor_discovery`)

## Phase 1: Nomad Discovery Enhancement [x] [checkpoint: 353683e]
- [x] Task: Update `nomad_client.py` to fetch job allocations with IPs (353683e)
  - [x] Write tests for parsing allocation IPs from `nomad job status` or `nomad alloc status`
  - [x] Implement `get_job_allocations(job_id)` returning a list of dicts (id, node, ip)
- [x] Task: Conductor - User Manual Verification 'Phase 1: Nomad Discovery Enhancement' (Protocol in workflow.md)

## Phase 2: Aggregator Refactor [x] [checkpoint: 655a9b2]
- [x] Task: Refactor `cluster_aggregator.py` to drive discovery via Nomad (655a9b2)
  - [x] Update `get_cluster_status` to call `nomad_client.get_job_allocations` first
  - [x] Update loop to iterate over allocations and supplement with LiteFS and Consul data
- [x] Task: Update `consul_client.py` to fetch all services once and allow lookup by IP/ID (655a9b2)
- [x] Task: Update tests for the new discovery flow (655a9b2)
- [x] Task: Conductor - User Manual Verification 'Phase 2: Aggregator Refactor' (Protocol in workflow.md)

## Phase 3: UI and Health Logic [x] [checkpoint: 21e9c3d]
- [x] Task: Update `output_formatter.py` for "Standby" nodes (21e9c3d)
  - [x] Update table formatting to handle missing Consul status for replicas
- [x] Task: Update Cluster Health calculation (21e9c3d)
  - [x] "Healthy" = 1 Primary (Consul passing) + N Replicas (LiteFS connected)
- [~] Task: Extract Uptime from Nomad and internal LiteFS states (txid, checksum)
- [~] Task: Update aggregator and formatter to display detailed database info
- [x] Task: Final verification run (21e9c3d)
- [x] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)
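Phase 1's `get_job_allocations` could parse Nomad allocation data roughly as follows (a sketch only; the JSON field names used here are simplified assumptions, not the exact Nomad API shape):

```python
import json


def get_job_allocations(alloc_json: str):
    """Turn allocation JSON into the (id, node, ip) dicts the
    aggregator iterates over. Field names here are assumptions."""
    allocations = []
    for alloc in json.loads(alloc_json):
        if alloc.get("ClientStatus") != "running":
            continue  # skip stopped/failed allocations
        networks = alloc.get("Networks") or [{}]
        allocations.append({
            "id": alloc["ID"][:8],        # short alloc ID
            "node": alloc.get("NodeName"),
            "ip": networks[0].get("IP"),  # may be None while starting
        })
    return allocations
```

Returning `None` for a missing IP (rather than raising) lets the aggregator render the node as "starting" instead of crashing, which matches the robustness requirement in the spec.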
30
conductor/archive/update_monitor_discovery_20260208/spec.md
Normal file
@@ -0,0 +1,30 @@
# Specification: Update Monitor Discovery Logic (`update_monitor_discovery`)

## Overview
Refactor the cluster monitoring script (`scripts/cluster_status`) to use Nomad as the primary source of discovery. Currently, the script queries Consul for services, which only shows the Primary node in the new architecture. By querying Nomad allocations first, we can identify all running LiteFS nodes (Primary and Replicas) and then inspect their individual health and replication status.

## Functional Requirements
- **Nomad Client (`nomad_client.py`):**
  - Add a function to list all active allocations for a job and extract their host IPs and node names.
- **Consul Client (`consul_client.py`):**
  - Modify to allow checking the registration status of a *specific* node/allocation ID rather than just listing all services.
- **Aggregator (`cluster_aggregator.py`):**
  - **New Discovery Flow:**
    1. Query Nomad for all allocations of `navidrome-litefs`.
    2. For each allocation:
       - Get the Node Name and IP.
       - Query the LiteFS API (`:20202`) on that IP for role/DB info.
       - Query Consul to see if a matching service registration exists (and its health).
- **Formatter (`output_formatter.py`):**
  - Handle nodes that are "Standby" (running in Nomad and LiteFS, but not registered in Consul).
  - Ensure the table correctly displays all 4 nodes.

## Non-Functional Requirements
- **Efficiency:** Minimize CLI calls by batching Nomad/Consul queries where possible.
- **Robustness:** Gracefully handle cases where an allocation has no IP yet (starting state).

## Acceptance Criteria
- [ ] Script output shows all 4 Nomad allocations.
- [ ] Primary node is clearly identified with its Consul health status.
- [ ] Replica nodes are shown with their LiteFS role and DB status, even if not in Consul.
- [ ] Overall cluster health is calculated based on the existence of exactly one Primary and healthy replication on all nodes.
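The health rule in the last acceptance criterion reduces to a small predicate; a sketch (the node dict keys `role`, `consul`, and `litefs` are hypothetical names, not the script's actual schema):

```python
def cluster_health(nodes):
    """'Healthy' means exactly one Primary passing in Consul, with every
    other node connected as a LiteFS replica."""
    primaries = [n for n in nodes if n.get("role") == "primary"]
    replicas = [n for n in nodes if n.get("role") == "replica"]
    healthy = (
        len(primaries) == 1
        and primaries[0].get("consul") == "passing"
        # every node must be accounted for as primary or replica
        and len(primaries) + len(replicas) == len(nodes)
        and all(r.get("litefs") == "connected" for r in replicas)
    )
    return "healthy" if healthy else "degraded"
```

Checking for *exactly* one primary also catches the split-brain and no-leader cases, not just missing replicas.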
23
conductor/code_styleguides/general.md
Normal file
@@ -0,0 +1,23 @@
# General Code Style Principles

This document outlines general coding principles that apply across all languages and frameworks used in this project.

## Readability
- Code should be easy to read and understand by humans.
- Avoid overly clever or obscure constructs.

## Consistency
- Follow existing patterns in the codebase.
- Maintain consistent formatting, naming, and structure.

## Simplicity
- Prefer simple solutions over complex ones.
- Break down complex problems into smaller, manageable parts.

## Maintainability
- Write code that is easy to modify and extend.
- Minimize dependencies and coupling.

## Documentation
- Document *why* something is done, not just *what*.
- Keep documentation up-to-date with code changes.
48
conductor/code_styleguides/go.md
Normal file
@@ -0,0 +1,48 @@
# Effective Go Style Guide Summary

This document summarizes key rules and best practices from the official "Effective Go" guide for writing idiomatic Go code.

## 1. Formatting
- **`gofmt`:** All Go code **must** be formatted with `gofmt` (or `go fmt`). This is a non-negotiable, automated standard.
- **Indentation:** Use tabs for indentation (`gofmt` handles this).
- **Line Length:** Go has no strict line length limit. Let `gofmt` handle line wrapping.

## 2. Naming
- **`MixedCaps`:** Use `MixedCaps` or `mixedCaps` for multi-word names. Do not use underscores.
- **Exported vs. Unexported:** Names starting with an uppercase letter are exported (public). Names starting with a lowercase letter are not exported (private).
- **Package Names:** Short, concise, single-word, lowercase names.
- **Getters:** Do not name getters with a `Get` prefix. A getter for a field named `owner` should be named `Owner()`.
- **Interface Names:** One-method interfaces are named by the method name plus an `-er` suffix (e.g., `Reader`, `Writer`).

## 3. Control Structures
- **`if`:** No parentheses around the condition. Braces are mandatory. Can include an initialization statement (e.g., `if err := file.Chmod(0664); err != nil`).
- **`for`:** Go's only looping construct. Unifies `for` and `while`. Use `for...range` to iterate over slices, maps, strings, and channels.
- **`switch`:** More general than in C. Cases do not fall through by default (use `fallthrough` explicitly). Can be used without an expression to function as a cleaner `if-else-if` chain.

## 4. Functions
- **Multiple Returns:** Functions can return multiple values. This is the standard way to return a result and an error (e.g., `value, err`).
- **Named Result Parameters:** Return parameters can be named. This can make code clearer and more concise.
- **`defer`:** Schedules a function call to be run immediately before the function executing `defer` returns. Use it for cleanup tasks like closing files.

## 5. Data
- **`new` vs. `make`:**
  - `new(T)`: Allocates memory for a new item of type `T`, zeroes it, and returns a pointer (`*T`).
  - `make(T, ...)`: Creates and initializes slices, maps, and channels only. Returns an initialized value of type `T` (not a pointer).
- **Slices:** The preferred way to work with sequences. They are more flexible than arrays.
- **Maps:** Use the "comma ok" idiom to check for the existence of a key: `value, ok := myMap[key]`.

## 6. Interfaces
- **Implicit Implementation:** A type implements an interface by implementing its methods. No `implements` keyword is needed.
- **Small Interfaces:** Prefer many small interfaces over one large one. The standard library is full of single-method interfaces (e.g., `io.Reader`).

## 7. Concurrency
- **Share Memory By Communicating:** This is the core philosophy. Do not communicate by sharing memory; instead, share memory by communicating.
- **Goroutines:** Lightweight, concurrently executing functions. Start one with the `go` keyword.
- **Channels:** Typed conduits for communication between goroutines. Use `make` to create them.

## 8. Errors
- **`error` type:** The built-in `error` interface is the standard way to handle errors.
- **Explicit Error Handling:** Do not discard errors with the blank identifier (`_`). Check for errors explicitly.
- **`panic`:** Reserved for truly exceptional, unrecoverable situations. Generally, libraries should not panic.

*Source: [Effective Go](https://go.dev/doc/effective_go)*
14
conductor/index.md
Normal file
@@ -0,0 +1,14 @@
# Project Context

## Definition
- [Product Definition](./product.md)
- [Product Guidelines](./product-guidelines.md)
- [Tech Stack](./tech-stack.md)

## Workflow
- [Workflow](./workflow.md)
- [Code Style Guides](./code_styleguides/)

## Management
- [Tracks Registry](./tracks.md)
- [Tracks Directory](./tracks/)
17
conductor/product-guidelines.md
Normal file
@@ -0,0 +1,17 @@
# Product Guidelines - JuiceNavidrome Infrastructure

## Vision
To provide a rock-solid, self-healing, and highly available deployment model for Navidrome, treating the infrastructure as a first-class product.

## Operational Principles
- **Infrastructure as Code:** All cluster configurations, job definitions, and container builds must be versioned and managed through Git.
- **Failover-First Design:** The primary measure of success is the system's ability to automatically recover from node failures without manual intervention.
- **Minimal State on Host:** Local host data should be limited to necessary LiteFS state and caches; all critical application data must be replicated or reside on shared storage.

## Automation Standards
- **Automated Rebuilds:** Any change to the base configuration or Dockerfile must trigger an automatic build and push to the local registry via Gitea Actions.
- **Safe Rollouts:** Use sequential update strategies (`max_parallel = 1`) to ensure the cluster remains healthy during version transitions.

## Reliability & Monitoring
- **Health-Aware Orchestration:** Leverage Consul health checks to provide Traefik with accurate routing information, ensuring traffic only hits ready nodes.
- **Consistent Initialization:** Use clean bootstrap procedures to avoid common distributed system pitfalls like stale cluster IDs or locking conflicts.
17
conductor/product.md
Normal file
@@ -0,0 +1,17 @@
# Initial Concept

Deploy and manage a highly available Navidrome music server with replicated SQLite storage using LiteFS on a Nomad cluster.

## Overview
A highly available and durable personal music streaming service built on Navidrome and LiteFS, orchestrated by Nomad.

## Target Audience
- Personal use for a single music enthusiast who demands constant access to their library.

## Key Goals
- **High Availability:** Ensure the music server remains accessible even if a cluster node fails, utilizing automatic failover.
- **Data Durability:** Maintain multiple synchronous copies of the SQLite database across different physical nodes to prevent data loss.

## Core Features
- **High-Quality Streaming:** Support for advanced audio formats and on-the-fly transcoding (Opus/FLAC) to ensure the best possible listening experience.
- **Universal Compatibility:** Full support for the Subsonic API to allow connection from a wide variety of mobile and desktop music clients.
- **Automated Infrastructure:** Managed by Nomad and Consul for seamless cluster operations and service discovery.
- **Robust High Availability:** Automatic failover with TTL-based self-registration for clean and resilient service catalog management.
1
conductor/setup_state.json
Normal file
@@ -0,0 +1 @@
{"last_successful_step": "3.3_initial_track_generated"}
22
conductor/tech-stack.md
Normal file
@@ -0,0 +1,22 @@
# Technology Stack - JuiceNavidrome

## Core Infrastructure
- **Nomad:** Orchestrates the deployment of the Navidrome service across the cluster nodes.
- **Consul:** Handles service discovery, health monitoring, and the distributed locking required for LiteFS leader election.
- **Docker:** Provides the container runtime environment for Navidrome and LiteFS.

## Storage & Database
- **SQLite:** The primary relational database used by Navidrome for metadata and state.
- **LiteFS:** A FUSE-based filesystem that provides synchronous replication of the SQLite database across the cluster.
- **Process Management:** LiteFS-supervised with a robust TTL-heartbeat registration script ensuring zero-downtime failover and clean service catalog management.

## Automation & Delivery
- **Gitea Actions:** Automates the multi-arch (AMD64/ARM64) building and pushing of the custom supervised container image.
- **Git:** Source control for all infrastructure-as-code artifacts.

## Networking
- **Traefik:** Acts as the cluster's ingress controller, routing traffic based on Consul tags.
- **LiteFS Proxy:** Handles transparent write-forwarding to the cluster leader.

## Monitoring & Tooling
- **Python (Cluster Status Script):** A local CLI tool for hybrid monitoring using Nomad (discovery & uptime), Consul (service health), and the LiteFS HTTP API (internal replication state).
8
conductor/tracks.md
Normal file
@@ -0,0 +1,8 @@
# Project Tracks

This file tracks all major tracks for the project. Each track has its own detailed plan in its respective folder.

---

- [x] **Track: Update Monitor Discovery Logic**
  *Link: [./tracks/update_monitor_discovery_20260208/](./tracks/update_monitor_discovery_20260208/)*
30
conductor/tracks/diagnose_and_enhance_20260208/plan.md
Normal file
@@ -0,0 +1,30 @@
# Plan: Cluster Diagnosis and Script Enhancement (`diagnose_and_enhance`)

## Phase 1: Enhanced Diagnostics (Consul) [x] [checkpoint: a686c5b]
- [x] Task: Update `consul_client.py` to fetch detailed health check output
  - [x] Write tests for fetching the `Output` field from Consul checks
  - [x] Implement logic to extract and store the `Output` (error message)
- [x] Task: Update aggregator and formatter to display Consul errors
  - [x] Update aggregation logic to include `consul_error`
  - [x] Update table formatter to indicate an error (e.g., a flag or color)
  - [x] Add a "Diagnostics" section to the output to print full error details
- [x] Task: Conductor - User Manual Verification 'Phase 1: Enhanced Diagnostics (Consul)' (Protocol in workflow.md)

## Phase 2: Nomad Integration and Logs [x] [checkpoint: 6d77729]
- [x] Task: Implement `nomad_client.py` wrapper
  - [x] Write tests for `get_allocation_logs`, `get_node_status`, and `restart_allocation` (mocking subprocess)
  - [x] Implement `subprocess.run(["nomad", ...])` logic to fetch logs and restart allocations
- [x] Task: Integrate Nomad logs into diagnosis
  - [x] Update aggregator to call Nomad client for critical nodes
  - [x] Update "Diagnostics" section to display the last 20 lines of stderr
- [x] Task: Conductor - User Manual Verification 'Phase 2: Nomad Integration and Logs' (Protocol in workflow.md)

## Phase 3: Advanced LiteFS Status [ ]
- [ ] Task: Implement `litefs_status` via `nomad alloc exec`
  - [ ] Write tests for executing remote commands via Nomad
  - [ ] Update `litefs_client.py` to fall back to `nomad alloc exec` if HTTP fails
  - [ ] Parse `litefs status` output (text/json) to extract uptime and replication lag
- [ ] Task: Final Polish and Diagnosis Run
  - [ ] Ensure all pieces work together
  - [ ] Run the script to diagnose `odroid8`
- [ ] Task: Conductor - User Manual Verification 'Phase 3: Advanced LiteFS Status' (Protocol in workflow.md)
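Phase 1's extraction of the `Output` field could look roughly like this (a sketch; the `CheckID`/`Status`/`Output` keys follow Consul's health-check JSON, but the surrounding function name and return shape are hypothetical):

```python
def extract_consul_errors(checks):
    """Map failing check IDs to their Output (the error message Consul
    captured), so the Diagnostics section can print full details."""
    return {
        check["CheckID"]: check.get("Output", "").strip()
        for check in checks
        if check.get("Status") != "passing"
    }
```

Keying on `CheckID` keeps the mapping stable across refreshes, and filtering on `Status != "passing"` covers both `warning` and `critical` checks.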
22
conductor/tracks/fix_litefs_config_20260208/plan.md
Normal file
@@ -0,0 +1,22 @@
# Plan: Fix LiteFS Configuration and Process Management (`fix_litefs_config`)

## Phase 1: Configuration and Image Structure [ ]
- [x] Task: Update `litefs.yml` to include the `exec` block (396dfeb)
- [x] Task: Update `Dockerfile` to use LiteFS as the supervisor (`ENTRYPOINT ["litefs", "mount"]`) (ef91b8e)
- [x] Task: Update `navidrome-litefs-v2.nomad` with corrected storage paths (`ND_DATAFOLDER`, `ND_CACHEFOLDER`, `ND_BACKUP_PATH`) (5cbb657)
- [ ] Task: Conductor - User Manual Verification 'Phase 1: Configuration and Image Structure' (Protocol in workflow.md)

## Phase 2: Entrypoint and Registration Logic [x] [checkpoint: 9cd5455]
- [x] Task: Refactor `entrypoint.sh` to handle leadership-aware process management (9cd5455)
  - [x] Integrate Consul registration logic (from `register.sh`)
  - [x] Implement loop to start/stop Navidrome based on `/data/.primary` existence
  - [x] Ensure proper signal handling for Navidrome shutdown
- [x] Task: Clean up redundant scripts (e.g., `register.sh` if fully integrated) (9cd5455)
- [ ] Task: Conductor - User Manual Verification 'Phase 2: Entrypoint and Registration Logic' (Protocol in workflow.md)

## Phase 3: Deployment and Failover Verification [ ]
- [ ] Task: Build and push the updated Docker image via Gitea Actions (if possible) or manual trigger
- [ ] Task: Deploy the updated Nomad job
- [ ] Task: Verify cluster health and process distribution using the `cluster_status` script
- [ ] Task: Perform a manual failover (stop the primary allocation) and verify Navidrome migrates correctly
- [ ] Task: Conductor - User Manual Verification 'Phase 3: Deployment and Failover Verification' (Protocol in workflow.md)
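For reference, the `exec` block added in Phase 1 takes roughly this shape in `litefs.yml` (a sketch based on LiteFS's documented config layout; the directory paths shown are assumptions, not this repo's actual values):

```yaml
# LiteFS runs as PID 1 (ENTRYPOINT ["litefs", "mount"]) and supervises
# the command(s) listed under exec.
fuse:
  dir: "/data"              # where the replicated SQLite files are exposed
data:
  dir: "/var/lib/litefs"    # internal LiteFS state
exec:
  - cmd: "/entrypoint.sh"   # leadership-aware supervisor script
```

With LiteFS as the supervisor, the mount is guaranteed to exist before `entrypoint.sh` starts, which removes a whole class of startup races.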
17
conductor/tracks/fix_navidrome_paths_20260209/plan.md
Normal file
@@ -0,0 +1,17 @@
# Plan: Correct Navidrome Database and Plugins Location (`fix_navidrome_paths`)

## Phase 1: Configuration Updates [x]
- [x] Task: Update `navidrome-litefs-v2.nomad` with corrected paths (76398de)
- [x] Task: Update `entrypoint.sh` to handle plugins folder and environment cleanup (decb9f5)
- [x] Task: Conductor - User Manual Verification 'Phase 1: Configuration Updates' (Protocol in workflow.md)

## Phase 2: Build and Deployment [ ]
- [ ] Task: Commit changes and push to Gitea to trigger build
- [ ] Task: Monitor Gitea build completion
- [ ] Task: Deploy updated Nomad job
- [ ] Task: Conductor - User Manual Verification 'Phase 2: Build and Deployment' (Protocol in workflow.md)

## Phase 3: Final Verification [ ]
- [ ] Task: Verify database path via `lsof` on the Primary node
- [ ] Task: Verify replication health using the `cluster_status` script
- [ ] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)
26
conductor/tracks/fix_odroid8_and_script_20260208/plan.md
Normal file
@@ -0,0 +1,26 @@
# Plan: Fix Odroid8 and Script Robustness (`fix_odroid8_and_script`)

## Phase 1: Script Robustness [x] [checkpoint: 860000b]
- [x] Task: Update `nomad_client.py` to handle subprocess errors gracefully
  - [x] Write tests for handling Nomad CLI absence/failure
  - [x] Update implementation to return descriptive error objects or `None` without crashing
- [x] Task: Update aggregator and formatter to handle Nomad errors
  - [x] Update `cluster_aggregator.py` to gracefully skip Nomad calls if they fail
  - [x] Update `output_formatter.py` to display "Nomad Error" in relevant cells
  - [x] Add a global "Nomad Connectivity Warning" to the summary
- [x] Task: Conductor - User Manual Verification 'Phase 1: Script Robustness' (Protocol in workflow.md)

## Phase 2: Odroid8 Recovery [ ]
- [x] Task: Identify and verify `odroid8` LiteFS data path
  - [x] Run `nomad alloc status` to find the volume mount for `odroid8`
  - [x] Provide the user with the exact host path to the LiteFS data
- [x] Task: Guide user through manual cleanup
  - [x] Provide steps to stop the allocation
  - [x] Provide the `rm` command to clear the LiteFS metadata
  - [x] Provide steps to restart and verify the node
- [~] Task: Conductor - User Manual Verification 'Phase 2: Odroid8 Recovery' (Protocol in workflow.md)

## Phase 3: Final Verification [x]
- [x] Task: Final verification run of the script
- [x] Task: Verify cluster health in Consul and LiteFS API
- [x] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)
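Phase 1's "never crash on CLI failure" behavior can be sketched as follows (illustrative; `run_cli` and its `(stdout, error)` return shape are hypothetical, not the actual `nomad_client.py` API):

```python
import subprocess


def run_cli(args, binary="nomad", timeout=10):
    """Run a CLI command, returning (stdout, error). Never raises, so the
    aggregator can degrade gracefully when the binary is missing, hangs,
    or Nomad is unreachable."""
    try:
        result = subprocess.run(
            [binary, *args], capture_output=True, text=True, timeout=timeout
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        return None, f"{binary} unavailable: {exc}"
    if result.returncode != 0:
        # Prefer Nomad's own stderr message; fall back to the exit code.
        return None, result.stderr.strip() or f"{binary} exited {result.returncode}"
    return result.stdout, None
```

The caller then checks the error slot (`out, err = run_cli(...)`) and renders "Nomad Error" in the table instead of propagating an exception.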
22
conductor/tracks/implement_ttl_heartbeat_20260208/plan.md
Normal file
@@ -0,0 +1,22 @@
# Plan: Implement TTL Heartbeat Service Registration (`implement_ttl_heartbeat`)

## Phase 1: Container Environment Preparation [x] [checkpoint: 51b8fce]
- [x] Task: Update `Dockerfile` to install `curl` and `jq` (f7fe258)
- [x] Task: Verify `litefs.yml` points to `entrypoint.sh` (should already be correct) (verified)
- [x] Task: Conductor - User Manual Verification 'Phase 1: Container Environment Preparation' (Protocol in workflow.md)

## Phase 2: Script Implementation [x] [checkpoint: 139016f]
- [x] Task: Refactor `entrypoint.sh` with the TTL Heartbeat logic (d977301)
  - [x] Implement `register_service` with TTL check definition
  - [x] Implement `pass_ttl` loop
  - [x] Implement robust `stop_app` and signal trapping
  - [x] Ensure correct Primary/Replica detection logic (LiteFS 0.5: Primary = No `.primary` file)
- [x] Task: Conductor - User Manual Verification 'Phase 2: Script Implementation' (Protocol in workflow.md)

## Phase 3: Deployment and Verification [ ]
- [ ] Task: Commit changes and push to Gitea to trigger build
- [ ] Task: Monitor Gitea build completion
- [ ] Task: Deploy updated Nomad job (forcing update if necessary)
- [ ] Task: Verify "Clean" state in Consul (only one primary registered)
- [ ] Task: Verify Failover/Stop behavior (immediate deregistration vs TTL expiry)
- [ ] Task: Conductor - User Manual Verification 'Phase 3: Deployment and Verification' (Protocol in workflow.md)
23
conductor/tracks/update_monitor_discovery_20260208/plan.md
Normal file
@@ -0,0 +1,23 @@
# Plan: Update Monitor Discovery Logic (`update_monitor_discovery`)

## Phase 1: Nomad Discovery Enhancement [x] [checkpoint: 353683e]
- [x] Task: Update `nomad_client.py` to fetch job allocations with IPs (353683e)
  - [x] Write tests for parsing allocation IPs from `nomad job status` or `nomad alloc status`
  - [x] Implement `get_job_allocations(job_id)` returning a list of dicts (id, node, ip)
- [x] Task: Conductor - User Manual Verification 'Phase 1: Nomad Discovery Enhancement' (Protocol in workflow.md)

## Phase 2: Aggregator Refactor [x] [checkpoint: 655a9b2]
- [x] Task: Refactor `cluster_aggregator.py` to drive discovery via Nomad (655a9b2)
  - [x] Update `get_cluster_status` to call `nomad_client.get_job_allocations` first
  - [x] Update loop to iterate over allocations and supplement with LiteFS and Consul data
- [x] Task: Update `consul_client.py` to fetch all services once and allow lookup by IP/ID (655a9b2)
- [x] Task: Update tests for the new discovery flow (655a9b2)
- [x] Task: Conductor - User Manual Verification 'Phase 2: Aggregator Refactor' (Protocol in workflow.md)

## Phase 3: UI and Health Logic [x] [checkpoint: 21e9c3d]
- [x] Task: Update `output_formatter.py` for "Standby" nodes (21e9c3d)
  - [x] Update table formatting to handle missing Consul status for replicas
- [x] Task: Update Cluster Health calculation (21e9c3d)
  - [x] "Healthy" = 1 Primary (Consul passing) + N Replicas (LiteFS connected)
- [x] Task: Final verification run (21e9c3d)
- [x] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)
333
conductor/workflow.md
Normal file
@@ -0,0 +1,333 @@
|
||||
# Project Workflow
|
||||
|
||||
## Guiding Principles
|
||||
|
||||
1. **The Plan is the Source of Truth:** All work must be tracked in `plan.md`
|
||||
2. **The Tech Stack is Deliberate:** Changes to the tech stack must be documented in `tech-stack.md` *before* implementation
|
||||
3. **Test-Driven Development:** Write unit tests before implementing functionality
|
||||
4. **High Code Coverage:** Aim for >80% code coverage for all modules
|
||||
5. **User Experience First:** Every decision should prioritize user experience
|
||||
6. **Non-Interactive & CI-Aware:** Prefer non-interactive commands. Use `CI=true` for watch-mode tools (tests, linters) to ensure single execution.
|
||||
|
||||
## Task Workflow
|
||||
|
||||
All tasks follow a strict lifecycle:
|
||||
|
||||
### Standard Task Workflow
|
||||
|
||||
1. **Select Task:** Choose the next available task from `plan.md` in sequential order.

2. **Mark In Progress:** Before beginning work, edit `plan.md` and change the task from `[ ]` to `[~]`.

3. **Write Failing Tests (Red Phase):**
   - Create a new test file for the feature or bug fix.
   - Write one or more unit tests that clearly define the expected behavior and acceptance criteria for the task.
   - **CRITICAL:** Run the tests and confirm that they fail as expected. This is the "Red" phase of TDD. Do not proceed until you have failing tests.

4. **Implement to Pass Tests (Green Phase):**
   - Write the minimum amount of application code necessary to make the failing tests pass.
   - Run the test suite again and confirm that all tests now pass. This is the "Green" phase.

5. **Refactor (Optional but Recommended):**
   - With the safety of passing tests, refactor the implementation code and the test code to improve clarity, remove duplication, and enhance performance without changing the external behavior.
   - Rerun tests to ensure they still pass after refactoring.

6. **Verify Coverage:** Run coverage reports using the project's chosen tools. For example, in a Python project, this might look like:

   ```bash
   pytest --cov=app --cov-report=html
   ```

   Target: >80% coverage for new code. The specific tools and commands will vary by language and framework.

7. **Document Deviations:** If the implementation differs from the tech stack:
   - **STOP** implementation
   - Update `tech-stack.md` with the new design
   - Add a dated note explaining the change
   - Resume implementation

8. **Commit Code Changes:**
   - Stage all code changes related to the task.
   - Propose a clear, concise commit message, e.g., `feat(ui): Create basic HTML structure for calculator`.
   - Perform the commit.

9. **Attach Task Summary with Git Notes:**
   - **Step 9.1: Get Commit Hash:** Obtain the hash of the *just-completed commit* (`git log -1 --format="%H"`).
   - **Step 9.2: Draft Note Content:** Create a detailed summary for the completed task. This should include the task name, a summary of changes, a list of all created/modified files, and the core "why" for the change.
   - **Step 9.3: Attach Note:** Use the `git notes` command to attach the summary to the commit.

   ```bash
   # The note content from the previous step is passed via the -m flag.
   git notes add -m "<note content>" <commit_hash>
   ```

10. **Get and Record Task Commit SHA:**
    - **Step 10.1: Update Plan:** Read `plan.md`, find the line for the completed task, update its status from `[~]` to `[x]`, and append the first 7 characters of the *just-completed commit's* hash.
    - **Step 10.2: Write Plan:** Write the updated content back to `plan.md`.

11. **Commit Plan Update:**
    - **Action:** Stage the modified `plan.md` file.
    - **Action:** Commit this change with a descriptive message (e.g., `conductor(plan): Mark task 'Create user model' as complete`).
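
Steps 8–11 above can be condensed into one shell sequence. This is a sketch only: it runs in a throwaway repo so it is safe to execute anywhere, and the task name, commit messages, and note text are placeholder values rather than part of the protocol.

```bash
# Demo in a throwaway repo so the sequence is safe to run anywhere.
cd "$(mktemp -d)" && git init -q . && git config user.email ai@example.com && git config user.name conductor
printf -- "- [~] Create basic HTML structure\n" > plan.md
echo "<html></html>" > index.html

# Step 8: stage and commit the code changes for the task
git add -A
git commit -q -m "feat(ui): Create basic HTML structure for calculator"

# Step 9: capture the hash and attach the task summary as a git note
COMMIT_HASH=$(git log -1 --format="%H")
git notes add -m "Task: Create basic HTML structure. Files: index.html." "$COMMIT_HASH"

# Step 10: mark the task done in plan.md and append the short SHA
SHORT_SHA=$(printf '%.7s' "$COMMIT_HASH")
sed -i "s/^- \[~\] Create basic HTML structure/- [x] Create basic HTML structure (${SHORT_SHA})/" plan.md

# Step 11: commit the plan update separately
git add plan.md
git commit -q -m "conductor(plan): Mark task 'Create basic HTML structure' as complete"
```

Keeping the plan update in its own commit (step 11) keeps the code history and the bookkeeping history separable.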
### Phase Completion Verification and Checkpointing Protocol

**Trigger:** This protocol is executed immediately after a task is completed that also concludes a phase in `plan.md`.

1. **Announce Protocol Start:** Inform the user that the phase is complete and the verification and checkpointing protocol has begun.

2. **Ensure Test Coverage for Phase Changes:**
   - **Step 2.1: Determine Phase Scope:** To identify the files changed in this phase, you must first find the starting point. Read `plan.md` to find the Git commit SHA of the *previous* phase's checkpoint. If no previous checkpoint exists, the scope is all changes since the first commit.
   - **Step 2.2: List Changed Files:** Execute `git diff --name-only <previous_checkpoint_sha> HEAD` to get a precise list of all files modified during this phase.
   - **Step 2.3: Verify and Create Tests:** For each file in the list:
     - **CRITICAL:** First, check its extension. Exclude non-code files (e.g., `.json`, `.md`, `.yaml`).
     - For each remaining code file, verify that a corresponding test file exists.
     - If a test file is missing, you **must** create one. Before writing the test, **first analyze other test files in the repository to determine the correct naming convention and testing style.** The new tests **must** validate the functionality described in this phase's tasks (`plan.md`).
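
Steps 2.1–2.3 reduce to a diff-driven listing; a minimal sketch follows (the function name and the extension filter are illustrative, not exhaustive):

```bash
# List code files changed since the previous phase checkpoint.
# The non-code extension filter (step 2.3) is illustrative only.
phase_changed_code_files() {
  prev_sha="$1"   # checkpoint SHA, as read from plan.md
  git diff --name-only "$prev_sha" HEAD \
    | grep -vE '\.(json|md|ya?ml)$' || true
}

# Example: phase_changed_code_files <previous_checkpoint_sha>
```

Each file this prints should then be checked for a corresponding test file, per step 2.3.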
3. **Execute Automated Tests with Proactive Debugging:**
   - Before execution, you **must** announce the exact shell command you will use to run the tests.
   - **Example Announcement:** "I will now run the automated test suite to verify the phase. **Command:** `CI=true npm test`"
   - Execute the announced command.
   - If tests fail, you **must** inform the user and begin debugging. You may attempt to propose a fix a **maximum of two times**. If the tests still fail after your second proposed fix, you **must stop**, report the persistent failure, and ask the user for guidance.
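
The bounded debugging loop in step 3 can be sketched as a shell function; the two-fix cap and the announced `CI=true npm test` command come from the text above, while the function name is illustrative:

```bash
# Run the announced test command; allow at most two proposed fixes
# before stopping and escalating to the user.
run_with_bounded_fixes() {
  test_cmd="$1"
  max_fixes=2
  attempt=0
  until $test_cmd; do
    attempt=$((attempt + 1))
    if [ "$attempt" -gt "$max_fixes" ]; then
      echo "Tests still failing after ${max_fixes} proposed fixes; asking the user for guidance." >&2
      return 1
    fi
    echo "Test run failed (attempt ${attempt}); proposing a fix..."
    # <propose and apply a fix here before retrying>
  done
}

# Example invocation with the announced command:
# run_with_bounded_fixes "env CI=true npm test"
```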
4. **Propose a Detailed, Actionable Manual Verification Plan:**
   - **CRITICAL:** To generate the plan, first analyze `product.md`, `product-guidelines.md`, and `plan.md` to determine the user-facing goals of the completed phase.
   - You **must** generate a step-by-step plan that walks the user through the verification process, including any necessary commands and specific, expected outcomes.
   - The plan you present to the user **must** follow this format:

   **For a Frontend Change:**
   ```
   The automated tests have passed. For manual verification, please follow these steps:

   **Manual Verification Steps:**
   1. **Start the development server with the command:** `npm run dev`
   2. **Open your browser to:** `http://localhost:3000`
   3. **Confirm that you see:** The new user profile page, with the user's name and email displayed correctly.
   ```

   **For a Backend Change:**
   ```
   The automated tests have passed. For manual verification, please follow these steps:

   **Manual Verification Steps:**
   1. **Ensure the server is running.**
   2. **Execute the following command in your terminal:** `curl -X POST http://localhost:8080/api/v1/users -d '{"name": "test"}'`
   3. **Confirm that you receive:** A `201 Created` status with a JSON response body.
   ```

5. **Await Explicit User Feedback:**
   - After presenting the detailed plan, ask the user for confirmation: "**Does this meet your expectations? Please confirm with yes or provide feedback on what needs to be changed.**"
   - **PAUSE** and await the user's response. Do not proceed without an explicit yes or confirmation.

6. **Create Checkpoint Commit:**
   - Stage all changes. If no changes occurred in this step, proceed with an empty commit.
   - Perform the commit with a clear and concise message (e.g., `conductor(checkpoint): Checkpoint end of Phase X`).

7. **Attach Auditable Verification Report using Git Notes:**
   - **Step 7.1: Draft Note Content:** Create a detailed verification report including the automated test command, the manual verification steps, and the user's confirmation.
   - **Step 7.2: Attach Note:** Use the `git notes` command and the full commit hash from the previous step to attach the full report to the checkpoint commit.

8. **Get and Record Phase Checkpoint SHA:**
   - **Step 8.1: Get Commit Hash:** Obtain the hash of the *just-created checkpoint commit* (`git log -1 --format="%H"`).
   - **Step 8.2: Update Plan:** Read `plan.md`, find the heading for the completed phase, and append the first 7 characters of the commit hash in the format `[checkpoint: <sha>]`.
   - **Step 8.3: Write Plan:** Write the updated content back to `plan.md`.
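
Step 8.2 is a single-line rewrite of the phase heading; a sketch in Python (the heading text and SHA used below are placeholders):

```python
def record_checkpoint(plan_text: str, phase_heading: str, full_sha: str) -> str:
    """Append '[checkpoint: <short-sha>]' to the matching phase heading."""
    short_sha = full_sha[:7]  # first 7 characters, per step 8.2
    out = []
    for line in plan_text.splitlines():
        if line.strip() == phase_heading:
            line = f"{line} [checkpoint: {short_sha}]"
        out.append(line)
    return "\n".join(out)
```

Writing the result back to `plan.md` (step 8.3) and committing it (step 9) stay as separate actions.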
9. **Commit Plan Update:**
   - **Action:** Stage the modified `plan.md` file.
   - **Action:** Commit this change with a descriptive message following the format `conductor(plan): Mark phase '<PHASE NAME>' as complete`.

10. **Announce Completion:** Inform the user that the phase is complete and the checkpoint has been created, with the detailed verification report attached as a git note.

### Quality Gates

Before marking any task complete, verify:

- [ ] All tests pass
- [ ] Code coverage meets requirements (>80%)
- [ ] Code follows the project's code style guidelines (as defined in `code_styleguides/`)
- [ ] All public functions/methods are documented (e.g., docstrings, JSDoc, GoDoc)
- [ ] Type safety is enforced (e.g., type hints, TypeScript types, Go types)
- [ ] No linting or static analysis errors (using the project's configured tools)
- [ ] Works correctly on mobile (if applicable)
- [ ] Documentation updated if needed
- [ ] No security vulnerabilities introduced

## Development Commands

**AI AGENT INSTRUCTION: This section should be adapted to the project's specific language, framework, and build tools.**

### Setup
```bash
# Example: Commands to set up the development environment (e.g., install dependencies, configure database)
# e.g., for a Node.js project: npm install
# e.g., for a Go project: go mod tidy
```

### Daily Development
```bash
# Example: Commands for common daily tasks (e.g., start dev server, run tests, lint, format)
# e.g., for a Node.js project: npm run dev, npm test, npm run lint
# e.g., for a Go project: go run main.go, go test ./..., go fmt ./...
```

### Before Committing
```bash
# Example: Commands to run all pre-commit checks (e.g., format, lint, type check, run tests)
# e.g., for a Node.js project: npm run check
# e.g., for a Go project: make check (if a Makefile exists)
```

## Testing Requirements

### Unit Testing
- Every module must have corresponding tests.
- Use appropriate test setup/teardown mechanisms (e.g., fixtures, beforeEach/afterEach).
- Mock external dependencies.
- Test both success and failure cases.
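
In a Python project, those unit-testing rules might look like the sketch below; `fetch_user` and its injected HTTP client are hypothetical, with the external dependency mocked and both the success and the failure path covered:

```python
from unittest.mock import Mock

# Hypothetical unit under test: looks a user up via an injected HTTP client,
# so the external dependency can be mocked in tests.
def fetch_user(client, user_id):
    resp = client.get(f"/users/{user_id}")
    if resp.status_code != 200:
        raise LookupError(f"user {user_id} not found")
    return resp.json()

def test_fetch_user_success():
    client = Mock()
    client.get.return_value = Mock(status_code=200, json=lambda: {"id": 1})
    assert fetch_user(client, 1) == {"id": 1}

def test_fetch_user_failure():
    client = Mock()
    client.get.return_value = Mock(status_code=404)
    try:
        fetch_user(client, 99)
    except LookupError:
        pass
    else:
        raise AssertionError("expected LookupError")
```

The same shape — inject the dependency, stub it per test, assert on both outcomes — applies regardless of framework.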
### Integration Testing
- Test complete user flows
- Verify database transactions
- Test authentication and authorization
- Check form submissions

### Mobile Testing
- Test on an actual iPhone when possible
- Use Safari developer tools
- Test touch interactions
- Verify responsive layouts
- Check performance on 3G/4G

## Code Review Process

### Self-Review Checklist
Before requesting review:

1. **Functionality**
   - Feature works as specified
   - Edge cases handled
   - Error messages are user-friendly

2. **Code Quality**
   - Follows style guide
   - DRY principle applied
   - Clear variable/function names
   - Appropriate comments

3. **Testing**
   - Unit tests comprehensive
   - Integration tests pass
   - Coverage adequate (>80%)

4. **Security**
   - No hardcoded secrets
   - Input validation present
   - SQL injection prevented
   - XSS protection in place

5. **Performance**
   - Database queries optimized
   - Images optimized
   - Caching implemented where needed

6. **Mobile Experience**
   - Touch targets adequate (44x44px)
   - Text readable without zooming
   - Performance acceptable on mobile
   - Interactions feel native

## Commit Guidelines

### Message Format
```
<type>(<scope>): <description>

[optional body]

[optional footer]
```

### Types
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation only
- `style`: Formatting, missing semicolons, etc.
- `refactor`: Code change that neither fixes a bug nor adds a feature
- `test`: Adding missing tests
- `chore`: Maintenance tasks

### Examples
```bash
git commit -m "feat(auth): Add remember me functionality"
git commit -m "fix(posts): Correct excerpt generation for short posts"
git commit -m "test(comments): Add tests for emoji reaction limits"
git commit -m "style(mobile): Improve button touch targets"
```

## Definition of Done

A task is complete when:

1. All code is implemented to specification
2. Unit tests are written and passing
3. Code coverage meets project requirements
4. Documentation is complete (if applicable)
5. Code passes all configured linting and static analysis checks
6. It works beautifully on mobile (if applicable)
7. Implementation notes are added to `plan.md`
8. Changes are committed with a proper message
9. A git note with the task summary is attached to the commit

## Emergency Procedures

### Critical Bug in Production
1. Create a hotfix branch from main
2. Write a failing test for the bug
3. Implement a minimal fix
4. Test thoroughly, including mobile
5. Deploy immediately
6. Document in `plan.md`

### Data Loss
1. Stop all write operations
2. Restore from the latest backup
3. Verify data integrity
4. Document the incident
5. Update backup procedures

### Security Breach
1. Rotate all secrets immediately
2. Review access logs
3. Patch the vulnerability
4. Notify affected users (if any)
5. Document and update security procedures

## Deployment Workflow

### Pre-Deployment Checklist
- [ ] All tests passing
- [ ] Coverage >80%
- [ ] No linting errors
- [ ] Mobile testing complete
- [ ] Environment variables configured
- [ ] Database migrations ready
- [ ] Backup created

### Deployment Steps
1. Merge the feature branch to main
2. Tag the release with a version
3. Push to the deployment service
4. Run database migrations
5. Verify the deployment
6. Test critical paths
7. Monitor for errors

### Post-Deployment
1. Monitor analytics
2. Check error logs
3. Gather user feedback
4. Plan the next iteration

## Continuous Improvement

- Review the workflow weekly
- Update based on pain points
- Document lessons learned
- Optimize for user happiness
- Keep things simple and maintainable
122 entrypoint.sh Normal file
@@ -0,0 +1,122 @@
#!/bin/bash
set -e

# Configuration from environment
SERVICE_NAME="navidrome"
# Use the Nomad allocation ID for a unique service ID
SERVICE_ID="${SERVICE_NAME}-${NOMAD_ALLOC_ID:-$(hostname)}"
PORT=4533
CONSUL_HTTP_ADDR="${CONSUL_URL:-http://localhost:8500}"
NODE_IP="${ADVERTISE_IP}"
DB_LOCK_FILE="/data/.primary"
NAVIDROME_PID=0

# Tags for the Primary service (Traefik enabled)
PRIMARY_TAGS='["navidrome","web","traefik.enable=true","urlprefix-/navidrome","tools","traefik.http.routers.navidromelan.rule=Host(`navidrome.service.dc1.consul`)","traefik.http.routers.navidromewan.rule=Host(`m.fbleagh.duckdns.org`)","traefik.http.routers.navidromewan.middlewares=dex@consulcatalog","traefik.http.routers.navidromewan.tls=true"]'

# --- Helper Functions ---

# Register Service with TTL Check
register_service() {
  echo "Promoted! Registering service ${SERVICE_ID}..."
  # PRIMARY_TAGS is already a JSON array, so it can be embedded directly.
  curl -s -X PUT "${CONSUL_HTTP_ADDR}/v1/agent/service/register" -d "{
    \"ID\": \"${SERVICE_ID}\",
    \"Name\": \"${SERVICE_NAME}\",
    \"Tags\": ${PRIMARY_TAGS},
    \"Address\": \"${NODE_IP}\",
    \"Port\": ${PORT},
    \"Check\": {
      \"DeregisterCriticalServiceAfter\": \"1m\",
      \"TTL\": \"15s\"
    }
  }"
}

# Send Heartbeat to Consul
pass_ttl() {
  curl -s -X PUT "${CONSUL_HTTP_ADDR}/v1/agent/check/pass/service:${SERVICE_ID}" > /dev/null
}

# Deregister Service
deregister_service() {
  echo "Demoted/Stopping. Deregistering service ${SERVICE_ID}..."
  curl -s -X PUT "${CONSUL_HTTP_ADDR}/v1/agent/service/deregister/${SERVICE_ID}"
}

# Start Navidrome in Background
start_app() {
  echo "Node is Primary. Starting Navidrome..."

  # Ensure shared directories exist
  mkdir -p /shared_data/plugins /shared_data/cache /shared_data/backup

  /app/navidrome &
  NAVIDROME_PID=$!
  echo "Navidrome started with PID ${NAVIDROME_PID}"
}

# Stop Navidrome
stop_app() {
  if [ "${NAVIDROME_PID}" -gt 0 ]; then
    echo "Stopping Navidrome (PID ${NAVIDROME_PID})..."
    kill -SIGTERM "${NAVIDROME_PID}"
    wait "${NAVIDROME_PID}" 2>/dev/null || true
    NAVIDROME_PID=0
  fi
}

# --- Signal Handling (The Safety Net) ---
# If Nomad stops the container, we stop the app and deregister.
cleanup() {
  echo "Caught signal, shutting down..."
  stop_app
  deregister_service
  exit 0
}

trap cleanup TERM INT

# --- Main Loop ---

echo "Starting Supervisor. Waiting for leadership to settle..."
echo "Node IP: $NODE_IP"
echo "Consul: $CONSUL_HTTP_ADDR"

# Small sleep to let LiteFS settle and leadership election complete
sleep 5

while true; do
  # In LiteFS 0.5, the .primary file exists ONLY on replicas.
  if [ ! -f "$DB_LOCK_FILE" ]; then
    # === WE ARE PRIMARY ===

    # 1. If the app is not running, start it and register
    if [ "${NAVIDROME_PID}" -eq 0 ] || ! kill -0 "${NAVIDROME_PID}" 2>/dev/null; then
      if [ "${NAVIDROME_PID}" -gt 0 ]; then
        echo "CRITICAL: Navidrome crashed! Restarting..."
      fi
      start_app
      register_service
    fi

    # 2. Maintain the heartbeat (TTL)
    pass_ttl

  else
    # === WE ARE REPLICA ===

    # If the app is running (we were just demoted), stop it
    if [ "${NAVIDROME_PID}" -gt 0 ]; then
      echo "Lost leadership. Demoting..."
      stop_app
      deregister_service
    fi

    # Replicas register no service, which keeps Consul clean.
  fi

  # Sleep short enough to refresh the TTL (every 5s is safe for a 15s TTL)
  sleep 5 &
  wait $!  # Waiting on a background sleep lets the 'trap' interrupt it instantly
done
26 host-check.nomad Normal file
@@ -0,0 +1,26 @@
job "host-check" {
  datacenters = ["dc1"]
  type        = "batch"

  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "odroid7"
  }

  group "check" {
    task "ss" {
      driver = "raw_exec"
      config {
        command = "ss"
        args    = ["-tln"]
      }
    }
    task "ufw" {
      driver = "raw_exec"
      config {
        command = "ufw"
        args    = ["status"]
      }
    }
  }
}
@@ -1,38 +0,0 @@
job "jfs-controller" {
  datacenters = ["dc1"]
  type        = "system"

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image = "juicedata/juicefs-csi-driver:v0.31.1"

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--nodeid=test",
          "--v=5",
          "--by-process=true"
        ]

        privileged = true
      }

      csi_plugin {
        id        = "juicefs0"
        type      = "controller"
        mount_dir = "/csi"
      }
      resources {
        cpu    = 100
        memory = 512
      }
      env {
        POD_NAME      = "csi-controller"
        POD_NAMESPACE = "default"
      }
    }
  }
}
@@ -1,63 +0,0 @@
job "jfs-node" {
  datacenters = ["dc1"]
  type        = "system"

  group "nodes" {
    network {
      port "metrics" {
        static = 9567
        to     = 8080
      }
    }

    service {
      name = "juicefs-metrics"
      port = "metrics"
      tags = ["prometheus"]
      check {
        type     = "http"
        path     = "/metrics"
        interval = "10s"
        timeout  = "2s"
      }
    }

    task "juicefs-plugin" {
      driver = "docker"

      config {
        image             = "juicedata/juicefs-csi-driver:v0.31.1"
        memory_hard_limit = 2048
        ports             = ["metrics"]
        args = [
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
          "--nodeid=${node.unique.name}",
          "--by-process=true",
        ]

        privileged = true
      }

      csi_plugin {
        id             = "juicefs0"
        type           = "node"
        mount_dir      = "/csi"
        health_timeout = "3m"
      }
      resources {
        cpu    = 100
        memory = 100
      }
      env {
        POD_NAME      = "csi-node"
        POD_NAMESPACE = "default"
        # Aggregates metrics from children onto the 8080 port
        JFS_METRICS = "0.0.0.0:8080"
        # Ensures mounts run as background processes managed by the driver
        JFS_MOUNT_MODE = "process"
      }
    }
  }
}
10 litefs.yml
@@ -15,11 +15,9 @@ lease:

# Internal HTTP API for replication
http:
  addr: ":20202"
  addr: "0.0.0.0:20202"

# The HTTP Proxy routes traffic to handle write-forwarding
# It listens on 8080 inside the container.
# Nomad will map host 4533 to this port.
proxy:
  addr: ":8080"
  target: "localhost:4533"
@@ -31,4 +29,8 @@ proxy:
  - "*.jpg"
  - "*.jpeg"
  - "*.gif"
  - "*.svg"
  - "*.svg"

# Commands to run only on the primary node.
exec:
  - cmd: "/usr/local/bin/entrypoint.sh"
@@ -1,92 +0,0 @@
job "navidrome" {
  datacenters = ["dc1"]
  type        = "service"

  constraint {
    attribute = "${attr.unique.hostname}"
    operator  = "regexp"
    value     = "odroid.*"
  }

  group "navidrome" {
    count = 1

    volume "navidrome-csi-vol" {
      type            = "csi"
      source          = "navidrome-volume" # This must match the 'id' in your volume registration
      attachment_mode = "file-system"
      access_mode     = "multi-node-multi-writer"
    }

    # Main Navidrome task
    task "navidrome" {
      driver = "docker"

      volume_mount {
        volume      = "navidrome-csi-vol" # Matches the name in the volume block above
        destination = "/data"             # Where it appears inside the container
        read_only   = false
      }

      config {
        image             = "ghcr.io/navidrome/navidrome:latest"
        memory_hard_limit = "2048"
        ports             = ["http"]
        volumes = [
          "/mnt/Public/Downloads/Clean_Music:/music/CleanMusic:ro",
          "/mnt/Public/Downloads/news/slskd/downloads:/music/slskd:ro",
          "/mnt/Public/Downloads/incoming_music:/music/incomingmusic:ro"
        ]
      }
      env {
        ND_DATAFOLDER  = "/data"
        ND_CACHEFOLDER = "/data/cache"
        ND_CONFIGFILE  = "/data/navidrome.toml"
        ND_DBPATH      = "/data/navidrome.db?_busy_timeout=30000&_journal_mode=DELETE&_foreign_keys=on&synchronous=NORMAL&cache=shared&nolock=1"
        ND_SCANSCHEDULE = "32 8-20 * * *"
        ND_LOGLEVEL = "trace"
        ND_REVERSEPROXYWHITELIST = "0.0.0.0/0"
        ND_REVERSEPROXYUSERHEADER = "X-Forwarded-User"
        ND_SCANNER_GROUPALBUMRELEASES = "False"
        ND_BACKUP_PATH = "/data/backups"
        ND_BACKUP_SCHEDULE = "0 0 * * *"
        ND_BACKUP_COUNT = "7"
      }
      resources {
        cpu    = 100
        memory = 128
      }
      service {
        name = "navidrome"
        tags = [
          "navidrome",
          "web",
          "urlprefix-/navidrome",
          "tools",
          "traefik.http.routers.navidromelan.rule=Host(`navidrome.service.dc1.consul`)",
          "traefik.http.routers.navidromewan.rule=Host(`m.fbleagh.duckdns.org`)",
          "traefik.http.routers.navidromewan.middlewares=dex@consulcatalog",
          "traefik.http.routers.navidromewan.tls=true",
        ]
        port = "http"
        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }

    network {
      port "http" {
        static = 4533
        to     = 4533
      }
    }
  }
}
@@ -8,7 +8,7 @@ job "navidrome-litefs" {
  }

  group "navidrome" {
    count = 2
    count = 4

    update {
      max_parallel = 1
@@ -25,7 +25,7 @@ job "navidrome-litefs" {
      # Request static ports on the host
      port "http" {
        static = 4533
        to     = 8080 # Maps host 4533 to container 8080 (LiteFS Proxy)
        to     = 4533 # Direct to Navidrome
      }
      port "litefs" {
        static = 20202
@@ -37,12 +37,11 @@ job "navidrome-litefs" {
      driver = "docker"

      config {
        image = "gitea.service.dc1.fbleagh.duckdns.org/sstent/navidrome-litefs:latest"
        image = "gitea.service.dc1.fbleagh.duckdns.org/sstent/navidrome-litefs:e56fb94fdc0ac1f70abdb613b64ce6b4d7a770cf"
        privileged = true # Still needed for FUSE
        ports      = ["http", "litefs"]
        force_pull = true

        # Removed network_mode = "host"

        volumes = [
          "/mnt/configs/navidrome_litefs:/var/lib/litefs",
          "/mnt/Public/configs/navidrome:/shared_data",
@@ -56,45 +55,20 @@ job "navidrome-litefs" {
        # LiteFS Config
        CONSUL_URL   = "http://${attr.unique.network.ip-address}:8500"
        ADVERTISE_IP = "${attr.unique.network.ip-address}"
        PORT = "8080" # Internal proxy port
        PORT = "8080" # Internal proxy port (unused but kept)

        # Navidrome Config
        ND_DATAFOLDER = "/local/data"
        ND_CACHEFOLDER = "/shared_data/cache"
        ND_CONFIGFILE = "/local/data/navidrome.toml"
        ND_DATAFOLDER = "/data"
        ND_PLUGINS_FOLDER = "/shared_data/plugins"
        ND_CACHEFOLDER = "/shared_data/cache"
        ND_BACKUP_PATH = "/shared_data/backup"

        # Database is on the LiteFS FUSE mount
        ND_DBPATH = "/data/navidrome.db?_busy_timeout=30000&_journal_mode=WAL&_foreign_keys=on&synchronous=NORMAL"

        ND_SCANSCHEDULE = "0"
        ND_SCANNER_FSWATCHER_ENABLED = "false"
        ND_LOGLEVEL = "info"
        ND_REVERSEPROXYWHITELIST = "0.0.0.0/0"
        ND_REVERSEPROXYUSERHEADER = "X-Forwarded-User"
      }

      service {
        name = "navidrome"
        tags = [
          "navidrome",
          "web",
          "traefik.enable=true",
          "urlprefix-/navidrome",
          "tools",
          "traefik.http.routers.navidromelan.rule=Host(`navidrome.service.dc1.consul`)",
          "traefik.http.routers.navidromewan.rule=Host(`m.fbleagh.duckdns.org`)",
          "traefik.http.routers.navidromewan.middlewares=dex@consulcatalog",
          "traefik.http.routers.navidromewan.tls=true",
        ]
        port = "http"

        check {
          type     = "http"
          path     = "/app"
          interval = "10s"
          timeout  = "2s"
        }
      }
      # NO service block here! Managed by the supervisor script inside the container.

      resources {
        cpu = 500
@@ -102,4 +76,4 @@ job "navidrome-litefs" {
      }
    }
  }
}
}
@@ -1,35 +0,0 @@
type = "csi"
id   = "navidrome-volume"
name = "navidrome-volume"

# This UUID was generated during the Postgres storage format
external_id = "56783f1f-d9c6-45fd-baec-56fa6c33776b"

capacity_min = "10GiB"
capacity_max = "10GiB"

capability {
  access_mode     = "multi-node-multi-writer"
  attachment_mode = "file-system"
}

plugin_id = "juicefs0"

context {
  writeback     = "false"
  delayed-write = "true"
  upload-delay  = "1m"
  cache-size    = "1024"
  buffer-size   = "128"
  attr-cache    = "60"
  entry-cache   = "60"
  enable-mmap   = "true"
  metacache     = "true"
}

secrets {
  name    = "navidrome-volume"
  metaurl = "postgres://postgres:postgres@master.postgres.service.dc1.consul:5432/juicefs-navidrome"
  storage = "postgres"
  bucket  = "postgres://postgres:postgres@master.postgres.service.dc1.consul:5432/juicefs-navidrome-storage"
}
20 nomad-config-check.nomad Normal file
@@ -0,0 +1,20 @@
job "nomad-config-check" {
  datacenters = ["dc1"]
  type        = "batch"

  group "check" {
    count = 1
    constraint {
      attribute = "${attr.unique.hostname}"
      value     = "odroid7"
    }

    task "config" {
      driver = "raw_exec"
      config {
        command = "grep"
        args    = ["-r", "disable_script_checks", "/etc/nomad.d/"]
      }
    }
  }
}
34 scan.nomad Normal file
@@ -0,0 +1,34 @@
job "port-discovery" {
  datacenters = ["dc1"]
  type        = "batch"

  group "scan" {
    count = 1
    constraint {
      attribute = "${attr.unique.hostname}"
      value     = "odroid6"
    }

    task "scan" {
      driver = "docker"
      config {
        image        = "busybox"
        network_mode = "host"
        command      = "sh"
        args         = ["local/scan.sh"]
      }
      template {
        data = <<EOF
#!/bin/sh
TARGET="192.168.4.227"
for p in 8085 8086 8087; do
  echo "Testing $p..."
  nc -zv -w 3 $TARGET $p 2>&1 | grep -q "refused" && echo "MATCH: $p is AVAILABLE (Refused)"
  nc -zv -w 3 $TARGET $p 2>&1 | grep -q "succeeded" && echo "BUSY: $p is IN USE"
done
EOF
        destination = "local/scan.sh"
      }
    }
  }
}
11 scripts/cluster_status/Makefile Normal file
@@ -0,0 +1,11 @@
.PHONY: setup test run

setup:
	python3 -m venv .venv
	. .venv/bin/activate && pip install -r requirements.txt

test:
	. .venv/bin/activate && pytest -v --cov=.

run:
	. .venv/bin/activate && PYTHONPATH=. python3 cli.py
0 scripts/cluster_status/__init__.py Normal file
53 scripts/cluster_status/cli.py Executable file
@@ -0,0 +1,53 @@
#!/usr/bin/env python3
import argparse
import sys
import config
import cluster_aggregator
import output_formatter
import nomad_client

def parse_args():
    parser = argparse.ArgumentParser(description="Monitor Navidrome LiteFS/Consul cluster status.")
    parser.add_argument("--consul-url", help="Override Consul API URL (default from env or hardcoded)")
    parser.add_argument("--no-color", action="store_true", help="Disable colorized output")
    parser.add_argument("--restart", help="Restart the allocation on the specified node")
    return parser.parse_args()

def main():
    args = parse_args()

    # Resolve Consul URL
    consul_url = config.get_consul_url(args.consul_url)

    # Handle restart if requested
    if args.restart:
        print(f"Attempting to restart allocation on node: {args.restart}...")
        alloc_id = nomad_client.get_allocation_id(args.restart, "navidrome-litefs")
        if alloc_id:
            if nomad_client.restart_allocation(alloc_id):
                print(f"Successfully sent restart signal to allocation {alloc_id}")
            else:
                print(f"Failed to restart allocation {alloc_id}")
        else:
            print(f"Could not find allocation for node {args.restart}")
        print("-" * 30)

    try:
        # Fetch and aggregate data
        cluster_data = cluster_aggregator.get_cluster_status(consul_url)

        # Format and print output
        print(output_formatter.format_summary(cluster_data, use_color=not args.no_color))
        print("\n" + output_formatter.format_node_table(cluster_data["nodes"], use_color=not args.no_color))

        # Diagnostics
        diagnostics = output_formatter.format_diagnostics(cluster_data["nodes"], use_color=not args.no_color)
        if diagnostics:
            print(diagnostics)

    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()
96 scripts/cluster_status/cluster_aggregator.py Normal file
@@ -0,0 +1,96 @@
import consul_client
import litefs_client
import nomad_client

def get_cluster_status(consul_url, job_id="navidrome-litefs"):
    """
    Aggregates cluster data from Nomad (Discovery), LiteFS (Role), and Consul (Routing Health).
    """
    # 1. Discover all nodes via Nomad Allocations
    allocations = nomad_client.get_job_allocations(job_id)
    nomad_available = bool(nomad_client.get_node_map())

    # 2. Get all Consul registrations for 'navidrome'
    consul_services = consul_client.get_cluster_services(consul_url)
    # Create a map for easy lookup by IP
    consul_map = {s["address"]: s for s in consul_services}

    aggregated_nodes = []
    is_healthy = True
    primary_count = 0

    for alloc in allocations:
        node_name = alloc["node"]
        address = alloc["ip"]
        alloc_id = alloc["id"]

        # 3. Get LiteFS Status
        litefs_status = litefs_client.get_node_status(address, alloc_id=alloc_id)

        # 4. Match with Consul info
        consul_info = consul_map.get(address)

        node_data = {
            "node": node_name,
            "address": address,
            "alloc_id": alloc_id,
            "litefs_primary": litefs_status.get("is_primary", False),
            "candidate": litefs_status.get("candidate", False),
            "uptime": alloc.get("uptime", "N/A"),
            "replication_lag": litefs_status.get("replication_lag", "N/A"),
            "dbs": litefs_status.get("dbs", {}),
            "litefs_error": litefs_status.get("error"),
            "nomad_logs": None
        }

        # Legacy compat for formatter
        node_data["active_dbs"] = list(node_data["dbs"].keys())

        if node_data["litefs_primary"]:
            primary_count += 1
            node_data["role"] = "primary"
        else:
            node_data["role"] = "replica"

        # 5. Determine Consul status
        if consul_info:
            node_data["status"] = consul_info["status"]
            node_data["check_output"] = consul_info["check_output"]
            if node_data["status"] != "passing":
                is_healthy = False
                node_data["nomad_logs"] = nomad_client.get_allocation_logs(alloc_id)
        else:
            # Not in Consul
            if node_data["litefs_primary"]:
                # Primary in LiteFS but not registered in Consul is an error (unless it just started)
                node_data["status"] = "unregistered"
                is_healthy = False
                node_data["nomad_logs"] = nomad_client.get_allocation_logs(alloc_id)
node_data["nomad_logs"] = nomad_client.get_allocation_logs(alloc_id)
|
||||
else:
|
||||
# Replicas are expected to be unregistered in the new model
|
||||
node_data["status"] = "standby"
|
||||
node_data["check_output"] = "Clean catalog (expected for replica)"
|
||||
|
||||
aggregated_nodes.append(node_data)
|
||||
|
||||
# Final health check
|
||||
health = "Healthy"
|
||||
if not is_healthy:
|
||||
health = "Unhealthy"
|
||||
|
||||
if primary_count == 0:
|
||||
health = "No Primary Detected"
|
||||
elif primary_count > 1:
|
||||
health = "Split Brain Detected (Multiple Primaries)"
|
||||
|
||||
# Global warning if no DBs found on any node
|
||||
all_dbs = [db for n in aggregated_nodes for db in n.get("active_dbs", [])]
|
||||
if not all_dbs:
|
||||
health = f"{health} (WARNING: No LiteFS Databases Found)"
|
||||
|
||||
return {
|
||||
"health": health,
|
||||
"nodes": aggregated_nodes,
|
||||
"primary_count": primary_count,
|
||||
"nomad_available": nomad_available
|
||||
}
|
||||
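The health roll-up at the end of `get_cluster_status` (primary-count anomalies override per-node health, then a database warning is appended) can be sketched in isolation. `classify_health` is a hypothetical helper written for illustration, not part of this diff:

```python
def classify_health(primary_count, all_nodes_passing, any_dbs):
    # Primary-count anomalies override node-level health
    if primary_count == 0:
        health = "No Primary Detected"
    elif primary_count > 1:
        health = "Split Brain Detected (Multiple Primaries)"
    else:
        health = "Healthy" if all_nodes_passing else "Unhealthy"
    # Global warning when no LiteFS databases exist anywhere
    if not any_dbs:
        health = f"{health} (WARNING: No LiteFS Databases Found)"
    return health

print(classify_health(1, True, True))   # -> Healthy
print(classify_health(1, False, True))  # -> Unhealthy
print(classify_health(2, True, True))   # -> Split Brain Detected (Multiple Primaries)
```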
15
scripts/cluster_status/config.py
Normal file
@@ -0,0 +1,15 @@
import os

DEFAULT_CONSUL_URL = "http://consul.service.dc1.consul:8500"


def get_consul_url(url_arg=None):
    """
    Resolves the Consul URL in the following order:
    1. CLI argument (url_arg)
    2. Environment variable (CONSUL_HTTP_ADDR)
    3. Default (http://consul.service.dc1.consul:8500)
    """
    if url_arg:
        return url_arg

    return os.environ.get("CONSUL_HTTP_ADDR", DEFAULT_CONSUL_URL)
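The three-level resolution order can be exercised standalone; the snippet below mirrors `config.get_consul_url` inline so it runs without the module on the path:

```python
import os

DEFAULT_CONSUL_URL = "http://consul.service.dc1.consul:8500"

def get_consul_url(url_arg=None):
    # Precedence: CLI argument > CONSUL_HTTP_ADDR > default
    if url_arg:
        return url_arg
    return os.environ.get("CONSUL_HTTP_ADDR", DEFAULT_CONSUL_URL)

os.environ.pop("CONSUL_HTTP_ADDR", None)
print(get_consul_url())                            # -> the default URL
os.environ["CONSUL_HTTP_ADDR"] = "http://10.0.0.1:8500"
print(get_consul_url())                            # -> http://10.0.0.1:8500
print(get_consul_url("http://cli-override:8500"))  # -> http://cli-override:8500
```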
46
scripts/cluster_status/consul_client.py
Normal file
@@ -0,0 +1,46 @@
import requests


def get_cluster_services(consul_url):
    """
    Queries the Consul health API for all 'navidrome' services.
    Returns a list of dictionaries with node info.
    """
    services = []

    url = f"{consul_url}/v1/health/service/navidrome"
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        data = response.json()

        for item in data:
            node_name = item["Node"]["Node"]
            address = item["Node"]["Address"]
            port = item["Service"]["Port"]

            # Determine overall status from checks and extract output
            checks = item.get("Checks", [])
            status = "passing"
            check_output = ""
            for check in checks:
                if check["Status"] != "passing":
                    status = check["Status"]
                    check_output = check.get("Output", "")
                    break
            else:
                # All checks passing: keep the last check's output (if any) for context
                if checks and not check_output:
                    check_output = checks[-1].get("Output", "")

            services.append({
                "node": node_name,
                "address": address,
                "port": port,
                "role": "primary",  # Registration in Consul as 'navidrome' implies primary
                "status": status,
                "service_id": item["Service"]["ID"],
                "check_output": check_output
            })
    except Exception as e:
        print(f"Error fetching navidrome services from Consul: {e}")

    return services
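The first-failing-check rule above can be isolated into a small sketch; `overall_status` is a hypothetical helper that mirrors the check loop in `get_cluster_services`:

```python
def overall_status(checks):
    # The first non-passing check decides the status and supplies its output
    for check in checks:
        if check["Status"] != "passing":
            return check["Status"], check.get("Output", "")
    # All passing: keep the last check's output (if any) for context
    output = checks[-1].get("Output", "") if checks else ""
    return "passing", output

print(overall_status([{"Status": "passing"},
                      {"Status": "critical", "Output": "HTTP 500"}]))
# -> ('critical', 'HTTP 500')
```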
93
scripts/cluster_status/litefs_client.py
Normal file
@@ -0,0 +1,93 @@
import requests
import nomad_client
import re


def parse_litefs_status(output):
    """
    Parses the text output of 'litefs status'.
    """
    status = {}

    # Extract Primary
    primary_match = re.search(r"Primary:\s+(true|false)", output, re.IGNORECASE)
    if primary_match:
        status["is_primary"] = primary_match.group(1).lower() == "true"

    # Extract Uptime
    uptime_match = re.search(r"Uptime:\s+(\S+)", output)
    if uptime_match:
        status["uptime"] = uptime_match.group(1)

    # Extract Replication Lag
    lag_match = re.search(r"Replication Lag:\s+(\S+)", output)
    if lag_match:
        status["replication_lag"] = lag_match.group(1)

    return status


def get_node_status(node_address, port=20202, alloc_id=None):
    """
    Queries the LiteFS HTTP API on a specific node for its status.
    Tries /status first, then /debug/vars, then falls back to 'nomad alloc exec'.
    """
    # 1. Try /status
    url = f"http://{node_address}:{port}/status"
    try:
        response = requests.get(url, timeout=3)
        if response.status_code == 200:
            data = response.json()
            status = {
                "is_primary": data.get("primary", False),
                "uptime": data.get("uptime", 0),
                "advertise_url": data.get("advertiseURL", ""),
                "dbs": data.get("dbs", {})
            }
            if "replicationLag" in data:
                status["replication_lag"] = data["replicationLag"]
            if "primaryURL" in data:
                status["primary_url"] = data["primaryURL"]
            return status
    except Exception:
        pass

    # 2. Try /debug/vars
    url = f"http://{node_address}:{port}/debug/vars"
    try:
        response = requests.get(url, timeout=3)
        if response.status_code == 200:
            data = response.json()
            store = data.get("store", {})
            status = {
                "is_primary": store.get("isPrimary", False),
                "candidate": store.get("candidate", False),
                "uptime": "N/A",  # Filled in later from Nomad uptime
                "advertise_url": f"http://{node_address}:{port}",
                "dbs": store.get("dbs", {})
            }
            if "replicationLag" in store:
                status["replication_lag"] = store["replicationLag"]
            return status
    except Exception:
        pass

    # 3. Fall back to 'nomad alloc exec'
    if alloc_id:
        try:
            output = nomad_client.exec_command(alloc_id, ["litefs", "status"])
            if output and "Error" not in output:
                parsed_status = parse_litefs_status(output)
                if parsed_status:
                    if "is_primary" not in parsed_status:
                        parsed_status["is_primary"] = False
                    if "uptime" not in parsed_status:
                        parsed_status["uptime"] = "N/A"
                    parsed_status["advertise_url"] = f"nomad://{alloc_id}"
                    return parsed_status
        except Exception:
            pass

    return {
        "error": "All status retrieval methods failed",
        "is_primary": False,
        "uptime": "N/A"
    }
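The regex extraction in `parse_litefs_status` can be tried against a sample of `litefs status` text output. The sample layout below is an assumption for illustration; only the field names match the patterns above:

```python
import re

SAMPLE = """\
Status:
  Primary:         true
  Uptime:          1h5m10s
  Replication Lag: 0s
"""

def parse_status(output):
    # Same three patterns as parse_litefs_status
    status = {}
    m = re.search(r"Primary:\s+(true|false)", output, re.IGNORECASE)
    if m:
        status["is_primary"] = m.group(1).lower() == "true"
    m = re.search(r"Uptime:\s+(\S+)", output)
    if m:
        status["uptime"] = m.group(1)
    m = re.search(r"Replication Lag:\s+(\S+)", output)
    if m:
        status["replication_lag"] = m.group(1)
    return status

print(parse_status(SAMPLE))
# -> {'is_primary': True, 'uptime': '1h5m10s', 'replication_lag': '0s'}
```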
211
scripts/cluster_status/nomad_client.py
Normal file
@@ -0,0 +1,211 @@
import subprocess
import re
import sys
from datetime import datetime, timezone


def get_node_map():
    """
    Returns a mapping of Node ID to Node Name.
    """
    try:
        result = subprocess.run(
            ["nomad", "node", "status"],
            capture_output=True, text=True, check=True
        )
        lines = result.stdout.splitlines()
        node_map = {}
        for line in lines:
            if line.strip() and not line.startswith("ID") and not line.startswith("=="):
                parts = re.split(r"\s+", line.strip())
                if len(parts) >= 4:
                    node_map[parts[0]] = parts[3]
        return node_map
    except FileNotFoundError:
        print("Warning: 'nomad' binary not found in PATH.", file=sys.stderr)
        return {}
    except subprocess.CalledProcessError as e:
        print(f"Warning: Failed to query Nomad nodes: {e}", file=sys.stderr)
        return {}
    except Exception as e:
        print(f"Error getting node map: {e}", file=sys.stderr)
        return {}


def get_job_allocations(job_id):
    """
    Returns a list of all active allocations for a job with their IPs and uptimes.
    """
    try:
        # 1. Get list of allocations
        result = subprocess.run(
            ["nomad", "job", "status", job_id],
            capture_output=True, text=True, check=True
        )

        alloc_ids = []
        lines = result.stdout.splitlines()
        start_parsing = False
        for line in lines:
            if "Allocations" in line:
                start_parsing = True
                continue
            if start_parsing and line.strip() and not line.startswith("ID") and not line.startswith("=="):
                parts = re.split(r"\s+", line.strip())
                if len(parts) >= 5:
                    alloc_id = parts[0]
                    # The status column position varies with verbosity,
                    # so look for 'running' in any column from the 4th onwards
                    if any(p == "running" for p in parts[3:]):
                        alloc_ids.append(alloc_id)

        # 2. For each allocation, get its IP and uptime
        allocations = []
        now = datetime.now(timezone.utc)

        for alloc_id in alloc_ids:
            res_alloc = subprocess.run(
                ["nomad", "alloc", "status", alloc_id],
                capture_output=True, text=True, check=True
            )

            node_name = ""
            ip = ""
            full_id = alloc_id
            uptime = "N/A"

            for l in res_alloc.stdout.splitlines():
                if l.startswith("ID") and "=" in l:
                    full_id = l.split("=")[1].strip()
                if l.startswith("Node Name") and "=" in l:
                    node_name = l.split("=")[1].strip()
                # Extract IP from Allocation Addresses
                if "*litefs" in l:
                    # e.g. "*litefs  yes  1.1.1.1:20202 -> 20202"
                    m = re.search(r"(\d+\.\d+\.\d+\.\d+):", l)
                    if m:
                        ip = m.group(1)

                # Extract uptime from "Started At"
                if "Started At" in l and "=" in l:
                    # e.g. "Started At = 2026-02-09T14:04:28Z"
                    ts_str = l.split("=")[1].strip()
                    if ts_str and ts_str != "N/A":
                        try:
                            # Parse ISO timestamp
                            started_at = datetime.fromisoformat(ts_str.replace("Z", "+00:00"))
                            duration = now - started_at
                            # Format duration
                            secs = int(duration.total_seconds())
                            if secs < 60:
                                uptime = f"{secs}s"
                            elif secs < 3600:
                                uptime = f"{secs // 60}m{secs % 60}s"
                            else:
                                uptime = f"{secs // 3600}h{(secs % 3600) // 60}m"
                        except Exception:
                            uptime = ts_str

            allocations.append({
                "id": full_id,
                "node": node_name,
                "ip": ip,
                "uptime": uptime
            })

        return allocations

    except Exception as e:
        print(f"Error getting job allocations: {e}", file=sys.stderr)
        return []


def get_allocation_id(node_name, job_id):
    """
    Finds the FULL allocation ID for a specific node and job.
    """
    node_map = get_node_map()
    try:
        result = subprocess.run(
            ["nomad", "job", "status", job_id],
            capture_output=True, text=True, check=True
        )

        lines = result.stdout.splitlines()
        start_parsing = False
        for line in lines:
            if "Allocations" in line:
                start_parsing = True
                continue
            if start_parsing and line.strip() and not line.startswith("ID") and not line.startswith("=="):
                parts = re.split(r"\s+", line.strip())
                if len(parts) >= 2:
                    alloc_id = parts[0]
                    node_id = parts[1]

                    resolved_name = node_map.get(node_id, "")
                    if node_id == node_name or resolved_name == node_name:
                        # Now get the FULL ID using 'nomad alloc status'
                        res_alloc = subprocess.run(
                            ["nomad", "alloc", "status", alloc_id],
                            capture_output=True, text=True, check=True
                        )
                        for l in res_alloc.stdout.splitlines():
                            if l.startswith("ID") and "=" in l:
                                return l.split("=")[1].strip()
                        return alloc_id

    except FileNotFoundError:
        return None  # Warning already printed by get_node_map
    except Exception as e:
        print(f"Error getting allocation ID: {e}", file=sys.stderr)

    return None


def get_allocation_logs(alloc_id, tail=20):
    """
    Fetches the last N lines of stderr for an allocation.
    """
    try:
        # Try with the task name first, then without
        try:
            result = subprocess.run(
                ["nomad", "alloc", "logs", "-stderr", "-task", "navidrome", "-n", str(tail), alloc_id],
                capture_output=True, text=True, check=True
            )
            return result.stdout
        except subprocess.CalledProcessError:
            result = subprocess.run(
                ["nomad", "alloc", "logs", "-stderr", "-n", str(tail), alloc_id],
                capture_output=True, text=True, check=True
            )
            return result.stdout
    except Exception as e:
        # Don't print a stack trace, just the error
        return f"Nomad Error: {str(e)}"


def exec_command(alloc_id, command, task="navidrome"):
    """
    Executes a command inside a specific allocation and task.
    """
    try:
        args = ["nomad", "alloc", "exec", "-task", task, alloc_id] + command
        result = subprocess.run(
            args,
            capture_output=True, text=True, check=True
        )
        return result.stdout
    except Exception as e:
        # Don't print a stack trace, just return the error string
        return f"Nomad Error: {str(e)}"


def restart_allocation(alloc_id):
    """
    Restarts a specific allocation.
    """
    try:
        subprocess.run(
            ["nomad", "alloc", "restart", alloc_id],
            capture_output=True, text=True, check=True
        )
        return True
    except Exception as e:
        print(f"Error restarting allocation: {e}", file=sys.stderr)
        return False
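The "Started At" to compact-duration conversion buried inside `get_job_allocations` can be pulled out as a standalone sketch; `format_uptime` is a hypothetical helper using the same bucketing:

```python
from datetime import datetime, timezone

def format_uptime(started_at_iso, now=None):
    # Turn an allocation's "Started At" timestamp into a compact duration
    now = now or datetime.now(timezone.utc)
    started = datetime.fromisoformat(started_at_iso.replace("Z", "+00:00"))
    secs = int((now - started).total_seconds())
    if secs < 60:
        return f"{secs}s"
    if secs < 3600:
        return f"{secs // 60}m{secs % 60}s"
    return f"{secs // 3600}h{(secs % 3600) // 60}m"

now = datetime(2026, 2, 9, 15, 0, 0, tzinfo=timezone.utc)
print(format_uptime("2026-02-09T14:04:28Z", now))  # -> 55m32s
print(format_uptime("2026-02-09T12:00:00Z", now))  # -> 3h0m
```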
142
scripts/cluster_status/output_formatter.py
Normal file
@@ -0,0 +1,142 @@
from tabulate import tabulate

# ANSI color codes
GREEN = "\033[92m"
RED = "\033[91m"
CYAN = "\033[96m"
YELLOW = "\033[93m"
RESET = "\033[0m"
BOLD = "\033[1m"


def colorize(text, color, use_color=True):
    if not use_color:
        return text
    return f"{color}{text}{RESET}"


def format_summary(cluster_data, use_color=True):
    """
    Formats the cluster health summary.
    """
    health = cluster_data["health"]
    color = GREEN if health == "Healthy" else RED
    if health == "Split Brain Detected (Multiple Primaries)":
        color = YELLOW

    summary = [
        f"{colorize('Cluster Health:', BOLD, use_color)} {colorize(health, color, use_color)}",
        f"{colorize('Total Nodes:', BOLD, use_color)} {len(cluster_data['nodes'])}",
        f"{colorize('Primaries:', BOLD, use_color)} {cluster_data['primary_count']}",
    ]

    if not cluster_data.get("nomad_available", True):
        summary.append(colorize("WARNING: Nomad CLI unavailable or connectivity failed. Logs and uptime may be missing.", RED, use_color))

    summary.append("-" * 30)
    return "\n".join(summary)


def format_node_table(nodes, use_color=True):
    """
    Formats the node list as a table.
    """
    headers = ["Node", "Role", "Consul Status", "LiteFS Role", "Cand", "Uptime", "Lag", "DBs", "LiteFS Info"]
    table_data = []

    for node in nodes:
        # Consul status color
        status = node["status"]
        if status == "passing":
            status_color = GREEN
        elif status == "standby":
            status_color = CYAN
        elif status == "unregistered":
            status_color = YELLOW
        else:
            status_color = RED

        colored_status = colorize(status, status_color, use_color)

        # Role color
        role = node["role"]
        role_color = CYAN if role == "primary" else RESET
        colored_role = colorize(role, role_color, use_color)

        # LiteFS role color & consistency check
        litefs_primary = node["litefs_primary"]
        litefs_role = "primary" if litefs_primary else "replica"

        # Candidate status
        candidate = "✓" if node.get("candidate") else "✗"
        candidate_color = GREEN if node.get("candidate") else RESET
        colored_candidate = colorize(candidate, candidate_color, use_color)

        # Highlight the discrepancy if Consul and LiteFS disagree
        litefs_role_color = RESET
        if (role == "primary" and not litefs_primary) or (role == "replica" and litefs_primary):
            litefs_role_color = YELLOW
            litefs_role = f"!! {litefs_role} !!"
        elif litefs_primary:
            litefs_role_color = CYAN

        colored_litefs_role = colorize(litefs_role, litefs_role_color, use_color)

        # Database details (name, txid, checksum)
        db_infos = []
        dbs = node.get("dbs", {})
        for db_name, db_data in dbs.items():
            txid = db_data.get("txid", "0")
            chk = db_data.get("checksum", "0")
            # Only show the first 8 chars of the checksum for space
            db_infos.append(f"{db_name} (tx:{int(txid, 16)}, chk:{chk[:8]})")

        db_str = "\n".join(db_infos) if db_infos else "None"

        # Error info
        if node.get("litefs_error"):
            info = colorize("LiteFS API Error", RED, use_color)
        else:
            info = node.get("address", "")

        table_data.append([
            colorize(node["node"], BOLD, use_color),
            colored_role,
            colored_status,
            colored_litefs_role,
            colored_candidate,
            node.get("uptime", "N/A"),
            node.get("replication_lag", "N/A"),
            db_str,
            info
        ])

    return tabulate(table_data, headers=headers, tablefmt="simple")


def format_diagnostics(nodes, use_color=True):
    """
    Formats detailed diagnostic information for nodes with errors.
    """
    # Only show diagnostics if the status is critical/unregistered OR there is a LiteFS error.
    # Ignore 'standby' since it is expected for replicas.
    error_nodes = [n for n in nodes if (n["status"] not in ["passing", "standby"]) or n.get("litefs_error")]

    if not error_nodes:
        return ""

    output = ["", colorize("DIAGNOSTICS", BOLD, use_color), "=" * 20]

    for node in error_nodes:
        output.append(f"\n{colorize('Node:', BOLD, use_color)} {colorize(node['node'], RED, use_color)}")

        if node["status"] != "passing":
            output.append(f"  {colorize('Consul Check Status:', BOLD, use_color)} {colorize(node['status'], RED, use_color)}")
            if node.get("check_output"):
                output.append(f"  {colorize('Consul Check Output:', BOLD, use_color)}\n    {node['check_output'].strip()}")

        if node.get("nomad_logs"):
            output.append(f"  {colorize('Nomad Stderr Logs (last 20 lines):', BOLD, use_color)}\n{node['nomad_logs']}")

        if node.get("litefs_error"):
            output.append(f"  {colorize('LiteFS API Error:', BOLD, use_color)} {colorize(node['litefs_error'], RED, use_color)}")

    return "\n".join(output)
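The Consul-vs-LiteFS disagreement flag in `format_node_table` reduces to a small predicate; `litefs_role_label` is a hypothetical helper equivalent to the two-clause condition above (for roles limited to primary/replica):

```python
def litefs_role_label(consul_role, litefs_primary):
    # Flag disagreement between the Consul routing role and LiteFS's own view
    litefs_role = "primary" if litefs_primary else "replica"
    if (consul_role == "primary") != litefs_primary:
        return f"!! {litefs_role} !!"
    return litefs_role

print(litefs_role_label("primary", True))   # -> primary
print(litefs_role_label("primary", False))  # -> !! replica !!
print(litefs_role_label("replica", True))   # -> !! primary !!
```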
4
scripts/cluster_status/requirements.txt
Normal file
@@ -0,0 +1,4 @@
requests
tabulate
pytest
pytest-cov
0
scripts/cluster_status/tests/__init__.py
Normal file
61
scripts/cluster_status/tests/test_aggregator.py
Normal file
@@ -0,0 +1,61 @@
import pytest
from unittest.mock import patch, MagicMock
import cluster_aggregator


@patch("consul_client.get_cluster_services")
@patch("litefs_client.get_node_status")
@patch("nomad_client.get_job_allocations")
@patch("nomad_client.get_node_map")
def test_aggregate_cluster_status(mock_node_map, mock_nomad_allocs, mock_litefs, mock_consul):
    """Test aggregating Nomad, Consul and LiteFS data."""
    mock_node_map.return_value = {"id": "name"}
    # Mock Nomad allocations
    mock_nomad_allocs.return_value = [
        {"id": "alloc1", "node": "node1", "ip": "1.1.1.1"},
        {"id": "alloc2", "node": "node2", "ip": "1.1.1.2"}
    ]

    # Mock Consul data (only node1 is registered as primary)
    mock_consul.return_value = [
        {"node": "node1", "address": "1.1.1.1", "role": "primary", "status": "passing", "check_output": "OK"}
    ]

    # Mock LiteFS data
    def litefs_side_effect(addr, **kwargs):
        if addr == "1.1.1.1":
            return {"is_primary": True, "candidate": True, "uptime": 100, "dbs": {"db1": {"txid": "0000000000000001", "checksum": "abc"}}}
        return {"is_primary": False, "candidate": True, "uptime": 50, "dbs": {"db1": {"txid": "0000000000000001", "checksum": "abc"}}}

    mock_litefs.side_effect = litefs_side_effect

    cluster_data = cluster_aggregator.get_cluster_status("http://consul:8500")

    assert cluster_data["health"] == "Healthy"
    assert len(cluster_data["nodes"]) == 2

    node1 = next(n for n in cluster_data["nodes"] if n["node"] == "node1")
    assert node1["litefs_primary"] is True
    assert node1["candidate"] is True
    assert "db1" in node1["dbs"]


@patch("consul_client.get_cluster_services")
@patch("litefs_client.get_node_status")
@patch("nomad_client.get_job_allocations")
@patch("nomad_client.get_allocation_logs")
@patch("nomad_client.get_node_map")
def test_aggregate_cluster_status_unhealthy(mock_node_map, mock_nomad_logs, mock_nomad_allocs, mock_litefs, mock_consul):
    """Test health calculation when the primary is unregistered or failing."""
    mock_node_map.return_value = {"id": "name"}
    mock_nomad_allocs.return_value = [
        {"id": "alloc1", "node": "node1", "ip": "1.1.1.1"}
    ]
    # Primary in LiteFS but missing in Consul
    mock_litefs.return_value = {"is_primary": True, "candidate": True, "uptime": 100, "dbs": {"db1": {"txid": "1", "checksum": "abc"}}}
    mock_consul.return_value = []
    mock_nomad_logs.return_value = "error logs"

    cluster_data = cluster_aggregator.get_cluster_status("http://consul:8500")
    assert cluster_data["health"] == "Unhealthy"
    assert cluster_data["nodes"][0]["status"] == "unregistered"
    assert cluster_data["nodes"][0]["nomad_logs"] == "error logs"
27
scripts/cluster_status/tests/test_config.py
Normal file
@@ -0,0 +1,27 @@
import os
import pytest
import config


def test_default_consul_url():
    """Test that the default Consul URL is returned when no env var is set."""
    # Ensure the env var is not set
    if "CONSUL_HTTP_ADDR" in os.environ:
        del os.environ["CONSUL_HTTP_ADDR"]

    assert config.get_consul_url() == "http://consul.service.dc1.consul:8500"


def test_env_var_consul_url():
    """Test that the environment variable overrides the default."""
    os.environ["CONSUL_HTTP_ADDR"] = "http://10.0.0.1:8500"
    try:
        assert config.get_consul_url() == "http://10.0.0.1:8500"
    finally:
        del os.environ["CONSUL_HTTP_ADDR"]


def test_cli_arg_consul_url():
    """Test that the CLI argument overrides everything."""
    os.environ["CONSUL_HTTP_ADDR"] = "http://10.0.0.1:8500"
    try:
        assert config.get_consul_url("http://cli-override:8500") == "http://cli-override:8500"
    finally:
        del os.environ["CONSUL_HTTP_ADDR"]
53
scripts/cluster_status/tests/test_consul_client.py
Normal file
@@ -0,0 +1,53 @@
import pytest
from unittest.mock import patch, MagicMock
import consul_client


@patch("requests.get")
def test_get_cluster_services(mock_get):
    """Test fetching healthy services from Consul."""
    # Mock response for navidrome
    mock_navidrome = [
        {
            "Node": {"Node": "node1", "Address": "192.168.1.101"},
            "Service": {"Service": "navidrome", "Port": 4533, "ID": "navidrome-1"},
            "Checks": [{"Status": "passing"}]
        }
    ]

    m = MagicMock()
    m.json.return_value = mock_navidrome
    m.raise_for_status.return_value = None
    mock_get.return_value = m

    consul_url = "http://consul:8500"
    services = consul_client.get_cluster_services(consul_url)

    # Should find 1 node (primary)
    assert len(services) == 1
    assert services[0]["node"] == "node1"
    assert services[0]["status"] == "passing"


@patch("requests.get")
def test_get_cluster_services_with_errors(mock_get):
    """Test fetching services with detailed health check output."""
    mock_navidrome = [
        {
            "Node": {"Node": "node1", "Address": "192.168.1.101"},
            "Service": {"Service": "navidrome", "Port": 4533, "ID": "navidrome-1"},
            "Checks": [
                {"Status": "critical", "Output": "HTTP GET http://192.168.1.101:4533/app: 500 Internal Server Error"}
            ]
        }
    ]

    m = MagicMock()
    m.json.return_value = mock_navidrome
    m.raise_for_status.return_value = None
    mock_get.return_value = m

    services = consul_client.get_cluster_services("http://consul:8500")

    node1 = next(s for s in services if s["node"] == "node1")
    assert node1["status"] == "critical"
    assert "500 Internal Server Error" in node1["check_output"]
82
scripts/cluster_status/tests/test_formatter.py
Normal file
@@ -0,0 +1,82 @@
import pytest
import output_formatter


def test_format_cluster_summary():
    """Test the summary string generation."""
    cluster_data = {
        "health": "Healthy",
        "primary_count": 1,
        "nodes": [],
        "nomad_available": False
    }
    summary = output_formatter.format_summary(cluster_data)
    assert "Healthy" in summary
    assert "Primaries" in summary
    assert "WARNING: Nomad CLI unavailable" in summary


def test_format_node_table():
    """Test the table generation."""
    nodes = [
        {
            "node": "node1",
            "role": "primary",
            "status": "passing",
            "candidate": True,
            "uptime": "1h",
            "replication_lag": "N/A",
            "litefs_primary": True,
            "dbs": {"db1": {"txid": "1", "checksum": "abc"}}
        }
    ]
    table = output_formatter.format_node_table(nodes, use_color=False)
    assert "node1" in table
    assert "primary" in table
    assert "passing" in table
    assert "db1" in table
    assert "Cand" in table


def test_format_diagnostics():
    """Test the diagnostics section generation."""
    nodes = [
        {
            "node": "node3",
            "status": "critical",
            "check_output": "500 Internal Error",
            "litefs_error": "Connection Timeout"
        }
    ]
    diagnostics = output_formatter.format_diagnostics(nodes, use_color=False)
    assert "DIAGNOSTICS" in diagnostics
    assert "node3" in diagnostics
    assert "500 Internal Error" in diagnostics
    assert "Connection Timeout" in diagnostics


def test_format_diagnostics_empty():
    """Test that the diagnostics section is empty when no errors exist."""
    nodes = [
        {
            "node": "node1",
            "status": "passing",
            "litefs_error": None
        },
        {
            "node": "node2",
            "status": "standby",  # Should also be empty
            "litefs_error": None
        }
    ]
    diagnostics = output_formatter.format_diagnostics(nodes, use_color=False)
    assert diagnostics == ""


def test_format_node_table_status_colors():
    """Test that different statuses are handled."""
    nodes = [
        {"node": "n1", "role": "primary", "status": "passing", "litefs_primary": True},
        {"node": "n2", "role": "replica", "status": "standby", "litefs_primary": False},
        {"node": "n3", "role": "primary", "status": "unregistered", "litefs_primary": True},
    ]
    table = output_formatter.format_node_table(nodes, use_color=False)
    assert "passing" in table
    assert "standby" in table
    assert "unregistered" in table
84
scripts/cluster_status/tests/test_litefs_client.py
Normal file
@@ -0,0 +1,84 @@
import pytest
|
||||
from unittest.mock import patch, MagicMock
|
||||
import litefs_client
|
||||
|
||||
@patch("requests.get")
|
||||
def test_get_node_status_primary(mock_get):
|
||||
"""Test fetching LiteFS status for a primary node via /status."""
|
||||
mock_status = {
|
||||
"clusterID": "cid1",
|
||||
"primary": True,
|
||||
"uptime": 3600,
|
||||
"advertiseURL": "http://192.168.1.101:20202"
|
||||
}
|
||||
|
||||
m = MagicMock()
|
||||
m.status_code = 200
|
||||
m.json.return_value = mock_status
|
||||
mock_get.return_value = m
|
||||
|
||||
status = litefs_client.get_node_status("192.168.1.101")
|
||||
|
||||
assert status["is_primary"] is True
|
||||
assert status["uptime"] == 3600
|
||||
assert status["advertise_url"] == "http://192.168.1.101:20202"
|
||||
|
||||
@patch("requests.get")
|
||||
def test_get_node_status_fallback(mock_get):
|
||||
"""Test fetching LiteFS status via /debug/vars fallback."""
|
||||
def side_effect(url, **kwargs):
|
||||
m = MagicMock()
|
||||
if "/status" in url:
|
||||
m.status_code = 404
|
||||
return m
|
||||
elif "/debug/vars" in url:
|
||||
m.status_code = 200
|
||||
m.json.return_value = {
|
||||
"store": {"isPrimary": True}
|
||||
}
|
||||
return m
|
||||
return m
|
||||
|
||||
mock_get.side_effect = side_effect
|
||||
|
||||
status = litefs_client.get_node_status("192.168.1.101")
|
||||
|
||||
assert status["is_primary"] is True
|
||||
assert status["uptime"] == "N/A"
|
||||
assert status["advertise_url"] == "http://192.168.1.101:20202"
|
||||
|
||||
@patch("requests.get")
|
||||
def test_get_node_status_error(mock_get):
|
||||
"""Test fetching LiteFS status with a connection error."""
|
||||
mock_get.side_effect = Exception("Connection failed")
|
||||
|
||||
status = litefs_client.get_node_status("192.168.1.101")
|
||||
|
||||
assert "error" in status
|
||||
assert status["is_primary"] is False
|
||||
|
||||
@patch("nomad_client.exec_command")
|
||||
def test_get_node_status_nomad_exec(mock_exec):
|
||||
"""Test fetching LiteFS status via nomad alloc exec."""
|
||||
# Mock LiteFS status output (text format)
|
||||
mock_status_output = """
|
||||
Config:
|
||||
Path: /etc/litefs.yml
|
||||
...
|
||||
Status:
|
||||
Primary: true
|
||||
Uptime: 1h5m10s
|
||||
Replication Lag: 0s
|
||||
"""
|
||||
mock_exec.return_value = mock_status_output
|
||||
|
||||
# We need to mock requests.get to fail first
|
||||
with patch("requests.get") as mock_get:
|
||||
mock_get.side_effect = Exception("HTTP failed")
|
||||
|
||||
status = litefs_client.get_node_status("1.1.1.1", alloc_id="abc12345")
|
||||
|
||||
assert status["is_primary"] is True
|
||||
assert status["uptime"] == "1h5m10s"
|
||||
# Since it's primary, lag might not be shown or be 0
|
||||
assert status["replication_lag"] == "0s"
|
||||
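The nomad-exec test above implies that `litefs_client` can parse the plain-text output of `litefs status` (fields like `Primary:`, `Uptime:`, `Replication Lag:`). A minimal sketch of such a parser, assuming the field names shown in the mocked output — the function name and defaults here are hypothetical, not the actual `scripts/cluster_status` code:

```python
# Hypothetical parser for LiteFS text status output; field names taken
# from the mocked `litefs status` output in the test above.
def parse_litefs_text_status(output: str) -> dict:
    status = {"is_primary": False, "uptime": "N/A", "replication_lag": "N/A"}
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Primary:"):
            status["is_primary"] = line.split(":", 1)[1].strip().lower() == "true"
        elif line.startswith("Uptime:"):
            status["uptime"] = line.split(":", 1)[1].strip()
        elif line.startswith("Replication Lag:"):
            status["replication_lag"] = line.split(":", 1)[1].strip()
    return status


sample = """
Status:
  Primary: true
  Uptime: 1h5m10s
  Replication Lag: 0s
"""
print(parse_litefs_text_status(sample))
# → {'is_primary': True, 'uptime': '1h5m10s', 'replication_lag': '0s'}
```

Keeping the parser a pure string-to-dict function is what lets the test mock `nomad_client.exec_command` and avoid a live cluster.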
19
scripts/cluster_status/tests/test_main.py
Normal file
@@ -0,0 +1,19 @@
import pytest
from unittest.mock import patch, MagicMock
import cli
import sys


def test_arg_parsing_default():
    """Test that default arguments are parsed correctly."""
    with patch.object(sys, 'argv', ['cli.py']):
        args = cli.parse_args()
        assert args.consul_url is None  # Should use config default later
        assert args.no_color is False


def test_arg_parsing_custom():
    """Test that custom arguments are parsed correctly."""
    with patch.object(sys, 'argv', ['cli.py', '--consul-url', 'http://custom:8500', '--no-color', '--restart', 'node1']):
        args = cli.parse_args()
        assert args.consul_url == 'http://custom:8500'
        assert args.no_color is True
        assert args.restart == 'node1'
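A `cli.parse_args` satisfying these two tests could be built with stdlib `argparse`; the sketch below is an assumption about the implementation (flag names come from the tests, help strings and defaults are guesses):

```python
# Hypothetical sketch of cli.parse_args; flag names mirror the tests,
# everything else (help text, metavar) is assumed.
import argparse
import sys


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Cluster status for navidrome-litefs")
    parser.add_argument("--consul-url", default=None,
                        help="Consul HTTP address (config default applied later)")
    parser.add_argument("--no-color", action="store_true",
                        help="Disable ANSI colors in the status table")
    parser.add_argument("--restart", metavar="NODE", default=None,
                        help="Restart the allocation running on the given node")
    return parser.parse_args(argv if argv is not None else sys.argv[1:])


args = parse_args(["--consul-url", "http://custom:8500", "--no-color", "--restart", "node1"])
print(args.consul_url, args.no_color, args.restart)
# → http://custom:8500 True node1
```

Note that `argparse` turns `--consul-url` into the attribute `consul_url` automatically, which is why the tests read `args.consul_url`.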
135
scripts/cluster_status/tests/test_nomad_client.py
Normal file
@@ -0,0 +1,135 @@
import pytest
from unittest.mock import patch, MagicMock
import nomad_client
import subprocess


@patch("subprocess.run")
@patch("nomad_client.get_node_map")
def test_get_allocation_id(mock_node_map, mock_run):
    """Test getting allocation ID for a node."""
    mock_node_map.return_value = {"node_id1": "node1"}

    # Mock 'nomad job status navidrome-litefs' output
    mock_job_status = MagicMock()
    mock_job_status.stdout = """
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
abc12345  node_id1  navidrome   1        run      running  1h ago   1h ago
"""

    # Mock 'nomad alloc status abc12345' output
    mock_alloc_status = MagicMock()
    mock_alloc_status.stdout = "ID = abc12345-full-id"

    mock_run.side_effect = [mock_job_status, mock_alloc_status]

    alloc_id = nomad_client.get_allocation_id("node1", "navidrome-litefs")
    assert alloc_id == "abc12345-full-id"


@patch("subprocess.run")
def test_get_logs(mock_run):
    """Test fetching logs for an allocation."""
    mock_stderr = "Error: database is locked\nSome other error"
    m = MagicMock()
    m.stdout = mock_stderr
    m.returncode = 0
    mock_run.return_value = m

    logs = nomad_client.get_allocation_logs("abc12345", tail=20)
    assert "database is locked" in logs
    # It should have tried with -task navidrome first
    mock_run.assert_any_call(
        ["nomad", "alloc", "logs", "-stderr", "-task", "navidrome", "-n", "20", "abc12345"],
        capture_output=True, text=True, check=True
    )


@patch("subprocess.run")
def test_restart_allocation(mock_run):
    """Test restarting an allocation."""
    m = MagicMock()
    m.returncode = 0
    mock_run.return_value = m

    success = nomad_client.restart_allocation("abc12345")
    assert success is True
    mock_run.assert_called_with(
        ["nomad", "alloc", "restart", "abc12345"],
        capture_output=True, text=True, check=True
    )


@patch("subprocess.run")
def test_exec_command(mock_run):
    """Test executing a command in an allocation."""
    m = MagicMock()
    m.stdout = "Command output"
    m.returncode = 0
    mock_run.return_value = m

    output = nomad_client.exec_command("abc12345", ["ls", "/data"])
    assert output == "Command output"
    mock_run.assert_called_with(
        ["nomad", "alloc", "exec", "-task", "navidrome", "abc12345", "ls", "/data"],
        capture_output=True, text=True, check=True
    )


@patch("subprocess.run")
def test_exec_command_failure(mock_run):
    """Test executing a command handles failure gracefully."""
    mock_run.side_effect = subprocess.CalledProcessError(1, "nomad", stderr="Nomad error")

    output = nomad_client.exec_command("abc12345", ["ls", "/data"])
    assert "Nomad Error" in output
    assert "Nomad error" not in output  # str(exc) may not include stderr, depending on the Python version


@patch("subprocess.run")
def test_get_node_map_failure(mock_run):
    """Test get_node_map handles failure."""
    mock_run.side_effect = FileNotFoundError("No such file")

    # It should not raise
    node_map = nomad_client.get_node_map()
    assert node_map == {}


@patch("subprocess.run")
def test_get_job_allocations(mock_run):
    """Test getting all allocations for a job with their IPs."""
    # Mock 'nomad job status navidrome-litefs'
    mock_job_status = MagicMock()
    mock_job_status.stdout = """
Allocations
ID        Node ID  Node Name  Status   Created
abc12345  node1    host1      running  1h ago
def67890  node2    host2      running  1h ago
"""

    # Mock 'nomad alloc status' for each alloc
    mock_alloc1 = MagicMock()
    mock_alloc1.stdout = """
ID = abc12345
Node Name = host1
Allocation Addresses:
Label    Dynamic  Address
*http    yes      1.1.1.1:4533 -> 4533
*litefs  yes      1.1.1.1:20202 -> 20202
Task Events:
Started At = 2026-02-09T14:00:00Z
"""
    mock_alloc2 = MagicMock()
    mock_alloc2.stdout = """
ID = def67890
Node Name = host2
Allocation Addresses:
Label    Dynamic  Address
*http    yes      2.2.2.2:4533 -> 4533
*litefs  yes      2.2.2.2:20202 -> 20202
Task Events:
Started At = 2026-02-09T14:00:00Z
"""

    mock_run.side_effect = [mock_job_status, mock_alloc1, mock_alloc2]

    allocs = nomad_client.get_job_allocations("navidrome-litefs")
    assert len(allocs) == 2
    assert allocs[0]["ip"] == "1.1.1.1"
    assert "uptime" in allocs[0]
    assert allocs[0]["uptime"] != "N/A"
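The `test_get_job_allocations` fixture suggests that `nomad_client` scrapes each allocation's IP out of the "Allocation Addresses" table printed by `nomad alloc status`. A sketch of that extraction step under the same assumed output format — the function name and regex are hypothetical, not the real module:

```python
# Hypothetical helper: pull an allocation's IP for a given port label
# out of `nomad alloc status` text, matching lines like
#   *http    yes      1.1.1.1:4533 -> 4533
import re
from typing import Optional


def extract_alloc_ip(alloc_status: str, label: str = "http") -> Optional[str]:
    pattern = re.compile(
        r"\*%s\s+\S+\s+(\d{1,3}(?:\.\d{1,3}){3}):\d+" % re.escape(label)
    )
    for line in alloc_status.splitlines():
        m = pattern.search(line)
        if m:
            return m.group(1)
    return None


sample = """
Allocation Addresses:
Label    Dynamic  Address
*http    yes      1.1.1.1:4533 -> 4533
*litefs  yes      1.1.1.1:20202 -> 20202
"""
print(extract_alloc_ip(sample))
# → 1.1.1.1
```

Scraping CLI text like this is fragile across Nomad versions; `nomad alloc status -json` would be the sturdier source if the module ever needs it.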