feat: Add automated LiteFS backups and GitHub deployment workflow
All checks were successful
Build and Push Docker Image / build-and-push (push) Successful in 3m52s
@@ -1,30 +0,0 @@
# Plan: Cluster Diagnosis and Script Enhancement (`diagnose_and_enhance`)

## Phase 1: Enhanced Diagnostics (Consul) [x] [checkpoint: a686c5b]

- [x] Task: Update `consul_client.py` to fetch detailed health check output
    - [x] Write tests for fetching the `Output` field from Consul checks
    - [x] Implement logic to extract and store the `Output` (error message)
- [x] Task: Update aggregator and formatter to display Consul errors
    - [x] Update aggregation logic to include `consul_error`
    - [x] Update the table formatter to indicate an error (e.g., a flag or color)
    - [x] Add a "Diagnostics" section to the output to print full error details
- [x] Task: Conductor - User Manual Verification 'Phase 1: Enhanced Diagnostics (Consul)' (Protocol in workflow.md)
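The `Output` extraction in Phase 1 can be sketched against Consul's `/v1/health/checks/<service>` endpoint. `consul_client.py` itself is not shown in this diff, so the function names, the service-name parameter, and the local agent address below are assumptions:

```python
import json
import urllib.request

CONSUL_ADDR = "http://127.0.0.1:8500"  # assumed local Consul agent address

def extract_failing(checks: list[dict]) -> list[dict]:
    """Keep only non-passing checks, carrying the Output (error message) along."""
    return [
        {"node": c.get("Node", ""), "status": c["Status"],
         "consul_error": c.get("Output", "")}
        for c in checks
        if c["Status"] != "passing"
    ]

def get_failing_checks(service: str) -> list[dict]:
    """Fetch all health checks for a service and return only the failing ones."""
    with urllib.request.urlopen(f"{CONSUL_ADDR}/v1/health/checks/{service}") as resp:
        return extract_failing(json.load(resp))
```

Splitting the parse step out of the HTTP call keeps the `Output` handling testable without a running agent.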

## Phase 2: Nomad Integration and Logs [x] [checkpoint: 6d77729]

- [x] Task: Implement `nomad_client.py` wrapper
    - [x] Write tests for `get_allocation_logs`, `get_node_status`, and `restart_allocation` (mocking subprocess)
    - [x] Implement `subprocess.run(["nomad", ...])` logic to fetch logs and restart allocations
- [x] Task: Integrate Nomad logs into diagnosis
    - [x] Update aggregator to call the Nomad client for critical nodes
    - [x] Update the "Diagnostics" section to display the last 20 lines of stderr
- [x] Task: Conductor - User Manual Verification 'Phase 2: Nomad Integration and Logs' (Protocol in workflow.md)
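The Phase 2 log fetch wraps the Nomad CLI. A minimal sketch of `get_allocation_logs` follows; the `runner` parameter is an assumption added here so the subprocess call can be stubbed in tests, not part of the real `nomad_client.py` interface:

```python
import subprocess

def get_allocation_logs(alloc_id: str, n_lines: int = 20,
                        runner=subprocess.run) -> str:
    """Tail an allocation's stderr via `nomad alloc logs`."""
    try:
        result = runner(
            ["nomad", "alloc", "logs", "-stderr", "-tail", "-n", str(n_lines), alloc_id],
            capture_output=True, text=True, timeout=30,
        )
    except FileNotFoundError:
        return "nomad error: CLI not found on PATH"
    if result.returncode != 0:
        return f"nomad error: {result.stderr.strip()}"
    return result.stdout
```

Returning an error string instead of raising lets the aggregator drop the text straight into the "Diagnostics" section.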
## Phase 3: Advanced LiteFS Status [ ]

- [ ] Task: Implement `litefs_status` via `nomad alloc exec`
    - [ ] Write tests for executing remote commands via Nomad
    - [ ] Update `litefs_client.py` to fall back to `nomad alloc exec` if HTTP fails
    - [ ] Parse `litefs status` output (text/JSON) to extract uptime and replication lag
- [ ] Task: Final Polish and Diagnosis Run
    - [ ] Ensure all pieces work together
    - [ ] Run the script to diagnose `odroid8`
- [ ] Task: Conductor - User Manual Verification 'Phase 3: Advanced LiteFS Status' (Protocol in workflow.md)
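The Phase 3 fallback amounts to "HTTP first, `nomad alloc exec` second". In this sketch the `:20202` debug port and the `exec_fallback` hook (injected so the CLI path can be stubbed) are assumptions, and parsing of the `litefs status` / `/debug/vars` payload is left open, as in the plan:

```python
import json
import subprocess
import urllib.error
import urllib.request

def _exec_status(alloc_id: str) -> dict:
    """Fallback path: run `litefs status` inside the allocation via the Nomad CLI."""
    proc = subprocess.run(["nomad", "alloc", "exec", alloc_id, "litefs", "status"],
                          capture_output=True, text=True, timeout=30)
    return {"source": "exec", "raw": proc.stdout}

def litefs_status(node_addr: str, alloc_id: str, exec_fallback=_exec_status) -> dict:
    """Prefer the LiteFS HTTP debug endpoint; fall back to `nomad alloc exec`."""
    try:
        with urllib.request.urlopen(f"http://{node_addr}:20202/debug/vars",
                                    timeout=5) as resp:
            return {"source": "http", "vars": json.load(resp)}
    except (urllib.error.URLError, OSError):
        return exec_fallback(alloc_id)
```

Tagging each result with its `source` makes it visible in the diagnostics output which path produced the data.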
@@ -1,22 +0,0 @@
# Plan: Fix LiteFS Configuration and Process Management (`fix_litefs_config`)

## Phase 1: Configuration and Image Structure [ ]

- [x] Task: Update `litefs.yml` to include the `exec` block (396dfeb)
- [x] Task: Update `Dockerfile` to use LiteFS as the supervisor (`ENTRYPOINT ["litefs", "mount"]`) (ef91b8e)
- [x] Task: Update `navidrome-litefs-v2.nomad` with corrected storage paths (`ND_DATAFOLDER`, `ND_CACHEFOLDER`, `ND_BACKUP_PATH`) (5cbb657)
- [ ] Task: Conductor - User Manual Verification 'Phase 1: Configuration and Image Structure' (Protocol in workflow.md)

## Phase 2: Entrypoint and Registration Logic [x] [checkpoint: 9cd5455]

- [x] Task: Refactor `entrypoint.sh` to handle leadership-aware process management (9cd5455)
    - [x] Integrate Consul registration logic (from `register.sh`)
    - [x] Implement a loop to start/stop Navidrome based on the existence of `/data/.primary`
    - [x] Ensure proper signal handling for Navidrome shutdown
- [x] Task: Clean up redundant scripts (e.g., `register.sh` if fully integrated) (9cd5455)
- [ ] Task: Conductor - User Manual Verification 'Phase 2: Entrypoint and Registration Logic' (Protocol in workflow.md)

## Phase 3: Deployment and Failover Verification [ ]

- [ ] Task: Build and push the updated Docker image via Gitea Actions (if possible) or a manual trigger
- [ ] Task: Deploy the updated Nomad job
- [ ] Task: Verify cluster health and process distribution using the `cluster_status` script
- [ ] Task: Perform a manual failover (stop the primary allocation) and verify that Navidrome migrates correctly
- [ ] Task: Conductor - User Manual Verification 'Phase 3: Deployment and Failover Verification' (Protocol in workflow.md)
@@ -1,5 +0,0 @@
# Track fix_navidrome_paths_20260209 Context

- [Specification](./spec.md)
- [Implementation Plan](./plan.md)
- [Metadata](./metadata.json)
@@ -1,8 +0,0 @@
{
  "track_id": "fix_navidrome_paths_20260209",
  "type": "bug",
  "status": "new",
  "created_at": "2026-02-09T14:30:00Z",
  "updated_at": "2026-02-09T14:30:00Z",
  "description": "Fix Navidrome database location to ensure it uses LiteFS mount and resolve process path conflicts."
}
@@ -1,17 +0,0 @@
# Plan: Correct Navidrome Database and Plugins Location (`fix_navidrome_paths`)

## Phase 1: Configuration Updates [x]

- [x] Task: Update `navidrome-litefs-v2.nomad` with corrected paths (76398de)
- [x] Task: Update `entrypoint.sh` to handle the plugins folder and environment cleanup (decb9f5)
- [x] Task: Conductor - User Manual Verification 'Phase 1: Configuration Updates' (Protocol in workflow.md)

## Phase 2: Build and Deployment [x]

- [x] Task: Commit changes and push to Gitea to trigger the build (045fc6e)
- [x] Task: Monitor Gitea build completion (Build #26)
- [x] Task: Deploy the updated Nomad job (Job Version 6)
- [x] Task: Conductor - User Manual Verification 'Phase 2: Build and Deployment' (Protocol in workflow.md)

## Phase 3: Final Verification [x]

- [x] Task: Verify the database path via `lsof` on the Primary node (Verified: /data/navidrome.db)
- [x] Task: Verify replication health using the `cluster_status` script (Verified: all nodes in sync)
- [x] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)
@@ -1,25 +0,0 @@
# Specification: Correct Navidrome Database and Plugins Location (`fix_navidrome_paths`)

## Overview

Force Navidrome to use the `/data` LiteFS mount for its SQLite database by setting `ND_DATAFOLDER` to `/data`. To avoid the "Operation not permitted" error caused by LiteFS's restriction on directory creation, redirect the Navidrome plugins folder to persistent shared storage.

## Functional Requirements

- **Nomad Job Configuration (`navidrome-litefs-v2.nomad`):**
    - Set `ND_DATAFOLDER="/data"`. This forces Navidrome to create and use `navidrome.db` on the LiteFS mount.
    - Set `ND_PLUGINSFOLDER="/shared_data/plugins"`. This prevents Navidrome from attempting to create a `plugins` directory in the read-only/virtual `/data` mount.
    - Keep `ND_CACHEFOLDER` and `ND_BACKUP_PATH` pointing at `/shared_data` subdirectories.
- **Entrypoint Logic (`entrypoint.sh`):**
    - Ensure it creates `/shared_data/plugins` if it doesn't exist.
    - Remove the explicit `export ND_DATABASE_PATH` if it conflicts with the new `ND_DATAFOLDER` logic, or keep it as an explicit override.
- **Verification:**
    - Confirm via `lsof` that Navidrome is using `/data/navidrome.db`.
    - Confirm that LiteFS `/debug/vars` now reports the database in its active set.

## Non-Functional Requirements

- **Persistence:** Ensure all non-database files (plugins, cache, backups) are stored on the shared host mount (`/shared_data`) to survive container restarts and migrations.

## Acceptance Criteria

- [ ] Navidrome successfully starts with `/data` as its data folder.
- [ ] No "Operation not permitted" errors occur during startup.
- [ ] `lsof` confirms that `/data/navidrome.db` is open by the Navidrome process.
- [ ] The LiteFS `txid` increases on the Primary and replicates to Replicas when Navidrome writes to the DB.
@@ -1,26 +0,0 @@
# Plan: Fix Odroid8 and Script Robustness (`fix_odroid8_and_script`)

## Phase 1: Script Robustness [x] [checkpoint: 860000b]

- [x] Task: Update `nomad_client.py` to handle subprocess errors gracefully
    - [x] Write tests for handling Nomad CLI absence/failure
    - [x] Update the implementation to return descriptive error objects or `None` without crashing
- [x] Task: Update aggregator and formatter to handle Nomad errors
    - [x] Update `cluster_aggregator.py` to gracefully skip Nomad calls if they fail
    - [x] Update `output_formatter.py` to display "Nomad Error" in the relevant cells
    - [x] Add a global "Nomad Connectivity Warning" to the summary
- [x] Task: Conductor - User Manual Verification 'Phase 1: Script Robustness' (Protocol in workflow.md)
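Phase 1's "descriptive error objects without crashing" can be sketched as a small result type around the CLI call. `NomadResult` and `run_nomad` are illustrative names, not the actual `nomad_client.py` interface:

```python
import subprocess
from dataclasses import dataclass

@dataclass
class NomadResult:
    ok: bool
    data: str = ""
    error: str = ""

def run_nomad(args: list[str]) -> NomadResult:
    """Run a Nomad CLI command, returning an error object instead of raising."""
    try:
        proc = subprocess.run(["nomad", *args],
                              capture_output=True, text=True, timeout=30)
    except (FileNotFoundError, PermissionError):
        return NomadResult(ok=False, error="nomad CLI not found on PATH")
    except subprocess.TimeoutExpired:
        return NomadResult(ok=False, error="nomad command timed out")
    if proc.returncode != 0:
        return NomadResult(ok=False, error=proc.stderr.strip())
    return NomadResult(ok=True, data=proc.stdout)
```

Callers branch on `ok`, so a missing or broken CLI degrades to a "Nomad Error" cell rather than a traceback.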
## Phase 2: Odroid8 Recovery [ ]

- [x] Task: Identify and verify the `odroid8` LiteFS data path
    - [x] Run `nomad alloc status` to find the volume mount for `odroid8`
    - [x] Provide the user with the exact host path to the LiteFS data
- [x] Task: Guide the user through manual cleanup
    - [x] Provide steps to stop the allocation
    - [x] Provide the `rm` command to clear the LiteFS metadata
    - [x] Provide steps to restart and verify the node
- [~] Task: Conductor - User Manual Verification 'Phase 2: Odroid8 Recovery' (Protocol in workflow.md)

## Phase 3: Final Verification [x]

- [x] Task: Final verification run of the script
- [x] Task: Verify cluster health in Consul and the LiteFS API
- [x] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)
@@ -1,22 +0,0 @@
# Plan: Implement TTL Heartbeat Service Registration (`implement_ttl_heartbeat`)

## Phase 1: Container Environment Preparation [x] [checkpoint: 51b8fce]

- [x] Task: Update `Dockerfile` to install `curl` and `jq` (f7fe258)
- [x] Task: Verify `litefs.yml` points to `entrypoint.sh` (should already be correct) (verified)
- [x] Task: Conductor - User Manual Verification 'Phase 1: Container Environment Preparation' (Protocol in workflow.md)

## Phase 2: Script Implementation [x] [checkpoint: 139016f]

- [x] Task: Refactor `entrypoint.sh` with the TTL heartbeat logic (d977301)
    - [x] Implement `register_service` with a TTL check definition
    - [x] Implement the `pass_ttl` loop
    - [x] Implement robust `stop_app` and signal trapping
    - [x] Ensure correct Primary/Replica detection logic (LiteFS 0.5: Primary = no `.primary` file)
- [x] Task: Conductor - User Manual Verification 'Phase 2: Script Implementation' (Protocol in workflow.md)
## Phase 3: Deployment and Verification [ ]

- [ ] Task: Commit changes and push to Gitea to trigger the build
- [ ] Task: Monitor Gitea build completion
- [ ] Task: Deploy the updated Nomad job (forcing an update if necessary)
- [ ] Task: Verify the "clean" state in Consul (only one primary registered)
- [ ] Task: Verify failover/stop behavior (immediate deregistration vs. TTL expiry)
- [ ] Task: Conductor - User Manual Verification 'Phase 3: Deployment and Verification' (Protocol in workflow.md)
@@ -1,23 +0,0 @@
# Plan: Update Monitor Discovery Logic (`update_monitor_discovery`)

## Phase 1: Nomad Discovery Enhancement [x] [checkpoint: 353683e]

- [x] Task: Update `nomad_client.py` to fetch job allocations with IPs (353683e)
    - [x] Write tests for parsing allocation IPs from `nomad job status` or `nomad alloc status`
    - [x] Implement `get_job_allocations(job_id)` returning a list of dicts (id, node, ip)
- [x] Task: Conductor - User Manual Verification 'Phase 1: Nomad Discovery Enhancement' (Protocol in workflow.md)
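The `get_job_allocations(job_id)` shape from Phase 1 can be sketched as a CLI call plus a parse step. This assumes a Nomad version whose `nomad job allocs -json` emits the allocation list; the exact JSON field paths vary by version, so the ones used here are assumptions to be checked against real output:

```python
import json
import subprocess

def parse_allocations(allocs: list[dict]) -> list[dict]:
    """Reduce Nomad's allocation JSON to (id, node, ip) for running allocs."""
    rows = []
    for a in allocs:
        if a.get("ClientStatus") != "running":
            continue
        shared = (a.get("AllocatedResources") or {}).get("Shared") or {}
        networks = shared.get("Networks") or []
        ip = networks[0].get("IP", "") if networks else ""
        rows.append({"id": a["ID"], "node": a.get("NodeName", ""), "ip": ip})
    return rows

def get_job_allocations(job_id: str) -> list[dict]:
    proc = subprocess.run(["nomad", "job", "allocs", "-json", job_id],
                          capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return []
    return parse_allocations(json.loads(proc.stdout))
```

With discovery driven from this list, the aggregator only has to supplement each row with LiteFS and Consul data, as the Phase 2 tasks describe.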
## Phase 2: Aggregator Refactor [x] [checkpoint: 655a9b2]

- [x] Task: Refactor `cluster_aggregator.py` to drive discovery via Nomad (655a9b2)
    - [x] Update `get_cluster_status` to call `nomad_client.get_job_allocations` first
    - [x] Update the loop to iterate over allocations and supplement with LiteFS and Consul data
- [x] Task: Update `consul_client.py` to fetch all services once and allow lookup by IP/ID (655a9b2)
- [x] Task: Update tests for the new discovery flow (655a9b2)
- [x] Task: Conductor - User Manual Verification 'Phase 2: Aggregator Refactor' (Protocol in workflow.md)

## Phase 3: UI and Health Logic [x] [checkpoint: 21e9c3d]

- [x] Task: Update `output_formatter.py` for "Standby" nodes (21e9c3d)
    - [x] Update table formatting to handle missing Consul status for replicas
- [x] Task: Update the cluster health calculation (21e9c3d)
    - [x] "Healthy" = 1 Primary (Consul passing) + N Replicas (LiteFS connected)
- [x] Task: Final verification run (21e9c3d)
- [x] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)
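The Phase 3 health rule, "1 Primary (Consul passing) + N Replicas (LiteFS connected)", reduces to a small predicate. The `role`, `consul_status`, and `litefs_connected` keys are assumed names for illustration, not the aggregator's actual schema:

```python
def cluster_health(nodes: list[dict]) -> str:
    """Healthy = exactly one primary that passes in Consul, and every replica
    connected at the LiteFS level (Consul status is not required for standbys)."""
    primaries = [n for n in nodes if n.get("role") == "primary"]
    replicas = [n for n in nodes if n.get("role") == "replica"]
    ok = (len(primaries) == 1
          and primaries[0].get("consul_status") == "passing"
          and all(r.get("litefs_connected") for r in replicas))
    return "Healthy" if ok else "Degraded"
```

Requiring exactly one primary also catches the split-brain case where two nodes register as primary at once.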