Compare commits
8 Commits
f8a2a587d5 ... 3c5968690c
| Author | SHA1 | Date |
|---|---|---|
| | 3c5968690c | |
| | 139016f121 | |
| | d97730174d | |
| | 27b10a39b8 | |
| | 51b8fce10b | |
| | f7fe258480 | |
| | 5a557145ac | |
| | 01fa06e7dc | |
@@ -6,7 +6,7 @@ FROM ghcr.io/navidrome/navidrome:latest
 
 # Install dependencies
 USER root
-RUN apk add --no-cache fuse3 ca-certificates bash curl
+RUN apk add --no-cache fuse3 ca-certificates bash curl jq
 
 # Copy LiteFS binary
 COPY --from=litefs /usr/local/bin/litefs /usr/local/bin/litefs
5  conductor/archive/fix_litefs_config_20260208/index.md  Normal file
@@ -0,0 +1,5 @@
+# Track fix_litefs_config_20260208 Context
+
+- [Specification](./spec.md)
+- [Implementation Plan](./plan.md)
+- [Metadata](./metadata.json)
@@ -0,0 +1,8 @@
+{
+  "track_id": "fix_litefs_config_20260208",
+  "type": "feature",
+  "status": "new",
+  "created_at": "2026-02-08T18:00:00Z",
+  "updated_at": "2026-02-08T18:00:00Z",
+  "description": "Fix LiteFS configuration to use 'exec' for Navidrome and ensure it only runs on the Primary node. Also fix DB path configuration."
+}
22  conductor/archive/fix_litefs_config_20260208/plan.md  Normal file
@@ -0,0 +1,22 @@
+# Plan: Fix LiteFS Configuration and Process Management (`fix_litefs_config`)
+
+## Phase 1: Configuration and Image Structure [ ]
+- [x] Task: Update `litefs.yml` to include the `exec` block (396dfeb)
+- [x] Task: Update `Dockerfile` to use LiteFS as the supervisor (`ENTRYPOINT ["litefs", "mount"]`) (ef91b8e)
+- [x] Task: Update `navidrome-litefs-v2.nomad` with corrected storage paths (`ND_DATAFOLDER`, `ND_CACHEFOLDER`, `ND_BACKUP_PATH`) (5cbb657)
+- [ ] Task: Conductor - User Manual Verification 'Phase 1: Configuration and Image Structure' (Protocol in workflow.md)
+
+## Phase 2: Entrypoint and Registration Logic [x] [checkpoint: 9cd5455]
+- [x] Task: Refactor `entrypoint.sh` to handle leadership-aware process management (9cd5455)
+    - [x] Integrate Consul registration logic (from `register.sh`)
+    - [x] Implement loop to start/stop Navidrome based on `/data/.primary` existence
+    - [x] Ensure proper signal handling for Navidrome shutdown
+- [x] Task: Clean up redundant scripts (e.g., `register.sh` if fully integrated) (9cd5455)
+- [ ] Task: Conductor - User Manual Verification 'Phase 2: Entrypoint and Registration Logic' (Protocol in workflow.md)
+
+## Phase 3: Deployment and Failover Verification [ ]
+- [~] Task: Build and push the updated Docker image via Gitea Actions (if possible) or manual trigger
+- [~] Task: Deploy the updated Nomad job
+- [ ] Task: Verify cluster health and process distribution using `cluster_status` script
+- [ ] Task: Perform a manual failover (stop primary allocation) and verify Navidrome migrates correctly
+- [ ] Task: Conductor - User Manual Verification 'Phase 3: Deployment and Failover Verification' (Protocol in workflow.md)
38  conductor/archive/fix_litefs_config_20260208/spec.md  Normal file
@@ -0,0 +1,38 @@
+# Specification: Fix LiteFS Configuration and Process Management (`fix_litefs_config`)
+
+## Overview
+Reconfigure the Navidrome/LiteFS process management to ensure Navidrome and its Consul service registration only occur on the Primary node. This will be achieved by leveraging the LiteFS `exec` block and updating the `entrypoint.sh` logic. Additionally, correct the Navidrome database and storage paths to properly utilize the LiteFS replicated mount.
+
+## Functional Requirements
+- **LiteFS Configuration (`litefs.yml`):**
+    - Enable the `exec` block to trigger `/usr/local/bin/entrypoint.sh`.
+    - This allows LiteFS to manage the lifecycle of the application.
+- **Entrypoint Logic (`entrypoint.sh`):**
+    - Implement a supervision loop that monitors leadership via the `/data/.primary` file.
+    - **On Primary:**
+        - Register the node as the `navidrome` (primary) service in Consul.
+        - Start the Navidrome process.
+    - **On Replica:**
+        - Ensure Navidrome is NOT running.
+        - Deregister the `navidrome` primary service if previously registered.
+        - (Optional) Register as a replica service or simply wait.
+    - **On Transition:** Handle graceful shutdown of Navidrome if the node loses leadership.
+- **Storage and Path Configuration (`navidrome-litefs-v2.nomad`):**
+    - Set `ND_DATAFOLDER` to `/data` (the LiteFS FUSE mount).
+    - Set `ND_CACHEFOLDER` to `/shared_data/cache` (shared persistent storage).
+    - Set `ND_BACKUP_PATH` to `/shared_data/backup` (shared persistent storage).
+- **Dockerfile Updates:**
+    - Update `ENTRYPOINT` to `["litefs", "mount"]` to allow LiteFS to act as the supervisor.
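The supervision loop this specification calls for can be sketched in a few lines of bash. This is an illustrative sketch only, not the repository's actual `entrypoint.sh`; the `current_role` helper and the `DB_LOCK_FILE` override are hypothetical names introduced here to make the leadership test explicit.

```shell
#!/bin/bash
# Illustrative sketch of the leadership-aware supervision loop.
# DB_LOCK_FILE defaults to the LiteFS marker path; overridable for testing.
DB_LOCK_FILE="${DB_LOCK_FILE:-/data/.primary}"

# In LiteFS 0.5 the .primary file exists ONLY on replicas, so the
# primary is the node where the file is absent.
current_role() {
  if [ ! -f "$DB_LOCK_FILE" ]; then
    echo "primary"
  else
    echo "replica"
  fi
}

# Skeleton of the loop; the real script starts/stops Navidrome and
# talks to Consul where the ':' placeholders are.
supervise() {
  while true; do
    if [ "$(current_role)" = "primary" ]; then
      : # start Navidrome if not running; register in Consul
    else
      : # stop Navidrome if running; deregister from Consul
    fi
    sleep 5
  done
}
```

Polling the marker file every few seconds is what makes failover automatic: no external orchestration is needed beyond LiteFS moving the `.primary` file.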
+
+## Non-Functional Requirements
+- **Robustness:** Use a simple bash loop for process management to avoid extra dependencies.
+- **Signal Handling:** Ensure signals (SIGTERM) are correctly forwarded to Navidrome for graceful shutdown.
+
+## Acceptance Criteria
+- [ ] Navidrome process runs ONLY on the Primary node.
+- [ ] Consul service `navidrome` correctly points to the current Primary.
+- [ ] Navidrome database (`navidrome.db`) is confirmed to be on the `/data` mount.
+- [ ] Cluster failover correctly stops Navidrome on the old primary and starts it on the new one.
+
+## Out of Scope
+- Implementation of complex init systems like `tini` (bash loop selected by user).
@@ -8,6 +8,7 @@
 ## Storage & Database
 - **SQLite:** The primary relational database used by Navidrome for metadata and state.
 - **LiteFS:** A FUSE-based filesystem that provides synchronous replication of the SQLite database across the cluster.
+- **Process Management:** LiteFS-supervised with a leadership-aware entrypoint script ensuring Navidrome only runs on the primary node.
 
 ## Automation & Delivery
 - **Gitea Actions:** Automates the multi-arch (AMD64/ARM64) building and pushing of the custom supervised container image.
22  conductor/tracks/implement_ttl_heartbeat_20260208/plan.md  Normal file
@@ -0,0 +1,22 @@
+# Plan: Implement TTL Heartbeat Service Registration (`implement_ttl_heartbeat`)
+
+## Phase 1: Container Environment Preparation [x] [checkpoint: 51b8fce]
+- [x] Task: Update `Dockerfile` to install `curl` and `jq` (f7fe258)
+- [x] Task: Verify `litefs.yml` points to `entrypoint.sh` (should already be correct) (verified)
+- [x] Task: Conductor - User Manual Verification 'Phase 1: Container Environment Preparation' (Protocol in workflow.md)
+
+## Phase 2: Script Implementation [x] [checkpoint: 139016f]
+- [x] Task: Refactor `entrypoint.sh` with the TTL Heartbeat logic (d977301)
+    - [x] Implement `register_service` with TTL check definition
+    - [x] Implement `pass_ttl` loop
+    - [x] Implement robust `stop_app` and signal trapping
+    - [x] Ensure correct Primary/Replica detection logic (LiteFS 0.5: Primary = no `.primary` file)
+- [x] Task: Conductor - User Manual Verification 'Phase 2: Script Implementation' (Protocol in workflow.md)
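The TTL pattern behind `register_service` and `pass_ttl` can be sketched as below. The Consul agent HTTP endpoints (`/v1/agent/service/register` and `/v1/agent/check/pass/service:<service-id>`) are standard Consul API; the service ID and payload here are simplified placeholders, not the values used by the real script.

```shell
#!/bin/bash
# Sketch of Consul TTL-check registration and heartbeat.
CONSUL_HTTP_ADDR="${CONSUL_HTTP_ADDR:-http://localhost:8500}"
SERVICE_ID="navidrome-example"   # placeholder ID for illustration

# Registration payload: a TTL check means Consul marks the service
# critical unless we actively report in every 15s, and prunes the
# registration one minute after it goes critical.
ttl_payload() {
  cat <<EOF
{
  "ID": "${SERVICE_ID}",
  "Name": "navidrome",
  "Port": 4533,
  "Check": {
    "TTL": "15s",
    "DeregisterCriticalServiceAfter": "1m"
  }
}
EOF
}

register_service() {
  curl -s -X PUT "${CONSUL_HTTP_ADDR}/v1/agent/service/register" \
       -d "$(ttl_payload)"
}

# Heartbeat: for a check embedded in a service registration, Consul
# derives the check ID "service:<service-id>".
pass_ttl() {
  curl -s -X PUT "${CONSUL_HTTP_ADDR}/v1/agent/check/pass/service:${SERVICE_ID}"
}
```

The combination gives the "immediate deregistration vs TTL expiry" behavior Phase 3 verifies: a clean shutdown deregisters explicitly, while a crashed node stops heartbeating and is pruned after the TTL plus the deregister window.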
+
+## Phase 3: Deployment and Verification [ ]
+- [ ] Task: Commit changes and push to Gitea to trigger build
+- [ ] Task: Monitor Gitea build completion
+- [ ] Task: Deploy updated Nomad job (forcing update if necessary)
+- [ ] Task: Verify "Clean" state in Consul (only one primary registered)
+- [ ] Task: Verify Failover/Stop behavior (immediate deregistration vs TTL expiry)
+- [ ] Task: Conductor - User Manual Verification 'Phase 3: Deployment and Verification' (Protocol in workflow.md)
157  entrypoint.sh
@@ -3,31 +3,85 @@ set -e
 
 # Configuration from environment
 SERVICE_NAME="navidrome"
+# Use Nomad allocation ID for a unique service ID
+SERVICE_ID="${SERVICE_NAME}-${NOMAD_ALLOC_ID:-$(hostname)}"
 PORT=4533
 CONSUL_HTTP_ADDR="${CONSUL_URL:-http://localhost:8500}"
 NODE_IP="${ADVERTISE_IP}"
+DB_LOCK_FILE="/data/.primary"
+NAVIDROME_PID=0
 
 # Tags for the Primary service (Traefik enabled)
 PRIMARY_TAGS='["navidrome","web","traefik.enable=true","urlprefix-/navidrome","tools","traefik.http.routers.navidromelan.rule=Host(`navidrome.service.dc1.consul`)","traefik.http.routers.navidromewan.rule=Host(`m.fbleagh.duckdns.org`)","traefik.http.routers.navidromewan.middlewares=dex@consulcatalog","traefik.http.routers.navidromewan.tls=true"]'
 
-NAVIDROME_PID=""
-SERVICE_ID="navidrome-${NODE_IP}-${SERVICE_NAME}"
+# --- Helper Functions ---
 
-cleanup() {
-  echo "Caught signal, shutting down..."
-  if [ -n "$NAVIDROME_PID" ]; then
-    echo "Stopping Navidrome (PID: $NAVIDROME_PID)..."
-    kill -TERM "$NAVIDROME_PID"
-    wait "$NAVIDROME_PID" || true
-  fi
-  echo "Deregistering service ${SERVICE_ID} from Consul..."
-  curl -s -X PUT "${CONSUL_HTTP_ADDR}/v1/agent/service/deregister/${SERVICE_ID}" || true
-  exit 0
-}
+# Register Service with TTL Check
+register_service() {
+  echo "Promoted! Registering service ${SERVICE_ID}..."
+  # Convert bash list string to JSON array if needed, but PRIMARY_TAGS is already JSON-like
+  curl -s -X PUT "${CONSUL_HTTP_ADDR}/v1/agent/service/register" -d "{
+    \"ID\": \"${SERVICE_ID}\",
+    \"Name\": \"${SERVICE_NAME}\",
+    \"Tags\": ${PRIMARY_TAGS},
+    \"Address\": \"${NODE_IP}\",
+    \"Port\": ${PORT},
+    \"Check\": {
+      \"DeregisterCriticalServiceAfter\": \"1m\",
+      \"TTL\": \"15s\"
+    }
+  }"
+}
 
-trap cleanup SIGTERM SIGINT
+# Send Heartbeat to Consul
+pass_ttl() {
+  curl -s -X PUT "${CONSUL_HTTP_ADDR}/v1/agent/check/pass/service:${SERVICE_ID}" > /dev/null
+}
 
-echo "Starting leadership-aware entrypoint..."
+# Deregister Service
+deregister_service() {
+  echo "Demoted/Stopping. Deregistering service ${SERVICE_ID}..."
+  curl -s -X PUT "${CONSUL_HTTP_ADDR}/v1/agent/service/deregister/${SERVICE_ID}"
+}
+
+# Start Navidrome in Background
+start_app() {
+  echo "Node is Primary. Starting Navidrome..."
+
+  # Ensure DB path and local data folder are set
+  export ND_DATABASE_PATH="/data/navidrome.db"
+  export ND_DATAFOLDER="/local/data"
+  mkdir -p /local/data
+
+  /app/navidrome &
+  NAVIDROME_PID=$!
+  echo "Navidrome started with PID ${NAVIDROME_PID}"
+}
+
+# Stop Navidrome
+stop_app() {
+  if [ "${NAVIDROME_PID}" -gt 0 ]; then
+    echo "Stopping Navidrome (PID ${NAVIDROME_PID})..."
+    kill -SIGTERM "${NAVIDROME_PID}"
+    wait "${NAVIDROME_PID}" 2>/dev/null || true
+    NAVIDROME_PID=0
+  fi
+}
+
+# --- Signal Handling (The Safety Net) ---
+# If Nomad stops the container, we stop the app and deregister.
+cleanup() {
+  echo "Caught signal, shutting down..."
+  stop_app
+  deregister_service
+  exit 0
+}
+
+trap cleanup TERM INT
+
+# --- Main Loop ---
+
+echo "Starting Supervisor. Waiting for leadership settle..."
+echo "Node IP: $NODE_IP"
+echo "Consul: $CONSUL_HTTP_ADDR"
@@ -35,51 +89,36 @@ echo "Consul: $CONSUL_HTTP_ADDR"
 sleep 5
 
 while true; do
   # In LiteFS 0.5, .primary file exists ONLY on replicas.
-  if [ ! -f /data/.primary ]; then
-    # PRIMARY STATE
-    if [ -z "$NAVIDROME_PID" ] || ! kill -0 "$NAVIDROME_PID" 2>/dev/null; then
-      echo "Node is Primary. Initializing Navidrome..."
-
-      # Register in Consul
-      echo "Registering as primary in Consul..."
-      curl -s -X PUT -d "{
-        \"ID\": \"${SERVICE_ID}\",
-        \"Name\": \"${SERVICE_NAME}\",
-        \"Tags\": ${PRIMARY_TAGS},
-        \"Address\": \"${NODE_IP}\",
-        \"Port\": ${PORT},
-        \"Check\": {
-          \"HTTP\": \"http://${NODE_IP}:${PORT}/app\",
-          \"Interval\": \"10s\",
-          \"Timeout\": \"2s\"
-        }
-      }" "${CONSUL_HTTP_ADDR}/v1/agent/service/register"
-
-      echo "Starting Navidrome with ND_DATABASE_PATH=/data/navidrome.db"
-      export ND_DATABASE_PATH="/data/navidrome.db"
-      export ND_DATAFOLDER="/local/data"
-
-      # Start Navidrome
-      /app/navidrome &
-      NAVIDROME_PID=$!
-      echo "Navidrome started with PID $NAVIDROME_PID"
-    fi
-  else
-    # REPLICA STATE
-    if [ -n "$NAVIDROME_PID" ] && kill -0 "$NAVIDROME_PID" 2>/dev/null; then
-      echo "Node transitioned to Replica. Stopping Navidrome..."
-      kill -TERM "$NAVIDROME_PID"
-      wait "$NAVIDROME_PID" || true
-      NAVIDROME_PID=""
-
-      echo "Deregistering primary service from Consul..."
-      curl -s -X PUT "${CONSUL_HTTP_ADDR}/v1/agent/service/deregister/${SERVICE_ID}" || true
-    fi
-
-    # We don't register anything for replicas in this version to keep it simple.
-    # But we stay alive so LiteFS keeps running.
-  fi
-
-  sleep 5
-done
+  if [ ! -f "$DB_LOCK_FILE" ]; then
+    # === WE ARE PRIMARY ===
+
+    # 1. If App is not running, start it and register
+    if [ "${NAVIDROME_PID}" -eq 0 ] || ! kill -0 "${NAVIDROME_PID}" 2>/dev/null; then
+      if [ "${NAVIDROME_PID}" -gt 0 ]; then
+        echo "CRITICAL: Navidrome crashed! Restarting..."
+      fi
+      start_app
+      register_service
+    fi
+
+    # 2. Maintain the heartbeat (TTL)
+    pass_ttl
+
+  else
+    # === WE ARE REPLICA ===
+
+    # If App is running (we were just demoted), stop it
+    if [ "${NAVIDROME_PID}" -gt 0 ]; then
+      echo "Lost leadership. Demoting..."
+      stop_app
+      deregister_service
+    fi
+
+    # No service registration exists for replicas to keep Consul clean.
+  fi
+
+  # Sleep short enough to update TTL (every 5s is safe for 15s TTL)
+  sleep 5 &
+  wait $!  # Wait allows the 'trap' to interrupt the sleep instantly
+done
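The `sleep 5 & wait $!` idiom at the end of the loop is worth calling out: bash only runs a trap handler after the current foreground command finishes, so a plain `sleep 5` would delay shutdown by up to five seconds, while backgrounding the sleep and waiting on it with the `wait` builtin lets the trap fire immediately. A stripped-down illustration (the `interruptible_sleep` wrapper is a hypothetical name, not part of the real script):

```shell
#!/bin/bash
# Demonstration of interruptible sleep: 'wait' is a shell builtin, so a
# trapped signal interrupts it at once, whereas an external 'sleep' run
# in the foreground would finish before the trap handler could run.
cleanup() {
  echo "caught signal, exiting"
  exit 0
}
trap cleanup TERM INT

interruptible_sleep() {
  sleep "$1" &   # run the external sleep in the background
  wait $!        # the builtin returns as soon as a trapped signal arrives
}
```

With no signal pending, `interruptible_sleep 5` behaves exactly like `sleep 5`; under SIGTERM from Nomad it returns immediately so `cleanup` can stop Navidrome and deregister without the extra delay.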