chore(conductor): Archive track 'implement_ttl_heartbeat'

2026-02-09 06:06:48 -08:00
parent ed8e7608f1
commit 9b6159a40c
4 changed files with 67 additions and 0 deletions


@@ -0,0 +1,5 @@
# Track implement_ttl_heartbeat_20260208 Context
- [Specification](./spec.md)
- [Implementation Plan](./plan.md)
- [Metadata](./metadata.json)


@@ -0,0 +1,8 @@
{
"track_id": "implement_ttl_heartbeat_20260208",
"type": "enhancement",
"status": "new",
"created_at": "2026-02-08T19:00:00Z",
"updated_at": "2026-02-08T19:00:00Z",
"description": "Implement TTL Heartbeat architecture for robust Consul service registration and cleaner failure handling."
}


@@ -0,0 +1,22 @@
# Plan: Implement TTL Heartbeat Service Registration (`implement_ttl_heartbeat`)
## Phase 1: Container Environment Preparation [x] [checkpoint: 51b8fce]
- [x] Task: Update `Dockerfile` to install `curl` and `jq` (f7fe258)
- [x] Task: Verify `litefs.yml` points to `entrypoint.sh` (should already be correct) (verified)
- [x] Task: Conductor - User Manual Verification 'Phase 1: Container Environment Preparation' (Protocol in workflow.md)
## Phase 2: Script Implementation [x] [checkpoint: 139016f]
- [x] Task: Refactor `entrypoint.sh` with the TTL Heartbeat logic (d977301)
    - [x] Implement `register_service` with TTL check definition
    - [x] Implement `pass_ttl` loop
    - [x] Implement robust `stop_app` and signal trapping
    - [x] Ensure correct Primary/Replica detection logic (LiteFS 0.5: Primary = No `.primary` file)
- [x] Task: Conductor - User Manual Verification 'Phase 2: Script Implementation' (Protocol in workflow.md)
## Phase 3: Deployment and Verification [ ]
- [~] Task: Commit changes and push to Gitea to trigger build
- [ ] Task: Monitor Gitea build completion
- [ ] Task: Deploy updated Nomad job (forcing update if necessary)
- [ ] Task: Verify "Clean" state in Consul (only one primary registered)
- [ ] Task: Verify Failover/Stop behavior (immediate deregistration vs TTL expiry)
- [ ] Task: Conductor - User Manual Verification 'Phase 3: Deployment and Verification' (Protocol in workflow.md)
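
One way the "Clean" state check in Phase 3 could be scripted is sketched below. It is an illustration only: the helper names `count_instances` and `check_single_primary` and the `CONSUL_ADDR` default are assumptions, and it assumes the service is registered under the name `navidrome`. It queries Consul's health endpoint for passing instances and expects exactly one.

```shell
# Hypothetical verification helpers; names and defaults are assumptions.
CONSUL_ADDR="${CONSUL_ADDR:-http://127.0.0.1:8500}"

# Count the entries of a JSON array read from stdin.
count_instances() {
    jq 'length'
}

# Succeed only if exactly one passing "navidrome" instance is registered.
check_single_primary() {
    n="$(curl -fsS "${CONSUL_ADDR}/v1/health/service/navidrome?passing=true" | count_instances)" || return 1
    [ "$n" -eq 1 ]
}

# Example (not run here):
#   check_single_primary && echo "clean" || echo "unexpected instance count" >&2
```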


@@ -0,0 +1,32 @@
# Specification: Implement TTL Heartbeat Service Registration (`implement_ttl_heartbeat`)
## Overview
Replace the current "register and forget" Consul registration logic with a robust "TTL Heartbeat" pattern. This ensures that only the active Primary node is registered in Consul, and that its service entry is automatically removed (deregistered) if the node crashes, a failover occurs, or Nomad stops the allocation.
## Functional Requirements
- **Supervisor Script (`entrypoint.sh`):**
    - Refactor to implement the "Self-Registration" pattern with TTL checks.
    - **Leadership Detection:** Monitor `/data/.primary`, which LiteFS 0.5 creates only on replica nodes.
        - **Primary:** File is absent. Start Navidrome, register service with TTL.
        - **Replica:** File is present. Stop Navidrome, deregister service.
    - **Heartbeat:** While Primary, periodically (e.g., every 5-10s) send a `PUT` to Consul to pass the TTL check.
    - **Signal Handling:** Trap `SIGTERM`/`SIGINT` to gracefully stop Navidrome and deregister immediately.
- **Docker Image:**
    - Ensure `curl` and `jq` are installed (prerequisites for the script).
- **Nomad Configuration:**
    - Ensure `NOMAD_IP_http` and `NOMAD_PORT_http` are accessible to the task (these are standard Nomad variables, but confirm they are set).
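
A minimal sketch of the detection and heartbeat logic described above, not the actual `entrypoint.sh`. The function names `current_role`, `next_action`, and `pass_ttl`, the `CONSUL_ADDR` default, and the overridable `PRIMARY_FILE` are assumptions for illustration. The check ID relies on Consul's default of `service:<service id>` when a service is registered with a single check.

```shell
# Hypothetical sketch; names and defaults below are assumptions.
PRIMARY_FILE="${PRIMARY_FILE:-/data/.primary}"
CONSUL_ADDR="${CONSUL_ADDR:-http://127.0.0.1:8500}"
SERVICE_ID="navidrome-${NOMAD_ALLOC_ID:-local}"

# LiteFS 0.5: the .primary file exists only on replicas.
current_role() {
    if [ -e "$PRIMARY_FILE" ]; then
        echo replica
    else
        echo primary
    fi
}

# Map (previous role, current role) to the action the supervisor should take.
next_action() {
    case "$1:$2" in
        primary:primary) echo heartbeat ;;  # keep passing the TTL check
        *:primary)       echo promote ;;    # start Navidrome, register with TTL
        primary:replica) echo demote ;;     # stop Navidrome, deregister
        *)               echo idle ;;       # replica stays unregistered
    esac
}

# Mark the TTL check as passing; a non-zero exit means Consul was unreachable.
pass_ttl() {
    curl -fsS -X PUT "${CONSUL_ADDR}/v1/agent/check/pass/service:${SERVICE_ID}" >/dev/null
}
```

A supervisor loop would call `current_role` every few seconds, feed the previous and current role to `next_action`, and act on the result. Keeping the decision in a pure function makes the role transitions easy to test without a running Consul agent.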
## Non-Functional Requirements
- **Resilience:** The script must handle Consul unavailability gracefully (retries) without crashing the application loop.
- **Cleanliness:** No "ghost" services. Replicas must not appear in the service catalog.
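
The resilience requirement could be met with a small bounded-retry wrapper around each Consul call. The sketch below is an assumption about how that might look; the `retry` helper and its argument order are hypothetical and not taken from the script. The caller decides what to do on final failure, so a failed heartbeat never terminates the supervisor loop.

```shell
# Hypothetical helper: retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempt budget is exhausted.
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after ${n} attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example (not run here): log and carry on if Consul stays unreachable.
#   retry 3 1 curl -fsS -X PUT "$CONSUL_ADDR/v1/agent/check/pass/service:$SERVICE_ID" \
#       || echo "heartbeat failed, will try again next tick" >&2
```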
## Acceptance Criteria
- [ ] Navidrome runs ONLY on the Primary.
- [ ] Only ONE `navidrome` service is registered in Consul (pointing to the Primary).
- [ ] Stopping the Primary allocation results in immediate deregistration (via trap).
- [ ] Hard killing the Primary allocation results in deregistration after TTL expires (approx 15s).
- [ ] Replicas do not register any service.
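
The immediate-deregistration criterion depends on the trap path. A minimal sketch under stated assumptions: `APP_PID` holds the PID of the Navidrome process started by the script, and the names `stop_app`, `deregister_service`, and `on_term` as well as the `CONSUL_ADDR` default are illustrative, not confirmed contents of `entrypoint.sh`.

```shell
# Hypothetical sketch of graceful shutdown; names and defaults are assumptions.
CONSUL_ADDR="${CONSUL_ADDR:-http://127.0.0.1:8500}"
SERVICE_ID="navidrome-${NOMAD_ALLOC_ID:-local}"
APP_PID=""

# Stop the application if it is still running, then reap it.
stop_app() {
    if [ -n "$APP_PID" ] && kill -0 "$APP_PID" 2>/dev/null; then
        kill -TERM "$APP_PID"
        wait "$APP_PID" 2>/dev/null || true
    fi
    APP_PID=""
}

# Remove the service from the local Consul agent; ignore failures on shutdown.
deregister_service() {
    curl -fsS -X PUT "${CONSUL_ADDR}/v1/agent/service/deregister/${SERVICE_ID}" >/dev/null 2>&1 || true
}

# Stop first, then deregister, so traffic is not routed to a dying process for long.
on_term() {
    stop_app
    deregister_service
    exit 0
}

trap on_term TERM INT
```

On a hard kill the trap never runs, which is why the TTL check is still needed as a fallback.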
## Implementation Details
- **Script Name:** Keep `entrypoint.sh` for consistency with the `litefs.yml` configuration and refactor its contents.
- **Service ID:** Use `navidrome-${NOMAD_ALLOC_ID}` to ensure uniqueness and traceability.
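
As an illustration of how the service ID and TTL check might be combined, the sketch below builds a registration payload with `jq` and sends it to the local Consul agent. The function names, the `TTL` variable with its `15s` default, the `CONSUL_ADDR` default, and the `DeregisterCriticalServiceAfter` value are assumptions and do not come from this specification. To my understanding, Consul marks the check critical once the TTL lapses and removes the entry only after that separate timeout, so the value is worth verifying against the acceptance criteria.

```shell
# Hypothetical sketch; function names and defaults are assumptions.
CONSUL_ADDR="${CONSUL_ADDR:-http://127.0.0.1:8500}"

# Print the JSON body for the agent's service registration endpoint.
build_registration() {
    jq -n \
        --arg id "navidrome-${NOMAD_ALLOC_ID}" \
        --arg addr "${NOMAD_IP_http}" \
        --argjson port "${NOMAD_PORT_http}" \
        --arg ttl "${TTL:-15s}" \
        '{
            ID: $id,
            Name: "navidrome",
            Address: $addr,
            Port: $port,
            Check: {
                TTL: $ttl,
                DeregisterCriticalServiceAfter: "1m"
            }
        }'
}

# Register the service with the local Consul agent.
register_service() {
    build_registration |
        curl -fsS -X PUT --data-binary @- "${CONSUL_ADDR}/v1/agent/service/register" >/dev/null
}
```

Building the payload with `jq` rather than string concatenation keeps the port numeric and avoids quoting mistakes in the address and ID.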