chore(conductor): Archive track 'update_monitor_discovery'
@@ -0,0 +1,5 @@
# Track update_monitor_discovery_20260208 Context

- [Specification](./spec.md)
- [Implementation Plan](./plan.md)
- [Metadata](./metadata.json)
@@ -0,0 +1,8 @@
{
  "track_id": "update_monitor_discovery_20260208",
  "type": "enhancement",
  "status": "new",
  "created_at": "2026-02-08T20:00:00Z",
  "updated_at": "2026-02-08T20:00:00Z",
  "description": "Update cluster monitoring script to discover nodes via Nomad instead of Consul, ensuring all replicas are visible."
}
conductor/archive/update_monitor_discovery_20260208/plan.md (new file, 25 lines)
@@ -0,0 +1,25 @@
# Plan: Update Monitor Discovery Logic (`update_monitor_discovery`)

## Phase 1: Nomad Discovery Enhancement [x] [checkpoint: 353683e]
- [x] Task: Update `nomad_client.py` to fetch job allocations with IPs (353683e)
    - [x] Write tests for parsing allocation IPs from `nomad job status` or `nomad alloc status`
    - [x] Implement `get_job_allocations(job_id)` returning a list of dicts (id, node, ip); see the sketch after this phase
- [x] Task: Conductor - User Manual Verification 'Phase 1: Nomad Discovery Enhancement' (Protocol in workflow.md)
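A minimal sketch of the `get_job_allocations(job_id)` helper from this phase, assuming the `nomad` binary is on the PATH and that both `nomad job allocs` and `nomad alloc status` accept a `-json` flag; the JSON keys used below (`ClientStatus`, `ID`, `NodeName`, `AllocatedResources.Shared.Networks`) are assumptions to check against the installed Nomad version:

```python
import json
import subprocess


def _nomad_json(args):
    """Run a nomad CLI subcommand that emits JSON and decode its output."""
    result = subprocess.run(["nomad", *args], capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def get_job_allocations(job_id):
    """Return a list of dicts (id, node, ip) for the running allocations of job_id."""
    allocations = []
    for stub in _nomad_json(["job", "allocs", "-json", job_id]):
        if stub.get("ClientStatus") != "running":
            continue
        # Fetch the full allocation to find its host network; the field path is an assumption.
        detail = _nomad_json(["alloc", "status", "-json", stub["ID"]])
        networks = detail.get("AllocatedResources", {}).get("Shared", {}).get("Networks") or []
        ip = networks[0].get("IP") if networks else None  # None while the allocation is still starting
        allocations.append({"id": stub["ID"], "node": stub.get("NodeName"), "ip": ip})
    return allocations
```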
## Phase 2: Aggregator Refactor [x] [checkpoint: 655a9b2]
- [x] Task: Refactor `cluster_aggregator.py` to drive discovery via Nomad (655a9b2)
    - [x] Update `get_cluster_status` to call `nomad_client.get_job_allocations` first
    - [x] Update loop to iterate over allocations and supplement with LiteFS and Consul data; see the sketch after this phase
- [x] Task: Update `consul_client.py` to fetch all services once and allow lookup by IP/ID (655a9b2)
- [x] Task: Update tests for the new discovery flow (655a9b2)
- [x] Task: Conductor - User Manual Verification 'Phase 2: Aggregator Refactor' (Protocol in workflow.md)
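A hedged sketch of the refactored discovery loop in `get_cluster_status`; only `nomad_client.get_job_allocations` is named in the plan, while `consul_client.get_all_services()` and `litefs_client.get_node_info()` are hypothetical stand-ins for the script's real helpers:

```python
# Hypothetical module layout; adjust imports to match scripts/cluster_status.
import consul_client
import litefs_client
import nomad_client


def get_cluster_status(job_id="navidrome-litefs"):
    """Discover nodes via Nomad first, then supplement each with LiteFS and Consul data."""
    # One Consul round-trip up front; assumed to return a dict keyed by service address.
    services_by_ip = consul_client.get_all_services()
    nodes = []
    for alloc in nomad_client.get_job_allocations(job_id):
        node = {
            "alloc_id": alloc["id"],
            "node": alloc["node"],
            "ip": alloc["ip"],
            "litefs": None,  # role / DB info from the LiteFS API
            "consul": None,  # None means not registered in Consul (Standby)
        }
        if alloc["ip"]:  # an allocation that is still starting may have no IP yet
            node["litefs"] = litefs_client.get_node_info(alloc["ip"])
            node["consul"] = services_by_ip.get(alloc["ip"])
        nodes.append(node)
    return nodes
```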
## Phase 3: UI and Health Logic [x] [checkpoint: 21e9c3d]
- [x] Task: Update `output_formatter.py` for "Standby" nodes (21e9c3d)
    - [x] Update table formatting to handle missing Consul status for replicas
- [x] Task: Update Cluster Health calculation (21e9c3d)
    - [x] "Healthy" = 1 Primary (Consul passing) + N Replicas (LiteFS connected); see the sketch after this phase
- [~] Task: Extract Uptime from Nomad and internal LiteFS states (txid, checksum)
- [~] Task: Update aggregator and formatter to display detailed database info
- [x] Task: Final verification run (21e9c3d)
- [x] Task: Conductor - User Manual Verification 'Phase 3: Final Verification' (Protocol in workflow.md)
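A minimal sketch of the health rule above; the `role`, `connected`, and `status` keys on each node dict are assumptions about the aggregator's output shape rather than names taken from the code:

```python
def cluster_is_healthy(nodes):
    """Healthy = exactly one Consul-passing primary plus all replicas connected via LiteFS."""
    primaries = [n for n in nodes if (n.get("litefs") or {}).get("role") == "primary"]
    if len(primaries) != 1:
        return False

    primary = primaries[0]
    if (primary.get("consul") or {}).get("status") != "passing":
        return False

    replicas = [n for n in nodes if n is not primary]
    return all((n.get("litefs") or {}).get("connected") for n in replicas)
```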
conductor/archive/update_monitor_discovery_20260208/spec.md (new file, 30 lines)
@@ -0,0 +1,30 @@
# Specification: Update Monitor Discovery Logic (`update_monitor_discovery`)

## Overview
Refactor the cluster monitoring script (`scripts/cluster_status`) to use Nomad as the primary source of discovery. Currently, the script queries Consul for services, which only shows the Primary node in the new architecture. By querying Nomad allocations first, we can identify all running LiteFS nodes (Primary and Replicas) and then inspect their individual health and replication status.

## Functional Requirements
- **Nomad Client (`nomad_client.py`):**
    - Add a function to list all active allocations for a job and extract their host IPs and node names.
- **Consul Client (`consul_client.py`):**
    - Modify to allow checking the registration status of a *specific* node/allocation ID rather than just listing all services.
- **Aggregator (`cluster_aggregator.py`):**
    - **New Discovery Flow:**
        1. Query Nomad for all allocations of `navidrome-litefs`.
        2. For each allocation:
            - Get the Node Name and IP.
            - Query the LiteFS API (`:20202`) on that IP for role/DB info (see the sketch after this section).
            - Query Consul to see if a matching service registration exists (and its health).
- **Formatter (`output_formatter.py`):**
    - Handle nodes that are "Standby" (running in Nomad and LiteFS, but not registered in Consul).
    - Ensure the table correctly displays all 4 nodes.
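To make step 2 of the discovery flow concrete, a sketch of the per-allocation LiteFS probe; the `/info` path and the shape of the response are placeholders for whatever the LiteFS HTTP listener on `:20202` actually exposes in this deployment:

```python
import json
import urllib.request

LITEFS_PORT = 20202
LITEFS_INFO_PATH = "/info"  # placeholder: substitute the endpoint the script actually queries


def probe_litefs(ip, timeout=2.0):
    """Return role/DB info from one node's LiteFS API, or None if it is unreachable."""
    url = f"http://{ip}:{LITEFS_PORT}{LITEFS_INFO_PATH}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode())
    except OSError:
        return None  # node still starting, stopped, or unreachable
```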
## Non-Functional Requirements
- **Efficiency:** Minimize CLI calls by batching Nomad/Consul queries where possible.
- **Robustness:** Gracefully handle cases where an allocation has no IP yet (starting state).

## Acceptance Criteria
- [ ] Script output shows all 4 Nomad allocations.
- [ ] Primary node is clearly identified with its Consul health status.
- [ ] Replica nodes are shown with their LiteFS role and DB status, even if not in Consul (see the sketch below).
- [ ] Overall cluster health is calculated based on the existence of exactly one Primary and healthy replication on all nodes.
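To make the replica criteria concrete, a hypothetical sketch of how `output_formatter.py` could render one table row; the column set and field names are assumptions, not the script's actual layout:

```python
def format_node_row(node):
    """Build one table row; a node running in Nomad/LiteFS but absent from Consul shows as Standby."""
    litefs = node.get("litefs") or {}
    consul = node.get("consul")
    consul_cell = consul.get("status", "unknown") if consul else "Standby (not registered)"
    return [
        node.get("node", "-"),
        node.get("ip") or "pending",    # allocation may not have an IP yet
        litefs.get("role", "unknown"),  # primary / replica as reported by LiteFS
        litefs.get("db_status", "-"),   # hypothetical DB info field
        consul_cell,
    ]
```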