Specification: Update Monitor Discovery Logic (update_monitor_discovery)
Overview
Refactor the cluster monitoring script (scripts/cluster_status) to use Nomad as the primary source of discovery. Currently, the script queries Consul for services, which only shows the Primary node in the new architecture. By querying Nomad allocations first, we can identify all running LiteFS nodes (Primary and Replicas) and then inspect their individual health and replication status.
Functional Requirements
- Nomad Client (nomad_client.py):
  - Add a function to list all active allocations for a job and extract their host IPs and node names.
- Consul Client (consul_client.py):
  - Modify to allow checking the registration status of a specific node/allocation ID rather than just listing all services.
- Aggregator (cluster_aggregator.py):
  - New Discovery Flow:
    - Query Nomad for all allocations of navidrome-litefs.
    - For each allocation:
      - Get the Node Name and IP.
      - Query the LiteFS API (:20202) on that IP for role/DB info.
      - Query Consul to see if a matching service registration exists (and its health).
- Formatter (output_formatter.py):
  - Handle nodes that are "Standby" (running in Nomad and LiteFS, but not registered in Consul).
  - Ensure the table correctly displays all 4 nodes.
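The discovery flow above could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `NodeStatus`, `discover_nodes`, the three injected callables, and the dictionary keys (`node`, `ip`, `role`) are all hypothetical names chosen for the example. Only the job name, the LiteFS port, and the "Standby" definition come from this specification.

```python
from dataclasses import dataclass
from typing import Callable, Optional

LITEFS_PORT = 20202  # LiteFS API port named in this specification


@dataclass
class NodeStatus:
    node: str
    ip: str
    litefs_role: Optional[str]    # role reported by LiteFS, if any
    consul_health: Optional[str]  # None means no matching Consul registration

    @property
    def state(self) -> str:
        # A node without a Consul registration is treated as "Standby".
        return "Standby" if self.consul_health is None else "Registered"


def discover_nodes(
    job: str,
    list_allocations: Callable[[str], list],
    query_litefs: Callable[[str, int], dict],
    query_consul: Callable[[str], Optional[str]],
) -> list:
    """Nomad first, then LiteFS and Consul for each allocation.

    All three callables are assumed wrappers around the real clients;
    their signatures are invented for this sketch.
    """
    nodes = []
    for alloc in list_allocations(job):
        litefs = query_litefs(alloc["ip"], LITEFS_PORT)
        nodes.append(
            NodeStatus(
                node=alloc["node"],
                ip=alloc["ip"],
                litefs_role=litefs.get("role"),
                consul_health=query_consul(alloc["node"]),
            )
        )
    return nodes
```

Passing the clients in as callables keeps the aggregator testable without a running cluster, which may or may not match how cluster_aggregator.py is actually wired.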
Non-Functional Requirements
- Efficiency: Minimize CLI calls by batching Nomad/Consul queries where possible.
- Robustness: Gracefully handle cases where an allocation has no IP yet (starting state).
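One possible way to meet both requirements is sketched below: fetch the Consul registrations once and index them by node name, and set aside allocations that have no IP yet instead of probing them. The function names and the record shape (dictionaries with `node` and `ip` keys) are assumptions made for illustration only.

```python
def split_by_ip(allocations):
    """Separate allocations that can be probed from those still starting.

    An allocation whose "ip" is missing or empty is assumed to be in a
    starting state and is not queried over the network.
    """
    ready, starting = [], []
    for alloc in allocations:
        (ready if alloc.get("ip") else starting).append(alloc)
    return ready, starting


def index_by_node(consul_entries):
    """Index the result of a single batched Consul query by node name."""
    return {entry["node"]: entry for entry in consul_entries}
```

With this shape, the aggregator would look up each ready allocation with `index_by_node(entries).get(alloc["node"])` rather than issuing one Consul call per node, and the formatter could still list the starting allocations with an empty IP column.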
Acceptance Criteria
- Script output shows all 4 Nomad allocations.
- Primary node is clearly identified with its Consul health status.
- Replica nodes are shown with their LiteFS role and DB status, even if not in Consul.
- Overall cluster health is calculated based on the existence of exactly one Primary and healthy replication on all nodes.
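The overall health rule in the last criterion could be expressed as a small pure function. This is a sketch under assumptions: the `role` and `replication_ok` keys and the returned `(healthy, reason)` pair are hypothetical and not taken from the existing script.

```python
def cluster_health(nodes):
    """Return (healthy, reason) for a list of per-node status dicts.

    Healthy means exactly one node reports the "primary" role and
    every node reports healthy replication.
    """
    primaries = [n for n in nodes if n.get("role") == "primary"]
    if len(primaries) != 1:
        return False, f"expected exactly 1 primary, found {len(primaries)}"

    lagging = [n["node"] for n in nodes if not n.get("replication_ok")]
    if lagging:
        return False, "unhealthy replication on: " + ", ".join(lagging)

    return True, "ok"
```

Returning a reason alongside the boolean lets the formatter explain why the cluster is degraded; an empty node list is reported as unhealthy because it contains no Primary.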