Specification: Cluster Diagnosis and Script Enhancement (diagnose_and_enhance)
Overview
Investigate the root cause of the "critical" status on node odroid8 and enhance the cluster_status script to provide deeper diagnostic information, including Consul health check details, Nomad logs, and LiteFS status via nomad alloc exec.
Functional Requirements
- Consul Diagnostics:
  - Update consul_client.py to fetch the Output field from Consul health checks.
  - Display the specific error message from Consul when a node is in a non-passing state.
- Nomad Integration:
  - Implement a nomad_client.py (or similar) to interact with the Nomad CLI.
  - Log Retrieval: For nodes with health issues, fetch the last 20 lines of stderr from the relevant Nomad allocation.
  - Internal Status: Implement a method to run litefs status inside the container using nomad alloc exec to retrieve uptime and replication lag when the HTTP API is unavailable.
- LiteFS API Investigation:
  - Investigate why the /status endpoint returns a 404 and attempt to resolve it via configuration or by identifying the correct endpoint for LiteFS 0.5.
- Output Formatting:
  - Update the CLI table to display the retrieved uptime and replication lag.
  - Add a "Diagnostics" section at the bottom of the output when errors are detected, showing the Consul check output and Nomad logs.
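
The Consul and output-formatting requirements above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script's actual implementation: it assumes a Consul agent is reachable at http://127.0.0.1:8500 and queries the /v1/health/node/<node> endpoint, whose check entries carry Name, Status, and Output fields. The function names, the dictionary layout passed to render_report, and the column widths are all hypothetical.

```python
import json
import urllib.request

CONSUL_ADDR = "http://127.0.0.1:8500"  # assumption: local Consul agent


def fetch_node_checks(node, addr=CONSUL_ADDR):
    """Return the list of health checks Consul reports for a node."""
    url = f"{addr}/v1/health/node/{node}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def non_passing_outputs(checks):
    """Keep only checks that are not passing, along with their Output text."""
    return [
        {
            "name": c.get("Name", ""),
            "status": c.get("Status", ""),
            "output": (c.get("Output") or "").strip(),
        }
        for c in checks
        if c.get("Status") != "passing"
    ]


def render_report(rows, diagnostics):
    """Build the status table plus an optional Diagnostics section.

    rows: list of dicts with node, status, uptime, lag (hypothetical layout).
    diagnostics: dict mapping node -> {"checks": [...], "logs": [...]}.
    """
    lines = [f"{'NODE':<12}{'STATUS':<10}{'UPTIME':<10}{'LAG':<8}".rstrip()]
    for r in rows:
        lines.append(
            f"{r['node']:<12}{r['status']:<10}"
            f"{r.get('uptime', '-'):<10}{r.get('lag', '-'):<8}".rstrip()
        )
    if diagnostics:  # only shown when errors were detected
        lines.append("")
        lines.append("Diagnostics")
        for node, diag in diagnostics.items():
            lines.append(f"[{node}]")
            for chk in diag.get("checks", []):
                lines.append(
                    f"  consul {chk['name']} ({chk['status']}): {chk['output']}"
                )
            for log_line in diag.get("logs", []):
                lines.append(f"  nomad stderr: {log_line}")
    return "\n".join(lines)
```

A caller would pass the result of fetch_node_checks("odroid8") through non_passing_outputs and hand the filtered checks, together with any retrieved Nomad log lines, to render_report. Keeping the filtering and rendering free of network calls makes both easy to test against canned data.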
Non-Functional Requirements
- CLI Dependency: The script assumes the nomad CLI is installed and configured in the local system's PATH.
- Language: Python 3.x.
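
Because the script depends on the nomad CLI, a wrapper along the lines of the nomad_client.py described under Functional Requirements might check for the binary first and then shell out to it. The sketch below is an illustration under assumptions rather than a definitive implementation: the function names and the injectable runner are hypothetical, the allocation ID and task name must be supplied by the caller, and because the output format of litefs status is not specified here, the raw text is returned unparsed.

```python
import shutil
import subprocess


def require_nomad():
    """Fail early if the nomad binary is not on PATH."""
    path = shutil.which("nomad")
    if path is None:
        raise RuntimeError("nomad CLI not found on PATH")
    return path


def stderr_tail_cmd(alloc_id, task, lines=20):
    """argv for the last `lines` lines of a task's stderr log."""
    return [
        "nomad", "alloc", "logs",
        "-stderr", "-tail", "-n", str(lines),
        alloc_id, task,
    ]


def litefs_status_cmd(alloc_id, task):
    """argv to run `litefs status` inside the task's container."""
    return ["nomad", "alloc", "exec", "-task", task, alloc_id, "litefs", "status"]


def run(cmd, runner=subprocess.run, timeout=15):
    """Run a command; return (ok, text) where text is stdout or, on failure, stderr.

    `runner` is injectable so the wrapper can be tested without a cluster.
    """
    proc = runner(cmd, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()
```

Building the argument lists separately from executing them keeps the command construction visible and testable, and passing a list (rather than a shell string) to subprocess.run avoids shell quoting issues with allocation IDs or task names.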
Acceptance Criteria
- Script displays the specific reason for odroid8's critical status from Consul.
- Script successfully retrieves and displays Nomad stderr logs for failing nodes.
- Script displays LiteFS uptime and replication lag (retrieved via alloc exec or HTTP).
- The root cause of odroid8's failure is identified and reported.
Out of Scope
- Automatic fixing of the identified issues (diagnosis only).
- Direct SSH into nodes (using Nomad/Consul APIs/CLIs only).