Specification: Fix Odroid8 and Script Robustness (fix_odroid8_and_script)
Overview
Address the "critical" loop on node odroid8 caused by a LiteFS Cluster ID mismatch and improve the cluster_status script's error handling when the Nomad CLI is unavailable or misconfigured.
Functional Requirements
- Node Recovery (`odroid8`):
  - Identify the specific LiteFS data directory on `odroid8` (usually `/var/lib/litefs` inside the container, mapped to a host path).
  - Guide the user through stopping the allocation and wiping the metadata/data to resolve the Consul lease conflict.
- Script Robustness:
  - Update `nomad_client.py` to handle `subprocess` failures more gracefully.
  - If a `nomad` command fails, the script should not print a traceback or confusing "non-zero exit status" messages to the primary output.
  - Instead, it should log the error to `stderr` and continue, marking the affected fields (like logs or full ID) as "Nomad Error".
  - Add a clear warning in the output if Nomad connectivity is lost, suggesting the user verify `NOMAD_ADDR`.
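
The script-robustness requirements above could be met with a small wrapper around `subprocess.run`. The following is a minimal sketch, not the actual contents of `nomad_client.py`: the function name `run_nomad`, the `NOMAD_ERROR` constant, the `binary` parameter, and the 10-second timeout are all assumptions made for illustration.

```python
import subprocess
import sys

# Placeholder value the requirements ask for in affected fields.
NOMAD_ERROR = "Nomad Error"


def run_nomad(args, timeout=10, binary="nomad"):
    """Run a Nomad CLI command (hypothetical helper).

    Returns the command's stdout on success, or NOMAD_ERROR on any failure.
    Failures are reported on stderr only, so the primary output stays clean.
    """
    cmd = [binary, *args]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=True
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        print(f"nomad_client: {' '.join(cmd)} timed out after {timeout}s",
              file=sys.stderr)
    except subprocess.CalledProcessError as exc:
        # Prefer the CLI's own error text over the bare exit status.
        detail = (exc.stderr or "").strip() or f"exit status {exc.returncode}"
        print(f"nomad_client: {' '.join(cmd)} failed: {detail}", file=sys.stderr)
    except OSError as exc:
        # Covers a missing or non-executable binary (FileNotFoundError, etc.).
        print(f"nomad_client: could not run '{binary}': {exc}", file=sys.stderr)
    return NOMAD_ERROR
```

A caller would then treat the return value as display text, for example `full_id = run_nomad(["alloc", "status", "-short", alloc_id])`, and render `"Nomad Error"` in that field when the call fails. Returning a sentinel rather than raising keeps the rest of the status report running, which is the behavior the requirements describe.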
Non-Functional Requirements
- Reliability: The script should remain functional even if one of the underlying tools (Nomad CLI) is broken.
- Ease of Use: Provide clear, copy-pasteable commands for the manual cleanup process.
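
One way to produce copy-pasteable cleanup commands is to have the script print them with the node-specific values filled in and shell-quoted. The sketch below only builds the command strings and never runs them. The helper name `cleanup_commands` and both parameters are hypothetical, the host path must be supplied by the user because it depends on how the volume is mapped on `odroid8`, and the `find ... -mindepth 1 -delete` form (which empties the directory but keeps the directory itself) is an assumed choice rather than a confirmed procedure.

```python
import shlex


def cleanup_commands(alloc_id, host_data_dir):
    """Return shell commands for the manual LiteFS reset, in run order.

    Hypothetical helper: nothing is executed here, the strings are meant
    to be printed so the operator can review and paste them.
    """
    trimmed = host_data_dir.rstrip("/")
    # Refuse relative paths and '/', since the second command deletes files.
    if not host_data_dir.startswith("/") or trimmed == "":
        raise ValueError("host_data_dir must be an absolute path other than '/'")
    quoted_dir = shlex.quote(trimmed)
    return [
        # Stop the allocation that holds the LiteFS data directory.
        f"nomad alloc stop {shlex.quote(alloc_id)}",
        # Remove everything inside the data directory, keeping the directory.
        f"sudo find {quoted_dir} -mindepth 1 -delete",
    ]
```

Printing the result with `print("\n".join(cleanup_commands(alloc_id, host_path)))` gives the operator two lines to review before running. Note that `nomad alloc stop` also asks Nomad to reschedule the allocation, so the guidance may need to say how to keep the replacement from starting before the directory is wiped.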
Acceptance Criteria
- `odroid8` node successfully joins the cluster and shows as `passing` in Consul.
- `cluster_status` script runs without error even if the `nomad` binary is missing or cannot connect to the server (showing fallback info).
- Script provides a helpful message when `nomad` commands fail.
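
The "missing binary" and "helpful message" criteria could be covered by a preflight check that runs before any Nomad call. This is a sketch under assumptions: the function name `nomad_warning`, its parameters, and the warning wording are illustrative and not taken from the existing script. It checks only that the binary is on `PATH` and that `NOMAD_ADDR` is set; it does not test whether the server is reachable.

```python
import os
import shutil


def nomad_warning(env=None, binary="nomad"):
    """Return a warning string if the Nomad CLI looks unusable, else None.

    Hypothetical helper: `env` defaults to the process environment and is a
    parameter only so the check can be exercised without changing os.environ.
    """
    env = os.environ if env is None else env
    if shutil.which(binary, path=env.get("PATH")) is None:
        return (f"WARNING: '{binary}' binary not found on PATH; "
                "Nomad fields will show fallback values.")
    if not env.get("NOMAD_ADDR"):
        # Without NOMAD_ADDR the CLI falls back to http://127.0.0.1:4646.
        return ("WARNING: NOMAD_ADDR is not set; the nomad CLI will use "
                "http://127.0.0.1:4646. Verify NOMAD_ADDR if results look wrong.")
    return None
```

The script could print the returned string at the top of its report and continue with fallback values, which keeps the run free of errors while still pointing the user at `NOMAD_ADDR`.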
Out of Scope
- Fixing the Navidrome database path issue (this will be handled in a separate track once the cluster is stable).
- Automating the host-level cleanup (manual guidance only).