Specification: Fix Odroid8 and Script Robustness (fix_odroid8_and_script)

Overview

Address the "critical" loop on node odroid8 caused by a LiteFS Cluster ID mismatch and improve the cluster_status script's error handling when the Nomad CLI is unavailable or misconfigured.

Functional Requirements

  • Node Recovery (odroid8):
    • Identify the specific LiteFS data directory on odroid8 (usually /var/lib/litefs inside the container, mapped to a host path).
    • Guide the user through stopping the allocation and wiping the local LiteFS metadata/data, resolving the Cluster ID mismatch behind the Consul lease conflict.
  • Script Robustness:
    • Update nomad_client.py to catch subprocess failures (a non-zero exit status or a missing nomad binary) instead of letting them propagate.
    • If a nomad command fails, the script should not print a traceback or confusing "non-zero exit status" messages to the primary output.
    • Instead, it should log the error to stderr and continue, marking the affected fields (such as logs or the full ID) as "Nomad Error".
    • Add a clear warning in the output if Nomad connectivity is lost, suggesting the user verify NOMAD_ADDR.
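One way to meet the Script Robustness requirements is sketched below in Python, assuming nomad_client.py shells out to the CLI through subprocess. NomadClient, field(), connectivity_warning() and the binary/timeout parameters are hypothetical names for illustration, not the module's existing API. Each failure mode (missing binary, non-zero exit, timeout) is caught, logged to stderr, and turned into the "Nomad Error" placeholder. A flag records that the NOMAD_ADDR warning should appear in the main output.

```python
import subprocess
import sys

# Placeholder from the spec, shown wherever a Nomad-derived field is unavailable.
NOMAD_ERROR = "Nomad Error"


class NomadClient:
    """Hypothetical wrapper that never lets a failed nomad call crash the script."""

    def __init__(self, binary="nomad", timeout=10):
        self.binary = binary    # assumed to be resolved via PATH
        self.timeout = timeout  # seconds; hypothetical default
        self.failed = False     # set once any command fails

    def run(self, *args):
        """Return the command's stdout, or None if it could not run or failed."""
        try:
            result = subprocess.run(
                [self.binary, *args],
                capture_output=True,
                text=True,
                timeout=self.timeout,
                check=True,
            )
            return result.stdout
        except OSError as exc:  # includes FileNotFoundError for a missing binary
            reason = f"cannot execute {self.binary!r}: {exc.strerror or exc}"
        except subprocess.CalledProcessError as exc:
            # Prefer nomad's own error text over Python's "non-zero exit status".
            reason = (exc.stderr or "").strip() or f"exit code {exc.returncode}"
        except subprocess.TimeoutExpired:
            reason = f"timed out after {self.timeout}s"
        self.failed = True
        # Details go to stderr so the primary output stays readable.
        print(f"nomad {' '.join(args)}: {reason}", file=sys.stderr)
        return None

    def field(self, *args):
        """Return stripped output for a report field, or the NOMAD_ERROR placeholder."""
        out = self.run(*args)
        return NOMAD_ERROR if out is None else out.strip()


def connectivity_warning(client):
    """Return a one-line warning for the main output if any nomad call failed."""
    if not client.failed:
        return None
    return ("WARNING: one or more nomad commands failed; "
            "verify that NOMAD_ADDR points at a reachable Nomad server.")
```

A caller would fetch each field through the client, for example client.field("alloc", "logs", alloc_id). At the end of the report, it would print connectivity_warning(client) if that is not None. Keeping the warning separate from the per-command stderr logging gives the user a single NOMAD_ADDR hint instead of one per failed call.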

Non-Functional Requirements

  • Reliability: The script should remain functional even if one of the underlying tools (such as the Nomad CLI) is broken.
  • Ease of Use: Provide clear, copy-pasteable commands for the manual cleanup process.

Acceptance Criteria

  • The odroid8 node successfully joins the cluster, and its health checks show as passing in Consul.
  • The cluster_status script runs without error even if the nomad binary is missing or cannot connect to the server, showing fallback info instead.
  • Script provides a helpful message when nomad commands fail.
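The first acceptance criterion can be checked against Consul's HTTP health endpoint (GET /v1/health/node/<node>), which returns the node's health checks, each with a Status field. The Python sketch below makes three assumptions: the Consul node is registered as odroid8, CONSUL_HTTP_ADDR (or the default 127.0.0.1:8500) reaches a Consul agent, and CONSUL_HTTP_TOKEN is set if ACLs are enabled. fetch_node_checks and checks_passing are hypothetical helper names.

```python
import json
import os
import urllib.parse
import urllib.request


def fetch_node_checks(node, addr=None, timeout=5):
    """GET /v1/health/node/<node> from the Consul HTTP API and return the parsed list."""
    # CONSUL_HTTP_ADDR may omit the scheme (e.g. "127.0.0.1:8500").
    addr = addr or os.environ.get("CONSUL_HTTP_ADDR", "127.0.0.1:8500")
    if "://" not in addr:
        addr = "http://" + addr
    url = f"{addr.rstrip('/')}/v1/health/node/{urllib.parse.quote(node)}"
    req = urllib.request.Request(url)
    token = os.environ.get("CONSUL_HTTP_TOKEN")
    if token:  # only needed when Consul ACLs are enabled
        req.add_header("X-Consul-Token", token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def checks_passing(checks, service=None):
    """True if at least one relevant check exists and every relevant check is passing.

    With service=None every check on the node counts (including serfHealth);
    otherwise only checks whose ServiceName matches are considered.
    """
    relevant = [c for c in checks if service is None or c.get("ServiceName") == service]
    return bool(relevant) and all(c.get("Status") == "passing" for c in relevant)
```

For example, checks_passing(fetch_node_checks("odroid8")) should return True once the node has rejoined. Passing a service name narrows the check to a single service registered on the node; that name is deployment-specific.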

Out of Scope

  • Fixing the Navidrome database path issue (this will be handled in a separate track once the cluster is stable).
  • Automating the host-level cleanup (manual guidance only).