navidrome-litefs/conductor/archive/diagnose_and_enhance_20260208/spec.md

Specification: Cluster Diagnosis and Script Enhancement (diagnose_and_enhance)

Overview

Investigate the root cause of the "critical" status on node odroid8 and enhance the cluster_status script to provide deeper diagnostic information, including Consul health check details, Nomad logs, and LiteFS status via nomad alloc exec.

Functional Requirements

  • Consul Diagnostics:
    • Update consul_client.py to fetch the Output field from Consul health checks.
    • Display the specific error message from Consul when a node is in a non-passing state.
  • Nomad Integration:
    • Implement a nomad_client.py (or similar) to interact with the Nomad CLI.
    • Log Retrieval: For nodes with health issues, fetch the last 20 lines of stderr from the relevant Nomad allocation.
    • Internal Status: Implement a method to run litefs status inside the container using nomad alloc exec to retrieve uptime and replication lag when the HTTP API is unavailable.
  • LiteFS API Investigation:
    • Investigate why the /status endpoint returns a 404 and attempt to resolve it via configuration or by identifying the correct endpoint for LiteFS 0.5.
  • Output Formatting:
    • Update the CLI table to display the retrieved uptime and replication lag.
    • Add a "Diagnostics" section at the bottom of the output when errors are detected, showing the Consul check output and Nomad logs.
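
The Consul part of these requirements can be sketched as follows. This is a minimal illustration, not the project's actual consul_client.py: it assumes a Consul agent reachable at the default local address and uses the standard /v1/health/node/<node> endpoint, whose check objects carry Node, Name, Status, and Output fields. The function names fetch_node_checks and failing_checks are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a local Consul agent on the default HTTP port.
CONSUL_ADDR = "http://127.0.0.1:8500"


def fetch_node_checks(node, base_url=CONSUL_ADDR, timeout=5.0):
    """Return the list of health checks Consul reports for one node."""
    url = f"{base_url}/v1/health/node/{urllib.parse.quote(node)}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def failing_checks(checks):
    """Keep only non-passing checks, reduced to the fields the report needs."""
    return [
        {
            "node": c.get("Node", ""),
            "name": c.get("Name", ""),
            "status": c.get("Status", ""),
            # Output may be missing or empty; normalise it to a stripped string.
            "output": (c.get("Output") or "").strip(),
        }
        for c in checks
        if c.get("Status") != "passing"
    ]
```

Calling failing_checks(fetch_node_checks("odroid8")) would then yield one entry per warning or critical check, each with the Output text to show in the report. Keeping the filtering separate from the HTTP call lets the filter be tested against canned JSON without a running cluster.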

Non-Functional Requirements

  • CLI Dependency: The script assumes the nomad CLI is installed and configured in the local system's PATH.
  • Language: Python 3.x.
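
Given the CLI dependency, a thin wrapper around the nomad binary might look like the sketch below. It is an assumption about how nomad_client.py could be shaped, not its actual contents. The helper names are hypothetical; the flags used (-stderr, -tail, -n for nomad alloc logs, and -task for nomad alloc exec) are standard Nomad CLI options, and the litefs status command is taken verbatim from this specification rather than verified against LiteFS.

```python
import shutil
import subprocess


def require_nomad():
    """Fail early with a clear message if the nomad CLI is not in PATH."""
    path = shutil.which("nomad")
    if path is None:
        raise RuntimeError("nomad CLI not found in PATH")
    return path


def stderr_logs_cmd(alloc_id, lines=20):
    """Command that prints the last `lines` lines of an allocation's stderr."""
    return ["nomad", "alloc", "logs", "-stderr", "-tail", "-n", str(lines), alloc_id]


def litefs_status_cmd(alloc_id, task=None):
    """Command that runs `litefs status` inside the allocation's container."""
    cmd = ["nomad", "alloc", "exec"]
    if task:
        # -task is only needed when the allocation has more than one task.
        cmd += ["-task", task]
    return cmd + [alloc_id, "litefs", "status"]


def run(cmd, timeout=15):
    """Run a nomad command and return (exit code, stdout, stderr) as text."""
    require_nomad()
    proc = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout, check=False
    )
    return proc.returncode, proc.stdout, proc.stderr
```

Building the argument list separately from executing it keeps the command shape testable on machines without Nomad, and passing a list (rather than a shell string) avoids quoting problems with allocation IDs or task names.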

Acceptance Criteria

  • Script displays the specific reason for odroid8's critical status from Consul.
  • Script successfully retrieves and displays Nomad stderr logs for failing nodes.
  • Script displays LiteFS uptime and replication lag (retrieved via nomad alloc exec or the HTTP API).
  • The root cause of odroid8's failure is identified and reported.
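
To make the first two criteria easy to check by eye, the "Diagnostics" section could be rendered roughly as below. This is only a sketch under assumptions: the entry keys (node, check, status, output, stderr_tail) and the function name render_diagnostics are hypothetical, and the layout is one possible choice rather than the script's actual format.

```python
def render_diagnostics(entries):
    """Build the "Diagnostics" section text; return "" when nothing failed."""
    if not entries:
        return ""
    lines = ["Diagnostics", "==========="]
    for e in entries:
        lines.append(f"[{e['node']}] {e['check']}: {e['status']}")
        # Show the raw Consul check output, or a marker if Consul gave none.
        lines.append(f"  Consul output: {e['output'] or '(empty)'}")
        logs = e.get("stderr_tail") or []
        if logs:
            lines.append("  Nomad stderr (last lines):")
            lines.extend(f"    {line}" for line in logs)
    return "\n".join(lines)
```

Returning an empty string for an empty list lets the caller print the section only when errors were detected, as the output formatting requirement asks.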

Out of Scope

  • Automatic fixing of the identified issues (diagnosis only).
  • Direct SSH access to nodes (diagnostics use only the Nomad and Consul APIs and CLIs).