navidrome-litefs/conductor/archive/diagnose_and_enhance_20260208/plan.md

Plan: Cluster Diagnosis and Script Enhancement (diagnose_and_enhance)

Phase 1: Enhanced Diagnostics (Consul) [x] [checkpoint: a686c5b]

  • Task: Update consul_client.py to fetch detailed health check output
    • Write tests for fetching Output field from Consul checks
    • Implement logic to extract and store the Output (error message)
  • Task: Update aggregator and formatter to display Consul errors
    • Update aggregation logic to include consul_error
    • Update table formatter to indicate an error (for example, with a flag or color)
    • Add a "Diagnostics" section to the output to print full error details
  • Task: Conductor - User Manual Verification 'Phase 1: Enhanced Diagnostics (Consul)' (Protocol in workflow.md)
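
A minimal sketch of the Phase 1 extraction step, assuming the client reads Consul's `/v1/health/checks/<service>` endpoint, whose entries carry `Node`, `Name`, `Status`, and `Output` fields. The function names, the default agent address, and the shape of the returned dictionary are illustrative assumptions, not the contents of the real consul_client.py.

```python
import json
from urllib.request import urlopen


def fetch_service_checks(service, base_url="http://127.0.0.1:8500"):
    """Return the parsed check list for one service.

    base_url is an assumed local agent address; adjust for the cluster.
    """
    with urlopen(f"{base_url}/v1/health/checks/{service}", timeout=5) as resp:
        return json.load(resp)


def extract_check_errors(checks):
    """Group the Output of every non-passing check by node name."""
    errors = {}
    for check in checks:
        if check.get("Status") == "passing":
            continue
        node = check.get("Node", "unknown")
        errors.setdefault(node, []).append(
            {
                "check": check.get("Name", ""),
                "status": check.get("Status", ""),
                # Output may be missing or null; normalise to a string.
                "consul_error": (check.get("Output") or "").strip(),
            }
        )
    return errors
```

Keeping the extraction separate from the HTTP call means the tests can feed it a plain list of dictionaries without a running Consul agent.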

Phase 2: Nomad Integration and Logs [x] [checkpoint: 6d77729]

  • Task: Implement nomad_client.py wrapper
    • Write tests for get_allocation_logs, get_node_status, and restart_allocation (mocking subprocess)
    • Implement subprocess.run(["nomad", ...]) logic to fetch logs and restart allocations
  • Task: Integrate Nomad logs into diagnosis
    • Update aggregator to call Nomad client for critical nodes
    • Update "Diagnostics" section to display the last 20 lines of stderr
  • Task: Conductor - User Manual Verification 'Phase 2: Nomad Integration and Logs' (Protocol in workflow.md)
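
The Phase 2 wrapper could look roughly like the sketch below. It assumes the `nomad` binary is on `PATH` and uses the CLI's `alloc logs` options `-stderr`, `-tail`, and `-n`, plus `alloc restart`. The function signatures, timeouts, and the error-string convention are assumptions for illustration; the actual nomad_client.py may differ.

```python
import subprocess


def get_allocation_logs(alloc_id, task=None, lines=20, stderr=True):
    """Return the last `lines` lines of an allocation's log stream."""
    cmd = ["nomad", "alloc", "logs"]
    if stderr:
        cmd.append("-stderr")
    cmd += ["-tail", "-n", str(lines), alloc_id]
    if task:
        cmd.append(task)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        # Report the failure as text so a diagnostics report can still render.
        return f"(failed to fetch logs: {result.stderr.strip()})"
    return result.stdout


def restart_allocation(alloc_id):
    """Restart an allocation in place; True if the CLI exited cleanly."""
    result = subprocess.run(
        ["nomad", "alloc", "restart", alloc_id],
        capture_output=True,
        text=True,
        timeout=60,
    )
    return result.returncode == 0
```

Because every call goes through `subprocess.run`, the tests can patch that one function and assert on the exact argument list, with no Nomad cluster required.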

Phase 3: Advanced LiteFS Status [ ]

  • Task: Implement litefs_status via nomad alloc exec
    • Write tests for executing remote commands via Nomad
    • Update litefs_client.py to fall back to nomad alloc exec if HTTP fails
    • Parse litefs status output (text/json) to extract uptime and replication lag
  • Task: Final Polish and Diagnosis Run
    • Ensure all pieces work together
    • Run the script to diagnose odroid8
  • [~] Task: Conductor - User Manual Verification 'Phase 3: Advanced LiteFS Status' (Protocol in workflow.md)
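
The HTTP-then-exec fallback planned for Phase 3 might be sketched as follows. The status URL and the remote command are passed in by the caller because this plan does not specify them. All helper names are hypothetical. Since the output format is listed only as text/json, the sketch tries JSON first and otherwise keeps the raw text, without assuming particular field names for uptime or replication lag.

```python
import json
import subprocess
from urllib.request import urlopen


def fetch_via_http(url, timeout=5):
    """Fetch the status body over HTTP as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_via_alloc_exec(alloc_id, command, task=None, timeout=30):
    """Run `command` (a list of arguments) inside the allocation."""
    cmd = ["nomad", "alloc", "exec"]
    if task:
        cmd += ["-task", task]
    cmd.append(alloc_id)
    cmd += list(command)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip() or "nomad alloc exec failed")
    return result.stdout


def get_status(url, alloc_id, command, http=fetch_via_http, remote=fetch_via_alloc_exec):
    """Try HTTP first; on a connection-level failure, fall back to exec."""
    try:
        raw, source = http(url), "http"
    except OSError:  # URLError, HTTPError, and socket timeouts are OSError subclasses
        raw, source = remote(alloc_id, command), "nomad-exec"
    try:
        data = json.loads(raw)
    except ValueError:
        data = raw.strip()  # plain-text output is kept verbatim for later parsing
    return {"source": source, "data": data}
```

The `http` and `remote` parameters exist so the tests can inject stubs for both paths instead of patching the network and subprocess layers.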