Plan: Cluster Diagnosis and Script Enhancement (diagnose_and_enhance)
Phase 1: Enhanced Diagnostics (Consul) [x] [checkpoint: a686c5b]
- Task: Update `consul_client.py` to fetch detailed health check output
- Write tests for fetching the `Output` field from Consul checks
- Implement logic to extract and store the `Output` field (error message)
- Task: Update aggregator and formatter to display Consul errors
- Update aggregation logic to include `consul_error`
- Update table formatter to indicate an error (e.g., a flag or color)
- Add a "Diagnostics" section to the output to print full error details
- Task: Conductor - User Manual Verification 'Phase 1: Enhanced Diagnostics (Consul)' (Protocol in workflow.md)
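A minimal sketch of the Consul step in Phase 1. It assumes the agent's HTTP API is at Consul's default `http://127.0.0.1:8500` and uses the `/v1/health/node/<node>` endpoint, whose check objects carry `CheckID`, `Status`, and `Output` fields. The helper names `fetch_node_checks` and `extract_check_errors` are hypothetical and are not taken from `consul_client.py`.

```python
import json
import urllib.request
from typing import Any

# Assumption: a local Consul agent listening on the default HTTP port.
CONSUL_ADDR = "http://127.0.0.1:8500"


def fetch_node_checks(node: str, addr: str = CONSUL_ADDR) -> list[dict[str, Any]]:
    """Fetch every health check registered for a node via /v1/health/node/<node>."""
    with urllib.request.urlopen(f"{addr}/v1/health/node/{node}", timeout=5) as resp:
        return json.load(resp)


def extract_check_errors(checks: list[dict[str, Any]]) -> dict[str, str]:
    """Map CheckID -> Output for every check whose Status is not "passing".

    Consul keeps the latest output of a check in its "Output" field; for a
    failing check this is usually the error message worth surfacing.
    """
    errors: dict[str, str] = {}
    for check in checks:
        if check.get("Status") != "passing":
            errors[check.get("CheckID", "?")] = (check.get("Output") or "").strip()
    return errors
```

Keeping extraction separate from the HTTP call means the tests can feed canned check lists to `extract_check_errors` without a running agent.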
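One way the aggregator and formatter changes could fit together, sketched with a hypothetical `NodeRow` record. Only the `consul_error` field comes from the plan; the other fields and the `!` flag column are illustrative choices, with the flag standing in for color so the output stays readable in plain logs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeRow:
    # Hypothetical aggregated row; only consul_error is named in the plan.
    node: str
    status: str
    consul_error: Optional[str] = None


def format_report(rows: list[NodeRow]) -> str:
    """Render a status table, flag rows with a Consul error, then list full details."""
    lines = [f"{'NODE':<12} {'STATUS':<10} ERR"]
    for row in rows:
        flag = "!" if row.consul_error else ""
        lines.append(f"{row.node:<12} {row.status:<10} {flag}".rstrip())
    failing = [r for r in rows if r.consul_error]
    if failing:
        # Diagnostics section: the full error text that does not fit in the table.
        lines.append("")
        lines.append("Diagnostics")
        for row in failing:
            lines.append(f"[{row.node}] {row.consul_error}")
    return "\n".join(lines)
```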
Phase 2: Nomad Integration and Logs [x] [checkpoint: 6d77729]
- Task: Implement `nomad_client.py` wrapper
- Write tests for `get_allocation_logs`, `get_node_status`, and `restart_allocation` (mocking `subprocess`)
- Implement `subprocess.run(["nomad", ...])` logic to fetch logs and restart allocations
- Task: Integrate Nomad logs into diagnosis
- Update aggregator to call Nomad client for critical nodes
- Update "Diagnostics" section to display the last 20 lines of stderr
- Task: Conductor - User Manual Verification 'Phase 2: Nomad Integration and Logs' (Protocol in workflow.md)
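A sketch of the `nomad_client.py` wrapper, assuming the `nomad` CLI is on `PATH` and already configured (for example through `NOMAD_ADDR`). The three method names come from the plan; the `NomadClient` class, the method parameters, and the `runner` hook are assumptions. It relies on the CLI's `alloc logs` (with `-tail`, `-n`, `-stderr`), `node status`, and `alloc restart` subcommands.

```python
import subprocess
from typing import Callable

Runner = Callable[..., subprocess.CompletedProcess]


class NomadClient:
    """Thin wrapper around the `nomad` CLI (assumed on PATH and configured)."""

    def __init__(self, runner: Runner = subprocess.run) -> None:
        # Injecting the runner lets tests pass a fake instead of patching subprocess.
        self._run = runner

    def _nomad(self, *args: str) -> str:
        # check=True raises CalledProcessError on a non-zero exit code.
        result = self._run(["nomad", *args], capture_output=True, text=True, check=True)
        return result.stdout

    def get_allocation_logs(self, alloc_id: str, task: str, lines: int = 20, stderr: bool = True) -> str:
        # Flags must precede the positional <allocation> <task> arguments.
        args = ["alloc", "logs", "-tail", "-n", str(lines)]
        if stderr:
            args.append("-stderr")
        return self._nomad(*args, alloc_id, task)

    def get_node_status(self, node_id: str) -> str:
        return self._nomad("node", "status", node_id)

    def restart_allocation(self, alloc_id: str) -> str:
        return self._nomad("alloc", "restart", alloc_id)
```

Injecting `runner` rather than calling `subprocess.run` directly is one way to satisfy "mocking subprocess": a test passes a fake that records the argument list and returns a canned `CompletedProcess`.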
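A sketch of how the aggregator might pull stderr only for critical nodes. It assumes the aggregator already knows each node's Consul status and which allocation runs on it; the `alloc_for_node` mapping and the `collect_stderr` and `last_lines` helpers are hypothetical. `fetch_stderr` is any callable taking an allocation ID, for instance a small wrapper around the Nomad client's `get_allocation_logs`.

```python
from typing import Callable


def last_lines(text: str, n: int = 20) -> str:
    """Return at most the last n lines of text."""
    return "\n".join(text.splitlines()[-n:])


def collect_stderr(
    node_status: dict[str, str],
    alloc_for_node: dict[str, str],
    fetch_stderr: Callable[[str], str],
    n: int = 20,
) -> dict[str, str]:
    """Fetch stderr for critical nodes that have a known allocation, trimmed to n lines."""
    out: dict[str, str] = {}
    for node, status in node_status.items():
        alloc_id = alloc_for_node.get(node)
        if status == "critical" and alloc_id:
            # Trim client-side too, in case the fetcher returns more than n lines.
            out[node] = last_lines(fetch_stderr(alloc_id), n)
    return out
```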
Phase 3: Advanced LiteFS Status [ ]
- Task: Implement `litefs_status` via `nomad alloc exec`
- Write tests for executing remote commands via Nomad
- Update `litefs_client.py` to fall back to `nomad alloc exec` if HTTP fails
- Parse `litefs status` output (text/JSON) to extract uptime and replication lag
- Task: Final Polish and Diagnosis Run
- Ensure all pieces work together
- Run the script to diagnose `odroid8`
- Task: Conductor - User Manual Verification 'Phase 3: Advanced LiteFS Status' (Protocol in workflow.md)
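A sketch of the HTTP-first, `nomad alloc exec` fallback for LiteFS status. Assumptions: the HTTP call is passed in as `http_fetch`; any `OSError` (which covers `urllib` connection errors and timeouts) counts as "HTTP fails"; and because this plan does not record the exact `litefs status` output format, the hypothetical `parse_status` accepts either a JSON object or `key: value` lines. The `-i=false -t=false` flags stop `nomad alloc exec` from attaching stdin or a TTY so the output can be captured.

```python
import json
import subprocess
from typing import Callable


def parse_status(text: str) -> dict[str, str]:
    """Parse status output as a JSON object if possible, else as 'key: value' lines.

    Assumption: the real `litefs status` format is unknown here, so both shapes
    are accepted and keys are lower-cased but otherwise kept verbatim.
    """
    try:
        data = json.loads(text)
        if isinstance(data, dict):
            return {str(k).lower(): str(v) for k, v in data.items()}
    except json.JSONDecodeError:
        pass
    result: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key.strip().lower()] = value.strip()
    return result


def litefs_status(
    alloc_id: str,
    task: str,
    http_fetch: Callable[[], str],
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> dict[str, str]:
    """Try the HTTP path first; on OSError, run `litefs status` inside the allocation."""
    try:
        return parse_status(http_fetch())
    except OSError:
        cmd = ["nomad", "alloc", "exec", "-i=false", "-t=false",
               "-task", task, alloc_id, "litefs", "status"]
        proc = runner(cmd, capture_output=True, text=True, check=True)
        return parse_status(proc.stdout)
```

Extracting uptime and replication lag then reduces to looking up whichever keys the real output uses, which the tests for this task would pin down.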