Files
consul-monitor/design.md
2025-08-09 19:27:29 -07:00

318 lines
9.5 KiB
Markdown

# Consul Service Monitor - Design Document
## Overview
A web-based dashboard application that monitors and visualizes the health status of services registered in HashiCorp Consul. The application provides real-time monitoring with historical health tracking capabilities.
## Architecture
### High-Level Components
1. **Web Frontend** - Interactive dashboard displaying service status
2. **Backend API** - REST API for data retrieval and configuration
3. **Data Collection Service** - Background service polling Consul for health data
4. **SQLite Database** - Historical health check data storage
5. **Consul Integration** - Service discovery and health check monitoring
### Technology Stack
- **Frontend**: HTML5, CSS3, JavaScript (with Chart.js for visualizations)
- **Backend**: Python 3.9+ with Flask
- **Database**: SQLite (ephemeral storage)
- **Service Discovery**: HashiCorp Consul (consul.service.dc1.consul)
- **Updates**: Periodic polling (no WebSockets needed)
## Functional Requirements
### Core Features
#### 1. Service List Display
- Display all services registered in Consul
- Show service name, ID, and tags
- Provide clickable links to service URLs
- Support sorting and filtering
#### 2. Health Status Visualization
- **Current Status Indicator**
- Green icon: All health checks passing
- Red icon: One or more health checks failing
- Yellow icon: Warning state (if supported)
- **Historical Status Chart**
- Mini bar chart showing 24-hour health history
- Time-based visualization (hourly aggregation)
- Color-coded status representation
#### 3. Auto-refresh Functionality
- Toggle switch to enable/disable auto-refresh
- Configurable refresh interval (30s, 1m, 2m, 5m, 10m)
- Visual indicator when auto-refresh is active
- Manual refresh button
#### 4. Configuration Management
- Session-based storage of user preferences (no persistence needed)
- Configurable history granularity (5m, 15m, 30m, 1h) - default: 15 minutes
## Database Schema
### Tables
```sql
-- Services table
CREATE TABLE services (
id TEXT PRIMARY KEY,
name TEXT NOT NULL,
address TEXT,
port INTEGER,
tags TEXT, -- JSON array
meta TEXT, -- JSON object
first_seen DATETIME DEFAULT CURRENT_TIMESTAMP,
last_seen DATETIME DEFAULT CURRENT_TIMESTAMP
);
-- Health checks table
CREATE TABLE health_checks (
id INTEGER PRIMARY KEY AUTOINCREMENT,
service_id TEXT NOT NULL,
check_id TEXT NOT NULL,
check_name TEXT,
status TEXT NOT NULL, -- 'passing', 'warning', 'critical'
output TEXT,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (service_id) REFERENCES services (id)
);
-- Configuration table (session-based, optional for defaults)
CREATE TABLE config (
key TEXT PRIMARY KEY,
value TEXT NOT NULL,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
-- Service URLs are generated using pattern: http://{service_name}.service.dc1.consul:{port}
-- Indexes for performance
CREATE INDEX idx_health_checks_service_timestamp ON health_checks (service_id, timestamp);
CREATE INDEX idx_health_checks_timestamp ON health_checks (timestamp);
```
## API Design
### REST Endpoints
```python
# Flask routes
GET /
- Serves main dashboard HTML page
GET /api/services
- Returns list of all services with current health status
- Generated URLs: http://{service_name}.service.dc1.consul:{port}
- Response: Array of service objects with health summary
GET /api/services/<service_id>/history
- Returns historical health data for charts
- Query params: ?granularity=15 (minutes: 5,15,30,60)
- Response: Time-series data for Chart.js
POST /api/config
- Updates session configuration
- Body: { "autoRefresh": true, "refreshInterval": 60, "historyGranularity": 15 }
GET /api/config
- Returns current session configuration
```
## Data Collection Service
### Polling Strategy
```yaml
Consul Polling:
- Interval: 60 seconds
- Consul Address: consul.service.dc1.consul:8500
- Endpoints:
- /v1/agent/services (service discovery)
- /v1/health/service/{service} (health checks)
- No authentication required
- Error handling: Log errors, continue polling
- Expected services: 30-40 services
Data Retention:
- Keep detailed data for 24 hours only (ephemeral storage)
- No long-term aggregation needed
- Database recreated on container restart
```
### Health Check Processing
1. **Data Collection**
- Poll Consul API for service list
- For each service, fetch health check status
- Store raw health check data with timestamps
2. **Status Aggregation**
- Service-level status: Worst status among all checks
- Historical aggregation: Count of passing/warning/critical per time window
3. **Change Detection**
- Compare current status with previous poll
- Trigger notifications/updates on status changes
- Maintain service registration/deregistration events
## Frontend Design
### Main Dashboard Layout
```
┌─────────────────────────────────────────────────┐
│ Consul Service Monitor [⚙️] [🔄] │
├─────────────────────────────────────────────────┤
│ Auto-refresh: [ON/OFF] Interval: [1m ▼] │
│ History granularity: [15m ▼] │
├─────────────────────────────────────────────────┤
│ Service Name │ Status │ URL │ History │
├─────────────────┼────────┼──────────┼───────────┤
│ web-api │ 🟢 │ [link] │ ▆▆█▆█▆▆ │
│ database │ 🔴 │ [link] │ █▆▆▄▂▂▄ │
│ cache-service │ 🟢 │ [link] │ ████████ │
└─────────────────────────────────────────────────┘
```
### Interactive Elements
- **Status Icons**: Visual indicators only (no detailed popup needed)
- **History Charts**: Chart.js mini bar charts with 24-hour data
- **Service Links**: URLs generated as http://{service_name}.service.dc1.consul:{port}
- **Desktop-optimized**: No mobile responsive design required
### Updates
- Periodic AJAX polling for updates
- Configurable refresh intervals (30s, 1m, 2m, 5m, 10m)
- Visual loading indicators during refresh
## Configuration Management
### User Settings (Session-based)
```json
{
"autoRefresh": {
"enabled": false,
"interval": 60,
"options": [30, 60, 120, 300, 600]
},
"display": {
"historyGranularity": 15,
"granularityOptions": [5, 15, 30, 60]
}
}
```
### System Configuration
```python
# Flask configuration
CONSUL_HOST = "consul.service.dc1.consul"
CONSUL_PORT = 8500
DATABASE_PATH = ":memory:" # Ephemeral SQLite
POLL_INTERVAL = 60 # seconds
MAX_SERVICES = 50 # Safety limit
```
## Deployment Considerations
### Docker Deployment
```dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY . .
# Expose port
EXPOSE 5000
# Set environment variables
ENV FLASK_APP=app.py
ENV FLASK_ENV=production
ENV CONSUL_HOST=consul.service.dc1.consul
ENV CONSUL_PORT=8500
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD curl -f http://localhost:5000/health || exit 1
CMD ["python", "-m", "flask", "run", "--host=0.0.0.0"]
```
### Python Dependencies (requirements.txt)
```
Flask==2.3.3
requests==2.31.0
sqlite3 # Built-in
APScheduler==3.10.4 # For background polling
```
### Environment Variables
- `CONSUL_HOST`: Consul server hostname (default: consul.service.dc1.consul)
- `CONSUL_PORT`: Consul server port (default: 8500)
- `FLASK_PORT`: Web server port (default: 5000)
- `POLL_INTERVAL`: Health check polling interval in seconds (default: 60)
### Health Checks
The application should expose its own health endpoint:
- `GET /health`: Returns application health status
- `GET /metrics`: Prometheus-style metrics (optional)
## Security Considerations
1. **Consul Access**: No authentication required for your setup
2. **Database**: Ephemeral SQLite in container memory
3. **Web Interface**: Open dashboard, no authentication needed
4. **Input Validation**: Sanitize service names and configuration inputs
5. **Container Security**: Run as non-root user in container
## Future Enhancements
- **Alerting**: Email/Slack notifications on service failures (mentioned as future feature)
- **Service Filtering**: Search and filter capabilities for larger service lists
- **Service Details**: Detailed health check information popup/modal
- **Themes**: Dark/light mode toggle
- **Export**: Export health data as CSV/JSON
- **Custom Time Ranges**: Configurable history periods beyond 24 hours
## Development Phases
### Phase 1: Core Functionality
- Basic Consul integration
- SQLite database setup
- Simple web interface
- Manual refresh capability
### Phase 2: Real-time Features
- Auto-refresh functionality
- WebSocket integration
- Historical data visualization
- Configuration persistence
### Phase 3: Enhanced UX
- Responsive design
- Advanced filtering
- Performance optimizations
- Error handling improvements
### Phase 4: Production Ready
- Docker deployment
- Security hardening
- Monitoring and logging
- Documentation and testing