Stability-First Architecture
Core Principle: Users should never need to manually manage the LLM server. The system must be self-healing and provide clear feedback when issues occur.
Design Tenets
1. Zero Manual Intervention
- Server automatically starts, stops, and restarts as needed
- No pkill commands required from users
- System detects and recovers from crashes automatically
2. Clear Feedback Loop
- Every error message includes actionable solutions
- Users know exactly what's happening and what to do
- No silent failures or confusing states
3. Intelligent Recovery
- Detect server health issues (crashes, context mismatches, OOM)
- Automatically retry with safer configurations (CPU-only, smaller context)
- Provide feedback about degraded performance modes
4. Graceful Degradation
- When GPU fails → Fall back to CPU mode
- When context is too large → Suggest splitting the input or using a cloud backend
- When server is unreachable → Auto-restart with clear messaging
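The retry-with-safer-configurations idea from tenets 3 and 4 can be sketched as a fallback ladder. This is a minimal illustration under stated assumptions, not scmd's actual implementation: `ServerConfig`, `fallback_configs`, `start_with_fallback`, the `try_start` callable, and the 4096-token floor are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterator, Optional


@dataclass(frozen=True)
class ServerConfig:
    """Hypothetical launch settings; field names are assumptions."""
    context_size: int
    use_gpu: bool


def fallback_configs(initial: ServerConfig,
                     min_context: int = 4096) -> Iterator[ServerConfig]:
    """Yield configurations from most capable to safest."""
    yield initial
    # First degradation step: same context size, but CPU-only.
    if initial.use_gpu:
        yield replace(initial, use_gpu=False)
    # Further steps: halve the context on CPU until the assumed floor.
    ctx = initial.context_size // 2
    while ctx >= min_context:
        yield ServerConfig(context_size=ctx, use_gpu=False)
        ctx //= 2


def start_with_fallback(
    initial: ServerConfig,
    try_start: Callable[[ServerConfig], bool],
) -> tuple[Optional[ServerConfig], list[ServerConfig]]:
    """Try each config in order; return the one that worked plus all tried.

    `try_start` is injected so the ladder can be tested without a real server.
    """
    tried: list[ServerConfig] = []
    for cfg in fallback_configs(initial):
        tried.append(cfg)
        if try_start(cfg):
            return cfg, tried
    return None, tried
```

Returning the list of attempted configurations alongside the result keeps enough information to tell the user which degraded mode is active and what was tried before it.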
Implementation
Health Check System
┌─────────────────────────────────────────┐
│ User runs: scmd /explain                │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│ Pre-inference Health Check              │
│ • Is server running?                    │
│ • Does context size match expected?     │
│ • Is server responsive?                 │
└──────────────┬──────────────────────────┘
               │
        ┌──────┴──────┐
        │  Healthy?   │
        └──────┬──────┘
               │
        ┌──────┴──────────┐
        │ YES          NO │
        │                 │
        ▼                 ▼
   ┌─────────┐     ┌──────────────┐
   │ Execute │     │ Auto-Restart │
   │ Request │     │ with Retry   │
   └─────────┘     └──────┬───────┘
                          │
                          ▼
                   ┌──────────────┐
                   │ Show Clear   │
                   │ Feedback     │
                   └──────────────┘
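The flow in the diagram can be sketched roughly as follows. This is an illustrative outline rather than scmd's actual code: `HealthReport`, `check_health`, `run_request`, and the injected probe callables (`is_running`, `ping`, `reported_context`) are hypothetical names, and the default of two restarts is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class HealthReport:
    running: bool
    responsive: bool
    context_ok: bool

    @property
    def healthy(self) -> bool:
        return self.running and self.responsive and self.context_ok


def check_health(is_running: Callable[[], bool],
                 ping: Callable[[], bool],
                 reported_context: Callable[[], int],
                 expected_context: int) -> HealthReport:
    """Run the three pre-inference checks, stopping at the first failure.

    The probes are injected (assumption) so no real server is needed here.
    """
    if not is_running():
        return HealthReport(False, False, False)
    if not ping():
        return HealthReport(True, False, False)
    return HealthReport(True, True, reported_context() == expected_context)


def run_request(check: Callable[[], HealthReport],
                restart: Callable[[], None],
                execute: Callable[[], T],
                max_restarts: int = 2) -> T:
    """Execute only when healthy; otherwise auto-restart and re-check."""
    for attempt in range(max_restarts + 1):
        if check().healthy:
            return execute()
        if attempt < max_restarts:
            restart()
    # Recovery failed: the caller turns this into clear user feedback.
    raise RuntimeError(f"server still unhealthy after {max_restarts} restarts")
```

Stopping at the first failed check avoids querying a server that is not running, and the final exception is the point where the "Show Clear Feedback" step would take over.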
Error Handling Hierarchy
1. Detection: Parse error to identify root cause
   - Context size exceeded
   - Out of memory
   - Server crashed
   - Connection failed
2. Auto-Recovery: Try to fix automatically
   - Restart server
   - Reduce context size
   - Switch to CPU mode
   - Use different model
3. User Feedback: When auto-recovery fails
   - Show what went wrong (in user terms)
   - Show what was tried (transparency)
   - Show what user can do (actionable steps)
   - Provide one-command fixes when possible
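The detection and auto-recovery steps above could be sketched as a small classifier plus a lookup table. The regular expressions below are hypothetical placeholders, not real llama-server log messages, and the mapping from cause to recovery actions is an assumption made for illustration; `RootCause`, `classify`, and `RECOVERY_PLAN` are invented names.

```python
import re
from enum import Enum


class RootCause(Enum):
    CONTEXT_EXCEEDED = "context size exceeded"
    OUT_OF_MEMORY = "out of memory"
    CONNECTION_FAILED = "connection failed"
    SERVER_CRASHED = "server crashed"
    UNKNOWN = "unknown"


# Hypothetical patterns (assumption): real error text must be collected
# from the actual server. Order matters; the first match wins.
_PATTERNS = [
    (RootCause.CONTEXT_EXCEEDED,
     re.compile(r"context (size|length|window)|too many tokens", re.I)),
    (RootCause.OUT_OF_MEMORY,
     re.compile(r"out of memory|\boom\b|failed to allocate", re.I)),
    (RootCause.CONNECTION_FAILED,
     re.compile(r"connection (refused|reset|failed)|timed? ?out", re.I)),
    (RootCause.SERVER_CRASHED,
     re.compile(r"crash|exited with|segmentation fault", re.I)),
]

# Assumed ordering of the recovery actions listed in the hierarchy.
RECOVERY_PLAN = {
    RootCause.CONTEXT_EXCEEDED: ["reduce context size", "use different model"],
    RootCause.OUT_OF_MEMORY: ["switch to CPU mode", "reduce context size"],
    RootCause.CONNECTION_FAILED: ["restart server"],
    RootCause.SERVER_CRASHED: ["restart server", "switch to CPU mode"],
    RootCause.UNKNOWN: ["restart server"],
}


def classify(error_text: str) -> RootCause:
    """Map raw error text to a root cause, or UNKNOWN if nothing matches."""
    for cause, pattern in _PATTERNS:
        if pattern.search(error_text):
            return cause
    return RootCause.UNKNOWN
```

Keeping the patterns in one ordered table makes each detection rule individually testable, which lines up with the unit-testing goal of covering every error detection pattern.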
Example Error Message Flow
BAD (Current state in some areas):
GOOD (Target state):
❌ Input too large for available GPU memory

What happened:
  Your input (5502 tokens) exceeds GPU memory capacity (4096 tokens)
  Metal GPU cannot allocate enough VRAM for the full context

What I tried:
  ✓ Started llama-server with 32K context
  ✗ GPU allocated only 4K (memory limitation)

What you can do:
  1. Use CPU mode (slower, supports full 32K):
     export SCMD_CPU_ONLY=1
     scmd /explain <your-input>

  2. Split your input into smaller files

  3. Use cloud backend (fastest):
     export OPENAI_API_KEY=your-key
     scmd -b openai /explain <your-input>
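One way to keep every message in this three-part shape is to build it from structured data instead of hand-written strings. The sketch below is an assumption about how that might look; `UserError`, `Attempt`, and `Fix` are hypothetical types, and rejecting an error with no fixes is an illustrative way to enforce the "actionable next steps" rule.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    description: str
    succeeded: bool


@dataclass
class Fix:
    description: str
    commands: list[str] = field(default_factory=list)


@dataclass
class UserError:
    title: str
    what_happened: list[str]
    tried: list[Attempt]
    fixes: list[Fix]

    def __post_init__(self) -> None:
        # Assumed policy: an error without next steps is a bug.
        if not self.fixes:
            raise ValueError("every error needs at least one actionable fix")

    def render(self) -> str:
        lines = [f"❌ {self.title}", "", "What happened:"]
        lines += [f"  {text}" for text in self.what_happened]
        lines += ["", "What I tried:"]
        for attempt in self.tried:
            mark = "✓" if attempt.succeeded else "✗"
            lines.append(f"  {mark} {attempt.description}")
        lines += ["", "What you can do:"]
        for number, fix in enumerate(self.fixes, start=1):
            # A trailing colon introduces the copy-pasteable commands.
            suffix = ":" if fix.commands else ""
            lines.append(f"  {number}. {fix.description}{suffix}")
            lines += [f"     {command}" for command in fix.commands]
        return "\n".join(lines)
```

Because the commands are stored as data, a test can check that each suggested command is still valid instead of parsing it back out of prose.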
Testing Strategy
Unit Tests
- Test each error detection pattern
- Test auto-recovery logic
- Test health check functions
Integration Tests
- Simulate server crashes
- Test recovery from OOM situations
- Validate error messages are helpful
User Testing
- Can users recover from errors without documentation?
- Do error messages make sense to non-technical users?
- Are suggested commands copy-pasteable and working?
Monitoring Points
1. Server Startup
   - Did server start successfully?
   - Is reported context size what we requested?
   - Is GPU/CPU mode correct?
2. Inference Request
   - Is server still running?
   - Does context size still match?
   - Is response time acceptable?
3. Error Recovery
   - Did auto-restart succeed?
   - Did fallback configuration work?
   - Was user informed of changes?
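As an illustration of the startup checks, the requested settings can be compared with what the server reports after launch. This is a sketch under assumptions: `ServerState` and `startup_mismatches` are hypothetical names, and how the reported values are obtained from the server is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerState:
    """Hypothetical snapshot of launch settings."""
    context_size: int
    gpu_enabled: bool


def startup_mismatches(requested: ServerState,
                       reported: ServerState) -> list[str]:
    """Describe each difference between requested and reported settings."""
    problems: list[str] = []
    if reported.context_size != requested.context_size:
        problems.append(
            f"context size: requested {requested.context_size}, "
            f"server reports {reported.context_size}"
        )
    if reported.gpu_enabled != requested.gpu_enabled:
        wanted = "GPU" if requested.gpu_enabled else "CPU"
        actual = "GPU" if reported.gpu_enabled else "CPU"
        problems.append(f"mode: requested {wanted}, server runs on {actual}")
    return problems
```

An empty list means the startup matched the request; a non-empty list can feed the user-facing message directly, as in the 32K-requested, 4K-allocated case from the example error.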
Success Metrics
- Zero manual server management: Users never use pkill or manual restarts
- Clear feedback: 100% of errors include actionable next steps
- Auto-recovery: 90%+ of issues resolved without user intervention
- User confidence: Users trust the system will handle server issues
Future Enhancements
- Persistent Server: Keep server running between commands
- Smart Caching: Cache model loads to reduce startup time
- Multi-Model Support: Automatically switch to smaller model when needed
- Cloud Fallback: Auto-suggest cloud when local fails repeatedly