Spaces:
Sleeping
ENVIRONMENT.md โ Full Environment Specification
incident-response-env ยท OpenEnv-compliant ยท Apache 2.0
Complete technical reference for the environment. Read this before building a custom agent, extending the environment, or interpreting benchmark results.
Architecture Overview
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ HuggingFace Space โ
โ โ
โ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ FastAPI โโโโโโโโโโบโ IncidentResponseEnv โ โ
โ โ server/ โ calls โ environment.py โ โ
โ โ app.py โ โโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ :7860 โ โ
โ โโโโโโโโฌโโโโโโโโ โ
โ โ HTTP โ
โ โโโโโโโโผโโโโโโโโ โ
โ โ Inference โโโโ LLM (HuggingFace / OpenAI) โ
โ โ .py โ via OpenAI-compatible API โ
โ โโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
REST API Reference
Base URL: http://localhost:7860 (local) or your HF Space URL.
Current Status: Microservices tasks only. CI/CD and Kafka simulators available via reward.py routing (experimental).
POST /reset
Start a new episode.
Request:
{
"task_id": "task_cpu_spike", // Choose from the 16 available microservices tasks
"seed": 42 // optional int โ for reproducible episodes
}
Response:
{
"message": "Incident active. A hot loop in JWT validation is pegging auth-service CPU at 99%...",
"step": 0,
"done": false,
"alert": "ALERT: Login latency p99 > 8s. Auth service CPU at 99%. Users cannot sign in.",
"metrics": null
}
POST /step
Take one action in the current episode.
Request:
{"action_type": "read_logs", "target": "notification-service"}
Response:
{
"observation": {
"message": "Logs from notification-service:\n[ERROR] java.lang.OutOfMemoryError...",
"step": 1,
"done": false,
"alert": "ALERT: High error rate detected...",
"metrics": null
},
"reward": {
"value": 0.1,
"reason": "found fault evidence in notification-service logs"
},
"done": false,
"info": {
"step": 1,
"cumulative_reward": 0.1,
"evidence_found": ["logs_fault_svc"]
}
}
GET /state
Inspect current episode state (includes hidden ground truth โ for debugging only).
Response:
{
"task_id": "task_cpu_spike",
"task_name": "Auth service CPU hard loop",
"difficulty": "easy",
"hidden_fault_service": "auth-service",
"hidden_fault_type": "cpu_spike",
"step_count": 3,
"max_steps": 10,
"done": false,
"cumulative_reward": 0.27,
"evidence_found": ["logs_fault_svc", "metrics_fault_svc"]
}
โ ๏ธ The agent should NEVER call
/stateduring an episode โ it exposes the answer. This endpoint is for debugging and visualization only.
GET /grade
Get the final episode score.
Response:
{"score": 0.8750}
Score is in [0.001, 0.999]. Only meaningful after done=true. Clamped to this range to ensure meaningful differentiation between episodes.
GET /health
Liveness check.
Response:
{"status": "ok", "env": "incident-response-env", "version": "1.0.0"}
GET /tasks
List all available tasks.
Response:
{
"task_cpu_spike": {
"name": "Auth service CPU hard loop",
"difficulty": "easy",
"max_steps": 10,
"description": "A hot loop in JWT validation is pegging auth-service CPU at 99%.",
"ideal_steps": 5,
"fault_service": "auth-service",
"fault_type": "cpu_spike",
"red_herrings": [],
"cascade_step": null
},
...
}
All 16 tasks:
task_cpu_spikeโ Easy (10 steps) โ Auth CPU hot looptask_disk_fullโ Easy (10 steps) โ Postgres WAL overflowtask_db_connection_leakโ Medium (15 steps) โ Connection pool exhausted + cascadetask_redis_memory_evictionโ Medium (15 steps) โ Cache eviction cascadetask_api_rate_limitโ Medium (15 steps) โ Rate limiter misconfigurationtask_deadlock_order_serviceโ Medium (15 steps) โ Database deadlocktask_ssl_cert_expiredโ Medium (15 steps) โ TLS certificate expirationtask_slow_query_postgresโ Medium (15 steps) โ Missing database indextask_auth_service_500โ Medium (15 steps) โ Internal server errorstask_k8s_pod_crashloopโ Medium (15 steps) โ Pod crash looptask_memory_leakโ Medium (15 steps) โ Memory exhaustion + GC pausestask_thread_starvationโ Medium (15 steps) โ Thread pool exhaustiontask_canary_poisonโ Hard (20 steps) โ Canary deployment stripping headers + 2 red herringstask_clock_skewโ Hard (20 steps) โ NTP drift causing token rejections + 2 red herringstask_expertโ Hard (25 steps) โ Multi-fault: Redis + Auth + 2 red herringstask_expert_long_horizonโ Hard (50 steps) โ Long-horizon with latent secondary fault at step 35+
LLM Client Integration (Phase 1-5)
Supported Providers
The environment's LLM client supports multiple providers with failover logic:
| Provider | Models | Retry Logic | Fallback |
|---|---|---|---|
| OpenAI | GPT-4, GPT-4o, GPT-4-turbo | โ Exponential backoff | Anthropic |
| Anthropic | Claude-3, Claude-3.5 | โ Exponential backoff | HuggingFace |
| HuggingFace Router | Qwen, Llama, Mistral | โ Retry on rate limit | Groq |
| Groq | Llama, Mixtral, Gemma | โ Fast (no rate limit) | OpenAI |
Configuration
Set these environment variables to control LLM behavior:
# Required
export API_BASE_URL="https://api.openai.com/v1" # or other provider
export API_KEY="sk_..." # API credentials
# Optional
export MODEL_NAME="gpt-4o" # Model identifier
export RETRY_ATTEMPTS=5 # Max retries
export RETRY_BACKOFF_FACTOR=2 # Exponential backoff multiplier
export TIMEOUT_SECONDS=30 # Request timeout
Resilience Features
- Exponential backoff โ Retries on transient failures (429, 500-599)
- Provider failover โ Falls back to secondary provider on persistent failure
- Timeout handling โ Configurable request timeout with fallback
- Rate-limit aware โ Detects and respects rate-limit headers
๐ฎ Phase 1-2 Simulators โ Experimental Extensions
The codebase includes experimental simulators for additional incident domains. These are not yet integrated into the main environment but are available for future use and currently power the domain-aware reward system backend.
CI/CD Pipeline Simulator (simulators/cicd_simulator.py)
Models GitHub Actions / GitLab CI incidents from real-world 2025โ2026 failure patterns:
Fault Types:
- Secret rotation failures / token expiration
- Corrupt action (tag overwrite, dependency injection)
- OIDC authentication misconfig
- Runner queue saturation (pickup latency)
- Workflow injection via PR title
- Audit log manipulation
Incident Examples:
- "Deploy job failed: Cannot authenticate to AWS (OIDC audience mismatch)"
- "Build stuck in queue: 47 jobs ahead, no idle runners"
- "Production secret not found: last rotation failed 2 hours ago"
Status: Reward dispatch ready. Awaiting simulator โ environment integration.
Kafka Cluster Simulator (simulators/kafka_simulator.py)
Models Apache Kafka message streaming incidents:
Fault Types:
- Consumer lag buildup (lagging consumers)
- Partition replica imbalance
- Broker rebalance storm
- Schema incompatibility
- Message loss/ non-idempotent producer
Incident Examples:
- "Orders topic consumer lag: 1.2M messages. Processing stalled."
- "Partition 3 stuck during rebalance. No leader elected (>10s)"
- "Schema evolution incompatibility detected. Deserialization errors 98%"
Status: Reward dispatch ready. Awaiting simulator โ environment integration.
Future Roadmap
- Phase 1-3: Integrate CI/CD simulator into environment
- Phase 1-4: Integrate Kafka simulator into environment
- Phase 1-5: Multi-incident scenarios (mixing CI/CD + Kafka failures)
Enhanced Service Simulation (Phase 1-5)
Service Health Monitoring
Each service maintains a comprehensive health state including:
- Status โ UP, DEGRADED, DOWN, CRASHED
- CPU & Memory โ Real-time utilization tracking
- Error Rate โ Per-service error percentage
- Latency โ p50, p99 latency percentiles
- Thread Pool โ Active vs. max thread counts
- Connection Pool โ Database connection utilization
Service Registry & Dependency Graph
Services are registered with metadata:
api-gateway (victim only)
โโโ auth-service (victim or culprit)
โโโ order-service (victim or culprit)
โโโ notification-service (victim or culprit)
auth-service
โโโ redis-cache (session data)
โโโ postgres-db (user credentials)
order-service
โโโ redis-cache (order cache)
โโโ postgres-db (order data)
notification-service
โโโ postgres-db (notification queue)
Cascading Failure Simulation
When a service fails, downstream effects propagate:
- Service A fails โ logs show fault
- Clients of A degrade โ metrics show latency spike
- Clients' clients affected โ api-gateway shows worst symptoms
- Timeout cascade โ Red herring: api-gateway looks like culprit
Kafka Simulator (Phase 1-5)
Event streaming tasks use embedded Kafka simulation for incident scenarios involving:
- Event lag buildup โ Consumer lag exceeding thresholds
- Broker rebalancing โ Replica election delays
- Message loss events โ Simulated Kafka rebalance storms
- Topic partition imbalance โ Uneven load distribution
Example task: task_kafka_broker_failure (coming in phase 2)
1. task_cpu_spike โ Auth Service CPU hard loop
| Property | Value |
|---|---|
| Difficulty | Easy |
| Max Steps | 10 |
| Ideal Steps | 5 |
| Fault Service | auth-service |
| Fault Type | cpu_spike |
| Red Herrings | None |
| Key Signal | cpu_pct: 99, Logs: hot loop detected in JWTValidator.validate() |
| Correct Fix | restart_service โ declare_rca |
2. task_db_connection_leak โ Database connection pool exhaustion
| Property | Value |
|---|---|
| Difficulty | Medium |
| Max Steps | 15 |
| Ideal Steps | 6 |
| Fault Service | order-service |
| Fault Type | connection_pool_exhausted |
| Red Herrings | postgres-db (appears to be failing) |
| Key Signal | DB Query: active_connections: 500 / waiting_queries: 847 |
| Correct Fix | run_db_query to confirm โ declare_rca |
3. task_redis_memory_eviction โ Redis cache memory eviction cascade
| Property | Value |
|---|---|
| Difficulty | Medium |
| Max Steps | 15 |
| Ideal Steps | 5 |
| Fault Service | redis-cache |
| Fault Type | memory_eviction |
| Red Herrings | api-gateway |
| Key Signal | Cache miss rate: 89%, API latency high |
| Correct Fix | restart_service โ declare_rca |
4. task_disk_full โ PostgreSQL WAL overflow (ENOSPC)
| Property | Value |
|---|---|
| Difficulty | Easy |
| Max Steps | 10 |
| Ideal Steps | 4 |
| Fault Service | postgres-db |
| Fault Type | disk_full |
| Red Herrings | None |
| Key Signal | Logs: ENOSPC: No space left on device, Metrics: disk_used_pct: 100, wal_size_gb: 48 |
| Correct Fix | declare_rca (disk_full) |
5. task_memory_leak โ Notification Service GC Pauses
| Property | Value |
|---|---|
| Difficulty | Medium |
| Max Steps | 15 |
| Ideal Steps | 6 |
| Fault Service | notification-service |
| Fault Type | memory_leak |
| Red Herrings | None |
| Key Signal | Metrics: memory_pct: 98, gc_pause_ms: 8000-14000, Logs: Email Template Cache holding 3.4GB |
| Correct Fix | restart_service โ declare_rca |
6. task_thread_starvation โ Auth Service Thread Pool Exhaustion (OAuth)
| Property | Value |
|---|---|
| Difficulty | Medium |
| Max Steps | 15 |
| Ideal Steps | 6 |
| Fault Service | auth-service |
| Fault Type | thread_pool_exhausted |
| Red Herrings | None |
| Key Signal | Metrics: thread_pool_active: 200/200, latency_p99_ms: 30000, Logs: OAuthIdentityClient timeout after 30000ms |
| Correct Fix | declare_rca (thread_starvation) |
7. task_canary_poison โ API Gateway v2.1 Strips Auth Headers
| Property | Value |
|---|---|
| Difficulty | Hard |
| Max Steps | 20 |
| Ideal Steps | 5 |
| Fault Service | api-gateway |
| Fault Type | canary_misconfiguration |
| Red Herrings | order-service, auth-service (see 401 errors) |
| Key Signal | 10% of requests fail with 401, Logs: canary v2.1 stripping Authorization header |
| Correct Fix | declare_rca (canary_poison) |
8. task_clock_skew โ NTP Drift, JWT iat Rejected
| Property | Value |
|---|---|
| Difficulty | Hard |
| Max Steps | 20 |
| Ideal Steps | 6 |
| Fault Service | auth-service |
| Fault Type | clock_skew |
| Red Herrings | redis-cache (cache miss rate 68%), order-service (25% rejections) |
| Key Signal | Metrics: clock_drift_seconds: 480, Logs: JWT iat is in the future, NTP daemon not running |
| Correct Fix | declare_rca (clock_skew) |
9. task_expert โ Multi-Root-Cause: Redis + Auth Failures
| Property | Value |
|---|---|
| Difficulty | Expert |
| Max Steps | 25 |
| Ideal Steps | 12 |
| Fault Services | redis-cache (connection_pool_exhausted) AND auth-service (bad_deployment) |
| Cascade | At step 8, api-gateway cascades due to upstream overload |
| Red Herrings | order-service, notification-service |
| Key Signal | Login failures 62%, orders failing 0%, multiple cascading signals |
| Correct Fix | Must identify BOTH redis and auth issues, declare both in RCA |
| Difficulty | Tests: Multi-root-cause diagnosis, understanding cascading signals |
10. task_expert_long_horizon โ Long-Horizon Cascade: Latent Secondary Fault (๐ 50 STEPS)
| Property | Value |
|---|---|
| Difficulty | Expert |
| Max Steps | 50 |
| Ideal Steps | 25 |
| Initial Fault | postgres-db slow_query causing gradual degradation |
| Cascade Trigger | At step 35 (not step 8), order-service cascades |
| Root Cause | Quick restart (step 10โ15) seems to fix it, but introduces query planner bug โ secondary cascade |
| Red Herrings | api-gateway, redis-cache |
| Key Signal | Order latency steadily increasing: 500ms โ 2000ms โ 8000ms |
| Why Long-Horizon | Tests extended state tracking, planning over 50-step trajectory, avoiding quick-fix optimization traps |
| Challenge | Agent must distinguish between immediate symptom fix vs. correct structural fix; latent bug manifests much later |
| Correct Fix | Deep investigation (15โ25 steps) to identify secondary root cause, implement correct fix, avoid cascade |
| Theme Alignment | Hackathon Theme #2: Long-Horizon Planning โ pushes agents beyond shallow next-token reasoning |
Additional Tasks Available (not detailed):
task_api_rate_limit(api-gateway misconfiguration)task_deadlock_order_service(database deadlock)task_ssl_cert_expired(x509 cert expired)task_slow_query_postgres(missing db index)task_auth_service_500(null pointer exception)task_k8s_pod_crashloop(unhandled exception in pod)
Reward Function โ Full Specification
Per-Action Rewards
# read_logs
fault_service: +0.10 # "found fault evidence in logs"
api-gateway: +0.05 # "gateway logs show symptoms"
other: +0.00 # "no relevant signal"
# check_metrics
fault_service: +0.08 # "fault service metrics show anomaly"
red_herring: +0.02 # "suspicious but not fault service"
other: +0.00 # "metrics normal"
# check_health
fault_service (oom): +0.07 # "found downed service"
fault_service (other): +0.05 # "service degraded"
other: +0.00
# run_db_query
connection_pool task: +0.12 # "confirms connection pool exhaustion"
other tasks: +0.01 # "limited signal"
# restart_service
correct service + oom: +0.30 # "correct โ oom_crash resolved"
correct service + wrong: +0.10 # "restarted but wrong fix type"
wrong service: -0.30 # "WRONG TARGET โ serious penalty teaches caution"
# rollback_deployment
correct service + bad_dep: +0.30 # "correct rollback"
correct service + wrong: -0.10 # "rollback completed but wrong fix type"
wrong service: -0.30 # "WRONG TARGET โ serious penalty teaches caution"
# any repeated action
early (before 50%): -0.08 # "early redundancy penalty"
late (after 50%): -0.20 # "late redundancy penalty"
# declare_rca
correct service: 0.50 + efficiency_bonus + evidence_bonus (up to 0.999)
wrong service: -0.40 # Hard penalty for overconfident guessing
# DESIGN PHILOSOPHY: Hard SRE penalties enforce causal reasoning
# Wrong interventions (wrong service/wrong fix) carry real cost (-0.10 to -0.30)
# This forces agents to gather evidence before acting, not blind guessing.
# All rewards are clamped to [0.001, 0.999] per competition rules, so no catastrophic spirals.
Efficiency Bonus (declare_rca only)
efficiency_bonus = max(0.0, (max_steps - step_count) / max_steps) * 0.30
# Example: declare at step 4 of 10 โ (10-4)/10 * 0.30 = 0.18
# Example: declare at step 9 of 10 โ (10-9)/10 * 0.30 = 0.03
Evidence Bonus (declare_rca only)
evidence_bonus = min(len(relevant_evidence_found) * 0.05, 0.20)
# Max: 4 evidence types ร 0.05 = 0.20
Cumulative Reward
cumulative = sum(all_step_rewards) # can be negative due to hard penalties
final_score = max(0.001, min(0.999, cumulative)) # clamped to [0.001, 0.999]
Final Grade
score = final_score # Already clamped to [0.001, 0.999]
success = score >= 0.6 # success threshold
Why Hard Penalties?
The hard penalty design (-0.30 for wrong service, -0.10 for wrong fix type) creates a high-stakes environment that meaningfully tests the agent's behavior:
- Causal reasoning: Agents that read logs/metrics first and then act correctly earn high scores. Agents that guess blindly hit hard penalties.
- Realism: In real SRE, wrong actions have exponential costs (cascading failures, customer impact). Our penalties model this authentically.
- Stability: Hard penalties ensure only systematic approaches achieve competitive scores, not random wandering.
The clamping to [0.001, 0.999] ensures no single mistake is unrecoverable โ agents can learn and succeed in the same episode if they adjust course after a wrong action.
Extending the Environment
Adding a New Task
In environment.py, add to the TASKS dict:
"task_yourname": {
"name": "Human-readable name",
"difficulty": "medium",
"max_steps": 15,
"description": "One sentence description of the fault scenario.",
"alert": "ALERT: What the on-call engineer sees in their pager.",
"fault_service": "service-name", # must be in SERVICES list
"fault_type": "oom_crash" | "cpu_spike" | "connection_pool_exhausted",
"red_herrings": [], # list of service names
"ideal_steps": 4, # expected optimal step count
},
Then update openenv.yaml, inference.py, models.py and dashboard_impl.py's TASKS lists.
Adding a New Fault Type
- Add a new fault type string to the
fault_typefield - Update
_make_metrics()to return anomalous metrics for that fault - Update
_make_logs()to return appropriate log lines - Update the
step()method's remediation logic to handle the new fix action - Update
AGENT.mdfault type table
Adding a New Service
Add to the SERVICES list in environment.py and update openenv.yaml's action space description.
Reproducibility
Set seed in the /reset request for deterministic episodes:
env_reset("task_cpu_spike", seed=42)
This seeds Python's random module, making all _make_metrics() random values deterministic for that episode.
Known Limitations
- Single-instance server: The FastAPI app uses a single global
IncidentResponseEnvinstance. Concurrent agents will interfere with each other. - No persistent storage: Benchmark results are not persisted across container restarts by default.
- Simulated, not live: All metrics and logs are procedurally generated โ not from a real Kubernetes cluster.
- Limited fault diversity: Currently 9 fault types. See
SKILLS.mdfor planned expansions.