RFC 0009: Runtime Container Health Checks
Summary
Add a /_health endpoint to each runtime container and have the ingress poll it after deploy before reporting success.
Problem
After deploying, the ingress assumes the runtime is up. No readiness probe or health check. If the deno container crashes on startup (e.g., bad module import), the ingress happily 502s with no diagnostic.
Files to Modify
- Deno router ā add
/_healthendpoint proxy/runtimes.pyā poll health after deployproxy/deploy.pyā await health check before returning success
Implementation
- Add a
/_healthendpoint to the Deno router that returns 200 OK with{"status":"ok","uptime":<seconds>}when the router is ready to serve requests - In
proxy/runtimes.py, afterrefresh()starts a container, poll/_healthwith exponential backoff:- First poll after 500ms
- Retry up to 10 times with 500ms, 1s, 1s, 2s, 2s, 4s⦠intervals
- Total timeout: 30 seconds
- If health check passes, return success
- If health check fails after 30s, return a clear error with the container logs
- In
proxy/deploy.py, await the health check result before returning 200 to the caller - For existing containers (no restart needed), skip the health check (already healthy)
Testing & Validation Requirements
- Deploy a healthy app. Verify the deploy response comes AFTER the health check passes (not before).
- Deploy an app with a syntax error. Verify the deploy returns 500 with useful error message, not 200.
- Verify health endpoint returns correct uptime (increases over time).
- Verify health check timeout works: deploy to a container that never starts, verify it fails after ~30s.
- Verify existing running containers are NOT re-health-checked on redeploy of a different project.
Report Requirements
- Show the health endpoint code and the polling logic
- Test transcript for healthy deploy showing timing
- Test transcript for failing deploy showing the error message
- Diff of runtimes.py and deploy.py