What 99.9% Uptime Actually Means for an AI Agent

99.9% uptime means your agent can be down for about 8 hours and 46 minutes per year. That sounds fine until you think about when those hours happen.

If your AI agent is the primary way customers reach your business on WhatsApp, and it goes down on a Friday evening, you might not notice until Monday morning. That's not 8 hours of downtime. That's 60 hours of silence to every customer who tried to reach you over the weekend.
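
The arithmetic behind both numbers is easy to check. The sketch below is only an illustration: the function names are made up for this post, and it assumes a 365-day year and a single 60-hour outage with no other downtime.

```python
from datetime import timedelta

YEAR = timedelta(days=365)  # assumption: non-leap year


def downtime_budget(availability_pct: float, period: timedelta = YEAR) -> timedelta:
    """Maximum downtime allowed over `period` at the given availability."""
    return period * (1 - availability_pct / 100)


def availability_after(outage: timedelta, period: timedelta = YEAR) -> float:
    """Availability (in percent) over `period` if `outage` is the only downtime."""
    return 100 * (1 - outage / period)


budget = downtime_budget(99.9)
print(budget)  # about 8 hours 46 minutes

weekend = timedelta(hours=60)  # Friday evening to Monday morning
print(round(availability_after(weekend), 1))  # roughly 99.3, well short of 99.9
print(round(weekend / budget, 1))  # how many yearly budgets one weekend consumes
```

One unnoticed weekend uses up several years of a 99.9% budget, which is why the timing of an outage matters more than the headline percentage.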

Why agents crash

Most agent crashes are not dramatic failures. They're quiet. The process runs out of memory after handling a long conversation. The WebSocket connection to Telegram drops and the reconnection logic has a bug. The server reboots for a kernel update and the process doesn't come back up. The disk fills up with conversation logs nobody thought to rotate.

These failures share a pattern: nothing surfaces them unless you're actively monitoring. Your agent doesn't send an error message to your customers; it just stops responding. The customer thinks you're ignoring them. You think everything is fine because nobody told you otherwise.

What monitoring actually requires

Knowing your agent is down, and getting it back up, requires three things: a health check that runs frequently (every 30 seconds, not every 5 minutes), an alerting system that can reach you (email, SMS, push notification), and an automatic restart mechanism that doesn't wait for human intervention.

Most self-hosted setups have a process manager like systemd or PM2 that handles restarts. That covers the simple cases — process crashes, OOM kills. It doesn't cover the subtle cases: the agent is running but not processing messages, the WebSocket is connected but not receiving, the API key expired and every response is an error.

Real monitoring means checking that the agent is not just running, but functioning. Can it receive a message? Can it generate a response? Is it responding within acceptable latency? These checks require application-level monitoring, not just process-level.
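
One way to sketch such a check in Python is to push a synthetic message through the same code path a real customer message takes, then judge the reply. Here `handle_message` is a hypothetical stand-in for whatever function your agent uses to turn an incoming message into a response, and the 5-second latency limit is an arbitrary assumption, not a recommendation.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthResult:
    healthy: bool
    latency_s: float
    reason: str


def check_agent(
    handle_message: Callable[[str], str],  # hypothetical: your agent's entry point
    max_latency_s: float = 5.0,            # assumed threshold, tune for your agent
) -> HealthResult:
    """Run one synthetic message through the agent and judge the outcome."""
    start = time.monotonic()
    try:
        reply = handle_message("healthcheck ping")
    except Exception as exc:
        # Covers cases like an expired API key, where the process is
        # alive but every response is an error.
        return HealthResult(False, time.monotonic() - start, f"error: {exc}")

    latency = time.monotonic() - start
    if not reply or not reply.strip():
        return HealthResult(False, latency, "empty reply")
    if latency > max_latency_s:
        return HealthResult(False, latency, "too slow")
    return HealthResult(True, latency, "ok")
```

Serving this result from an HTTP endpoint lets something outside the server see it. Note that the sketch does not guard against a handler that hangs forever; the caller's own request timeout has to catch that case.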

How we handle it

Every agent we host has three layers of monitoring. The process manager restarts on crash. A health check endpoint verifies the agent can actually process messages. An external monitor pings the health check every 30 seconds and alerts us if it fails twice consecutively.
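
The "fails twice consecutively" rule is worth spelling out, because alerting on every single failed ping produces noise from one-off network blips. The following is a minimal sketch of that rule, not our production monitor: `FailureTracker`, `probe`, `monitor`, and the `alert` callback are illustrative names, and it assumes the health endpoint returns HTTP 200 when the agent is healthy.

```python
import time
import urllib.request
from typing import Callable


def probe(url: str, timeout_s: float = 10.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200 in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except Exception:
        # Timeouts, refused connections, and HTTP errors all count as a failure.
        return False


class FailureTracker:
    """Counts consecutive failed checks and says when to alert."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, ok: bool) -> bool:
        """Record one check; return True exactly when the threshold is first hit."""
        if ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.threshold


def monitor(
    url: str,
    alert: Callable[[str], None],          # hypothetical: send email, SMS, push...
    interval_s: float = 30.0,
    check: Callable[[str], bool] = probe,
) -> None:
    """Check `url` forever, alerting once per run of consecutive failures."""
    tracker = FailureTracker()
    while True:
        if tracker.record(check(url)):
            alert(f"{url} failed {tracker.threshold} checks in a row")
        time.sleep(interval_s)
```

Because `record` returns True only at the moment the threshold is reached, a long outage produces one alert rather than one every 30 seconds, and a single successful check resets the count. This loop has to run somewhere other than the agent's own server, or it goes down with the thing it is watching.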

When an agent goes down, the typical recovery time is under 60 seconds. Most of our clients have never experienced an outage they noticed. For the ones that have, the outage was resolved before they finished composing the email to ask us about it.

Uptime is not a feature you think about until you don't have it. By then, the damage is usually done — missed leads, frustrated customers, and a loss of trust in the system that's supposed to be working while you sleep.