Day 54 - Advanced Alerting and SRE Practices
🚨 Alerting Like a Google SRE
"Bad alerts wake engineers at 3 AM for nothing. Smart alerts only wake you when something truly critical is happening. 🎯"
🎯 Goals for the day:
- ✅ Understand SLIs, SLOs, and error budgets
- ✅ Configure AlertManager with intelligent escalation
- ✅ Create alerts that do NOT generate fatigue
- ✅ Integrate with Slack/Discord for incident response
- ✅ Practice realistic incident response
🤔 The Problem: Alert Fatigue
Bad alerts (the kind everyone hates):
🔥 CRITICAL: CPU > 50% (fires 100x/day)
🔥 WARNING: Response time > 200ms (constant noise)
🔥 ERROR: 1 request failed (not critical)
Result: engineers end up ignoring ALL alerts 😴
Good alerts (SRE style):
🚨 CRITICAL: SLO breach - 99.9% availability violated
⚠️ WARNING: Error budget will be exhausted in 2 hours at the current rate
📊 INFO: Performance degraded but within SLO bounds
Result: engineers respond immediately 🚀
📊 Step 1: Define SLIs and SLOs
What are SLIs, SLOs, and SLAs?
SLI (Service Level Indicator) = the metric you measure
Availability = (successful requests / total requests)
Latency = P99 response time
Throughput = requests per second
SLO (Service Level Objective) = the target you want to hit
Availability SLO: 99.9% (≈8.77 hours of downtime/year)
Latency SLO: P99 < 500ms
Throughput SLO: handle 1000 RPS
SLA (Service Level Agreement) = a contractual commitment
"If we miss 99.9% availability, you get your money back"
Defining SLOs for your app:
slos-config.yaml:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-definitions
  namespace: mi-app
data:
  slos.yaml: |
    # SLOs for mi-app
    services:
      mi-app:
        availability:
          slo: 99.9%   # ~43 minutes of downtime per 30-day month
          sli: |
            rate(app_requests_total{status=~"2.."}[30d]) /
            rate(app_requests_total[30d])
        latency:
          slo: 95%       # 95% of requests < 500ms
          threshold: 0.5 # 500ms
          sli: |
            histogram_quantile(0.95,
              rate(app_request_duration_seconds_bucket[5m])
            )
        error_budget:
          period: 30d        # monthly error budget
          burn_rate_fast: 6  # 6x normal rate = critical
          burn_rate_slow: 3  # 3x normal rate = warning
```

```bash
kubectl apply -f slos-config.yaml
```

Note that this ConfigMap is documentation: the rules that actually enforce these targets are the PrometheusRule objects you will create in Step 2.
🔥 Step 2: Error Budget and Burn Rate
What is an error budget?
If your SLO is 99.9% availability, you have a 0.1% error budget:
- Over 30 days = ~43 minutes of allowed downtime
- Error budget = "how much you can fail before breaching the SLO"
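Burn rate makes this concrete: it is the ratio between your actual error rate and the rate that would consume the budget exactly on schedule. At N times the allowed rate, a 30-day budget lasts 30/N days. A quick sketch of the arithmetic (plain shell, assuming only a 30-day window):

```bash
# Burn-rate intuition: at N x the allowed error rate,
# a 30-day error budget is fully consumed in 30/N days
for rate in 1 3 6 14.4; do
  awk -v r="$rate" 'BEGIN {
    printf "burn rate %4.1fx -> budget exhausted in %4.1f days\n", r, 30 / r
  }'
done
```

This is why the rules below pair a 6x burn with "critical" and a 3x burn with "warning".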
Burn Rate Alerting:
error-budget-alerts.yaml:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: error-budget-alerts
  namespace: mi-app
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: slo.rules
      interval: 30s
      rules:
        # Calculate availability SLI
        - record: sli:availability:rate5m
          expr: |
            rate(app_requests_total{status=~"2.."}[5m]) /
            rate(app_requests_total[5m])
        - record: sli:availability:rate30m
          expr: |
            rate(app_requests_total{status=~"2.."}[30m]) /
            rate(app_requests_total[30m])
        - record: sli:availability:rate6h
          expr: |
            rate(app_requests_total{status=~"2.."}[6h]) /
            rate(app_requests_total[6h])
        # Calculate latency SLI
        - record: sli:latency:p99:rate5m
          expr: |
            histogram_quantile(0.99,
              rate(app_request_duration_seconds_bucket[5m])
            )
        # Error budget burn rate (1.0 = consuming the budget exactly on schedule)
        - record: slo:error_budget_burn_rate:5m
          expr: |
            (1 - sli:availability:rate5m) / (1 - 0.999)
        - record: slo:error_budget_burn_rate:30m
          expr: |
            (1 - sli:availability:rate30m) / (1 - 0.999)
        # Needed by the 6h warning alert below
        - record: slo:error_budget_burn_rate:6h
          expr: |
            (1 - sli:availability:rate6h) / (1 - 0.999)
    - name: slo.alerts
      rules:
        # CRITICAL: fast burn (at 6x, a 30-day budget lasts only ~5 days)
        - alert: SLOErrorBudgetBurnRateCritical
          expr: |
            slo:error_budget_burn_rate:5m > 6 and
            slo:error_budget_burn_rate:30m > 6
          for: 2m
          labels:
            severity: critical
            team: platform
            runbook: "https://runbooks.company.com/slo-burn-rate"
          annotations:
            summary: "🚨 CRITICAL: Error budget burning too fast"
            description: |
              Service {{ $labels.service }} is burning error budget at {{ $value | humanize }}x
              the normal rate. At this rate, the monthly error budget will be exhausted in about 5 days.
              Current availability: {{ with query "sli:availability:rate5m" }}{{ . | first | value | humanizePercentage }}{{ end }}
              SLO: 99.9%
        # WARNING: medium burn (at 3x, a 30-day budget lasts ~10 days)
        - alert: SLOErrorBudgetBurnRateHigh
          expr: |
            slo:error_budget_burn_rate:5m > 3 and
            slo:error_budget_burn_rate:6h > 3
          for: 15m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "⚠️ WARNING: Error budget burning faster than usual"
            description: |
              Service {{ $labels.service }} is burning error budget at {{ $value | humanize }}x
              the normal rate. At this rate, the monthly error budget will be exhausted in about 10 days.
        # Latency SLO breach
        - alert: SLOLatencyBreach
          expr: sli:latency:p99:rate5m > 0.5
          for: 2m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Latency SLO breach detected"
            description: |
              P99 latency is {{ $value }}s, exceeding the SLO of 500ms.
              This degrades user experience.
        # Error budget exhaustion trend
        - alert: SLOErrorBudgetExhaustionRisk
          expr: |
            predict_linear(slo:error_budget_burn_rate:30m[6h], 7*24*3600) > 1
          for: 5m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Error budget at risk of exhaustion"
            description: |
              The burn-rate trend over the last 6 hours points to sustained budget
              consumption within the next 7 days. Consider reducing deployment
              frequency or improving reliability.
```

```bash
kubectl apply -f error-budget-alerts.yaml
```
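Once applied, it is worth confirming that Prometheus is actually evaluating the new recording rules. A quick check (the pod name is assumed from a default kube-prometheus-stack install; promtool needs the server URL as its first argument):

```bash
# Query one of the recorded series directly inside the Prometheus pod
kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
  promtool query instant http://localhost:9090 'slo:error_budget_burn_rate:5m'
```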
🔔 Step 3: Intelligent AlertManager Configuration
Configure AlertManager for intelligent routing:
alertmanager-config.yaml:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-alertmanager
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |   # the operator expects this exact key name
    global:
      slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'  # 🔄 replace with your webhook
      resolve_timeout: 5m
    # Templates for richer messages
    templates:
      - '/etc/alertmanager/templates/*.tmpl'
    route:
      group_by: ['alertname', 'team']
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'default-receiver'
      routes:
        # Critical alerts -> immediate + escalation
        - match:
            severity: critical
          receiver: 'critical-alerts'
          group_wait: 0s
          repeat_interval: 5m
        # Warning alerts -> regular Slack channel
        - match:
            severity: warning
          receiver: 'warning-alerts'
          group_wait: 2m
          repeat_interval: 1h
        # Info alerts -> business hours only
        - match:
            severity: info
          receiver: 'info-alerts'
          active_time_intervals:
            - business-hours
    # Time intervals
    time_intervals:
      - name: business-hours
        time_intervals:
          - times:
              - start_time: '09:00'
                end_time: '18:00'
            weekdays: ['monday:friday']
    receivers:
      - name: 'default-receiver'
        slack_configs:
          - channel: '#devops-alerts'
            title: '🔔 {{ .GroupLabels.alertname }}'
            text: |
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Details:* {{ .Annotations.description }}
              *Severity:* {{ .Labels.severity }}
              {{ end }}
      - name: 'critical-alerts'
        slack_configs:
          - channel: '#incidents'
            title: '🚨 CRITICAL ALERT 🚨'
            text: |
              @channel IMMEDIATE ATTENTION REQUIRED
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              *Runbook:* {{ .Labels.runbook }}
              *Team:* {{ .Labels.team }}
              {{ end }}
            color: 'danger'
        # Also send email for critical alerts
        email_configs:
          - to: 'oncall@company.com'
            subject: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
            body: |
              CRITICAL alert fired:
              {{ range .Alerts }}
              {{ .Annotations.description }}
              {{ end }}
      - name: 'warning-alerts'
        slack_configs:
          - channel: '#monitoring'
            title: '⚠️ Warning Alert'
            text: |
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Details:* {{ .Annotations.description }}
              {{ end }}
            color: 'warning'
      - name: 'info-alerts'
        slack_configs:
          - channel: '#monitoring-info'
            title: '📊 Info Alert'
            text: |
              {{ range .Alerts }}
              {{ .Annotations.summary }}
              {{ end }}
            color: 'good'
    # Inhibition rules - suppress redundant alerts
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'namespace', 'pod']
      - source_match:
          alertname: 'SLOErrorBudgetBurnRateCritical'
        target_match:
          alertname: 'SLOErrorBudgetBurnRateHigh'
        equal: ['service']
```

```bash
kubectl apply -f alertmanager-config.yaml

# Restart AlertManager to pick up the new config
kubectl rollout restart statefulset/alertmanager-kube-prometheus-alertmanager -n monitoring
```
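A quick sanity check, assuming the default pod name from kube-prometheus-stack, is to ask AlertManager to print the configuration it actually loaded (amtool ships inside the official image):

```bash
kubectl exec -n monitoring alertmanager-kube-prometheus-alertmanager-0 -- \
  amtool config show --alertmanager.url=http://localhost:9093
```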
📱 Step 4: Slack/Discord Integration
Create a Slack webhook:
- Go to your Slack → Apps → Incoming Webhooks
- Add to Slack → select a channel → Add Incoming WebHooks Integration
- Copy the webhook URL:
https://hooks.slack.com/services/...
Configure the webhook in AlertManager:

```bash
# Edit the secret with your real webhook
kubectl edit secret alertmanager-kube-prometheus-alertmanager -n monitoring
# Replace YOUR_SLACK_WEBHOOK_URL with your real URL
```
Discord alternative:

```yaml
# Discord does not accept AlertManager's native webhook payload directly.
# The simplest route is Discord's Slack-compatible endpoint: append /slack
# to the Discord webhook URL and reuse slack_configs.
slack_configs:
  - api_url: 'YOUR_DISCORD_WEBHOOK_URL/slack'
    channel: '#alerts'   # required by the schema; Discord ignores it
    send_resolved: true
```
🎮 Step 5: Runbooks and Incident Response
Create the runbook structure:
runbooks-configmap.yaml:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: runbooks
  namespace: mi-app
data:
  slo-burn-rate.md: |
    # SLO Error Budget Burn Rate - Incident Response

    ## Immediate Actions (< 5 minutes)
    1. **Assess impact**: check the current error rate and affected users

           # Current error rate
           kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
             promtool query instant http://localhost:9090 \
             'rate(app_requests_total{status!~"2.."}[5m])'

           # Affected pods
           kubectl get pods -n mi-app --field-selector=status.phase!=Running

    2. **Check recent changes**:

           # Recent deployments
           kubectl rollout history deployment/mi-app -n mi-app

           # Recent events
           kubectl get events -n mi-app --sort-by=.metadata.creationTimestamp

    3. **Quick mitigation**:

           # If a recent deployment caused it, roll back immediately
           kubectl rollout undo deployment/mi-app -n mi-app

           # If it's a traffic spike, scale up
           kubectl scale deployment mi-app --replicas=10 -n mi-app

    ## Investigation (< 15 minutes)
    - **Check dependencies**: database, external APIs, network
    - **Resource analysis**: CPU, memory, network saturation
    - **Log analysis**:

          kubectl logs -n mi-app -l app=mi-app --tail=100 | grep ERROR

    ## Long-term Fix
    - **Root cause analysis**: complete an RCA document
    - **Preventive measures**: add missing alerts, improve tests
    - **Post-mortem**: schedule a blameless post-mortem

  high-latency.md: |
    # High Latency - Incident Response

    ## Immediate Checks
    - **Database performance**: slow queries, connection pool
    - **External dependencies**: API response times, timeouts
    - **Resource saturation**: CPU throttling, memory pressure

    ## Quick Fixes

        # Scale the application
        kubectl scale deployment mi-app --replicas=5 -n mi-app

        # Check resource limits
        kubectl describe pod -n mi-app -l app=mi-app | grep -A 5 Limits

        # Restart if a memory leak is suspected
        kubectl rollout restart deployment/mi-app -n mi-app

  app-down.md: |
    # Application Down - Critical Incident

    ## Emergency Response (< 2 minutes)
    1. **Check pod status**:

           kubectl get pods -n mi-app -o wide
           kubectl describe pod -n mi-app -l app=mi-app

    2. **Check Service/Ingress**:

           kubectl get svc,ingress -n mi-app
           curl -I http://localhost:30080/health

    3. **Immediate recovery**:

           # Force-restart all pods
           kubectl rollout restart deployment/mi-app -n mi-app

           # If it persists, roll back to the last known good version
           kubectl rollout undo deployment/mi-app -n mi-app

    ## Escalation Path
    - < 5 min: platform team lead
    - < 15 min: engineering manager
    - < 30 min: VP Engineering
```

```bash
kubectl apply -f runbooks-configmap.yaml
```
🧪 Step 6: Chaos Testing - Simulating Incidents
Test 1: Error Budget Burn

```bash
# Script that generates a burst of errors
cat > chaos-errors.sh << 'EOF'
#!/bin/bash
echo "🔥 Generating errors to test the SLO alerts..."

# Generate traffic that produces plenty of 500s
for i in {1..200}; do
  # The /api/users endpoint fails with ~10% probability,
  # so 200 requests should yield ~20 errors
  curl -s http://localhost:30080/api/users > /dev/null &
done
wait

echo "✅ Chaos test finished. Check the alerts in 2-3 minutes."
EOF

chmod +x chaos-errors.sh
./chaos-errors.sh
```

What to watch:
- The error rate spiking in Grafana
- The SLO burn-rate alert firing
- The Slack notification arriving
- The alert state in Prometheus itself, as in the sketch below
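You can confirm the alert state straight from Prometheus without waiting for Slack. ALERTS is a built-in metric exposing every firing/pending alert; the pod name below assumes a default kube-prometheus-stack install:

```bash
# List the SLO alerts currently firing
kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
  promtool query instant http://localhost:9090 \
  'ALERTS{alertname=~"SLO.*", alertstate="firing"}'
```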
Test 2: Latency SLO Breach

```bash
# Stress test that drives latency up
cat > chaos-latency.sh << 'EOF'
#!/bin/bash
echo "⏰ Generating high latency..."

# Hammer the endpoint to build up latency
for round in {1..5}; do
  echo "Round $round/5"
  for i in {1..50}; do
    curl -s http://localhost:30080/api/users > /dev/null &
  done
  sleep 2
done
wait

echo "✅ Latency chaos test finished."
EOF

chmod +x chaos-latency.sh
./chaos-latency.sh
```
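To see whether the burst actually pushed you over the latency SLO, query the recorded P99 directly (same assumed pod name as before):

```bash
# P99 latency as recorded by the rule from Step 2; > 0.5 means breach
kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
  promtool query instant http://localhost:9090 'sli:latency:p99:rate5m'
```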
Test 3: Complete App Down

```bash
# Simulate the app being completely down
kubectl scale deployment mi-app --replicas=0 -n mi-app

# Wait 2 minutes for the alerts to fire
sleep 120

# Restore
kubectl scale deployment mi-app --replicas=3 -n mi-app
```
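After the restore, you can check what Prometheus saw during the outage window. A sketch, assuming your targets carry the usual namespace label:

```bash
# Did Prometheus see the outage? If the app's targets were scraped as down,
# 'up' dipped to 0; if the pods vanished entirely, the series may simply
# have gone stale, which is itself the signal.
kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
  promtool query instant http://localhost:9090 \
  'min_over_time(up{namespace="mi-app"}[15m])'
```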
📊 Step 7: Executive SLO Dashboard
An SLO dashboard for management:
Create a Grafana dashboard with these panels:

Panel 1: SLO Compliance (current month)
```
sli:availability:rate30m * 100
```
- Type: Stat
- Title: "Current Availability"
- Thresholds: red < 99.9%, yellow < 99.95%, green >= 99.95%

Panel 2: Error Budget Remaining
```
(1 - (1 - sli:availability:rate30m) / (1 - 0.999)) * 100
```
- Type: Gauge
- Title: "Error Budget Remaining %"
- Max: 100

Panel 3: Error Budget Burn Rate
```
slo:error_budget_burn_rate:5m
```
- Type: Time series
- Title: "Error Budget Burn Rate (1.0 = normal)"

Panel 4: SLO History (last 7 days)
```
avg_over_time(sli:availability:rate30m[7d]) * 100
```

Panel 5: MTTR (Mean Time To Recovery)
True MTTR needs incident start/end timestamps, which live in your incident tooling rather than in Prometheus. A rough PromQL proxy is the fraction of the week spent with a critical alert firing:
```
avg_over_time((sum(ALERTS{alertstate="firing", severity="critical"}) or vector(0))[7d:])
```
🚨 Step 8: Escalation and On-Call Rotation
PagerDuty integration (optional):

```yaml
# In the AlertManager config (service_key is for the Events API v1 integration)
pagerduty_configs:
  - severity: '{{ .CommonLabels.severity }}'
    service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
    description: '{{ .CommonAnnotations.summary }}'
    details:
      firing: '{{ .Alerts.Firing | len }}'
      resolved: '{{ .Alerts.Resolved | len }}'
```
On-call schedule:
oncall-schedule.yaml:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: oncall-schedule
  namespace: monitoring
data:
  schedule.json: |
    {
      "rotation": "weekly",
      "timezone": "America/Argentina/Buenos_Aires",
      "schedule": [
        { "week": 1, "primary": "juan@company.com", "secondary": "maria@company.com" },
        { "week": 2, "primary": "carlos@company.com", "secondary": "ana@company.com" }
      ],
      "escalation": [
        { "level": 1, "delay": "5m",  "target": "primary" },
        { "level": 2, "delay": "15m", "target": "secondary" },
        { "level": 3, "delay": "30m", "target": "manager@company.com" }
      ]
    }
```

Note that AlertManager does not read this ConfigMap; it documents the rotation for whatever tooling (or script) drives your paging.
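A hypothetical consumer, just to make the format concrete: a script that resolves the current primary for a two-week rotation (jq and the ConfigMap name above are assumptions):

```bash
# Resolve this week's primary on-call from the ConfigMap (hypothetical helper;
# 10# forces decimal so ISO week numbers like "08" don't parse as octal)
kubectl get configmap oncall-schedule -n monitoring \
  -o jsonpath='{.data.schedule\.json}' \
  | jq -r --argjson week $(( (10#$(date +%V) % 2) + 1 )) \
      '.schedule[] | select(.week == $week) | .primary'
```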
📚 Alerting Best Practices
🎯 Alert design principles:
- Actionable: every alert must have a clear action
- Meaningful: only alert on what truly matters
- Contextual: include the information needed to resolve it
- Escalated: Critical → Warning → Info
- Time-bound: resolve automatically when appropriate
❌ Anti-patterns to avoid:

```yaml
# BAD: alert on every individual error
- alert: SingleRequestFailed
  expr: app_requests_total{status="500"} > 0

# GOOD: alert on a sustained error rate
- alert: HighErrorRate
  expr: rate(app_requests_total{status="500"}[5m]) > 0.1
  for: 2m
```
📝 Alert annotation template:

```yaml
annotations:
  summary: "Brief, clear description of the problem"
  description: |
    Detailed explanation including:
    - What is happening
    - Why it matters
    - What to do about it
    - Links to runbooks/dashboards
  runbook_url: "https://runbooks.company.com/{{ $labels.alertname }}"
  dashboard_url: "http://grafana.company.com/d/app-dashboard"
```
🛠️ Troubleshooting Alerting
❌ "Alerts don't reach Slack"

```bash
# 1. Check the AlertManager logs
kubectl logs -n monitoring alertmanager-kube-prometheus-alertmanager-0

# 2. Test the webhook manually
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"Test from AlertManager"}' \
  YOUR_SLACK_WEBHOOK_URL

# 3. Check routing in the AlertManager UI
# http://localhost:30002
```

❌ "Alerts fire constantly"

```bash
# Review thresholds and durations
kubectl get prometheusrule -n mi-app -o yaml

# Check for flapping metrics
kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
  promtool query instant http://localhost:9090 'YOUR_ALERT_EXPRESSION'
```
🧠 Daily Review

| Concept | Done? | Notes |
|---|---|---|
| SLIs and SLOs defined | ✔️ / ❌ | |
| Error budget alerting working | ✔️ / ❌ | |
| AlertManager with intelligent routing | ✔️ / ❌ | |
| Slack/Discord integration | ✔️ / ❌ | |
| Runbooks documented | ✔️ / ❌ | |
| Chaos testing executed | ✔️ / ❌ | |
🎯 Pro Challenges
Challenge 1: Multi-Service SLOs
Define SLOs for a full stack (frontend + backend + database).
Challenge 2: Business KPI Alerts
Create alerts based on business metrics (revenue, signups, etc.).
Challenge 3: Predictive Alerting
Use predict_linear() to alert BEFORE problems happen, as in the sketch below.
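A classic starting point (the metric comes from the kubelet; the threshold and windows are assumptions to tune): fire when the linear trend of free volume space crosses zero within 4 hours.

```yaml
# Sketch: page before a volume fills up, not after
- alert: VolumeFullIn4Hours
  expr: |
    predict_linear(kubelet_volume_stats_available_bytes[1h], 4 * 3600) < 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Volume {{ $labels.persistentvolumeclaim }} will fill within ~4h at the current trend"
```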
💡 SRE Pro Tips
🎯 SLO strategy:
- Start conservative: an attainable 99.9% beats an unattainable 99.99%
- Measure what users experience: user-facing metrics
- The error budget is a feature: it buys room for deployments and risk
🚨 Alert fatigue prevention:
- Alert on symptoms, not causes: error rate, not disk space
- Use inhibition rules: don't duplicate alerts
- Time-based routing: info alerts only during business hours
📊 Dashboards per audience:
- Executive: SLO compliance, business impact
- Engineering: technical metrics, troubleshooting
- On-call: current incidents, recent changes
🚀 What's Next?
Days 55-56: Voting App Project with Full Observability
- A real multi-component application
- End-to-end SLOs
- Distributed-tracing observability
- A complete incident-response demo
Keep the setup running:

```bash
# This whole stack is the foundation for the final project
kubectl get all -A | grep -E "(prometheus|grafana|alertmanager)"
```
🎉 SRE Level Unlocked!
Today you closed the loop on professional observability:
✅ SLIs/SLOs/SLAs - like Google/Netflix/Uber
✅ Error budget methodology - data-driven reliability
✅ Intelligent alerting - without alert fatigue
✅ Incident response - runbooks and escalation
✅ Chaos engineering - proactive reliability testing
✅ Multi-audience dashboards - executive + engineering
Skill level: from "basic monitoring" to "Site Reliability Engineer" 📈
This puts you well ahead of most engineers when it comes to enterprise observability.
📸 Share your SLO dashboards and smart alerts: #DevOpsConRoxs - Day 54
Tomorrow we build the final project that will be your portfolio showcase! 🎮🚀