kaisawind's blog

Batch-Deleting Evicted Pods in Kubernetes

Note: Kubernetes evolves quickly, so consult the official documentation for the current Pod lifecycle.

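Background

An Evicted Pod is a Pod that Kubernetes terminated because its node ran short of a resource such as disk space or memory. These Pods are not removed right away: they remain in the cluster as Failed Pod objects, clutter the output of kubectl get pods, and make the cluster harder to manage until they are deleted.

List Evicted Pods: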

kubectl get pods --all-namespaces | grep Evicted
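
Before deleting anything, it can help to see how the evictions are distributed. The following is a minimal sketch rather than part of the original post: count_evicted_by_namespace is a hypothetical helper name, and it assumes the default column order of kubectl get pods --all-namespaces (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods --all-namespaces` output on stdin
# and prints "<namespace> <count>" for Pods whose STATUS column is Evicted.
# Assumes the default column order: NAMESPACE NAME READY STATUS RESTARTS AGE.
count_evicted_by_namespace() {
  awk '$4 == "Evicted" { n[$1]++ }
       END { for (ns in n) print ns, n[ns] }' | sort
}
```

Usage: kubectl get pods --all-namespaces | count_evicted_by_namespace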

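Solutions

Method 1: Batch delete in a specific namespace

# Set the namespace
ns=default

# Delete Evicted Pods (-r: GNU xargs skips the delete when nothing matches)
kubectl get pods -n "${ns}" | grep Evicted | awk '{print $1}' | xargs -r kubectl delete pod -n "${ns}"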

Method 2: Delete Evicted Pods in all namespaces

# Delete Evicted Pods in all namespaces (jq -r emits raw strings that sh can execute)
kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.status.reason=="Evicted") | "kubectl delete pod \(.metadata.name) -n \(.metadata.namespace)"' | sh

# Or use grep and awk
kubectl get pods --all-namespaces | grep Evicted | awk '{print "kubectl delete pod " $2 " -n " $1}' | sh
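
Piping generated commands straight into sh leaves no chance to review them. As a hedged sketch, the command generation can be separated from the execution; evicted_delete_commands is a hypothetical helper name, and it assumes the default column order of kubectl get pods --all-namespaces.

```shell
# Hypothetical helper: reads `kubectl get pods --all-namespaces` output on stdin
# and prints one delete command per Evicted Pod without running anything.
# Assumes the default column order: NAMESPACE NAME READY STATUS RESTARTS AGE.
evicted_delete_commands() {
  awk '$4 == "Evicted" { print "kubectl delete pod " $2 " -n " $1 }'
}
```

Run kubectl get pods --all-namespaces | evicted_delete_commands to inspect the list, then append | sh once it looks right.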

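Method 3: Use a field selector

# Delete Failed Pods in one namespace with a field selector
# (this matches every Failed Pod, not only Evicted ones)
kubectl delete pod --field-selector=status.phase=Failed -n <namespace>

# Delete all Failed Pods in every namespace
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed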

Method 4: Use a shell alias

# Create the alias
alias kdel-evicted='kubectl get pods --all-namespaces | grep Evicted | awk '\''{print "kubectl delete pod " $2 " -n " $1}'\'' | sh'

# Use the alias
kdel-evicted

Preventing Pod Eviction

1. Resource configuration

Set reasonable resource requests and limits for each Pod:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:latest
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"

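2. Node resource reservation

Make sure nodes have enough available resources:

# Show a node's capacity, allocated resources, and conditions
kubectl describe node <node-name>

# Show current node CPU and memory usage (requires metrics-server)
kubectl top node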
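
Node-pressure eviction is tied to the pressure conditions a node reports (MemoryPressure, DiskPressure, PIDPressure). The sketch below filters a list of nodes and their active conditions down to the ones under pressure. The name nodes_under_pressure and the one-line-per-node input format are assumptions made for this example.

```shell
# Hypothetical helper: reads lines of "<node-name> <condition> <condition> ..."
# (the condition types whose status is True) and prints the nodes that report
# MemoryPressure, DiskPressure, or PIDPressure, followed by those conditions.
nodes_under_pressure() {
  awk '{
    hits = ""
    for (i = 2; i <= NF; i++)
      if ($i ~ /^(Memory|Disk|PID)Pressure$/) hits = hits " " $i
    if (hits != "") print $1 hits
  }'
}
```

One way to produce that input, assuming kubectl's JSONPath output supports this filter as expected:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{range .status.conditions[?(@.status=="True")]}{" "}{.type}{end}{"\n"}{end}' | nodes_under_pressure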

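3. Configure Pod priority

Give critical applications a higher priority (the referenced PriorityClass must already exist in the cluster):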

spec:
  priorityClassName: high-priority
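
The priorityClassName above only takes effect if a PriorityClass with that name exists. As a rough sketch, a small shell function can emit such a manifest; priority_class_manifest is a hypothetical helper, and the value 1000000 in the usage line is an illustrative number, not one taken from the original post.

```shell
# Hypothetical helper: prints a PriorityClass manifest to stdout.
# $1 = class name, $2 = integer priority value (both chosen by the caller).
priority_class_manifest() {
  cat <<EOF
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: $1
value: $2
globalDefault: false
description: "Priority class for critical workloads"
EOF
}
```

Usage: priority_class_manifest high-priority 1000000 | kubectl apply -f -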

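4. Configure Pod topology spread

Spread replicas across nodes so that pressure on a single node does not evict all of them:

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:          # assumption: the Pods carry the label app=my-app
      matchLabels:
        app: my-app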

Automated Cleanup

Periodic cleanup with a CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: clean-evicted-pods
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # run every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-cleaner
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - kubectl delete pods --all-namespaces --field-selector=status.phase=Failed
          restartPolicy: OnFailure
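
The CronJob refers to a pod-cleaner ServiceAccount, which needs permission to list and delete Pods across namespaces. The following is a sketch of one way to create it. The function name, the ClusterRole name, and the chosen verbs are assumptions for this example, so adjust them to your cluster's policy.

```shell
# Hypothetical helper: creates the ServiceAccount and RBAC objects that the
# CronJob's serviceAccountName refers to. KUBECTL can be overridden
# (for example KUBECTL=echo) to preview the arguments instead of running them.
setup_pod_cleaner_rbac() {
  "${KUBECTL:-kubectl}" create serviceaccount pod-cleaner -n kube-system &&
  "${KUBECTL:-kubectl}" create clusterrole pod-cleaner \
    --verb=get,list,delete --resource=pods &&
  "${KUBECTL:-kubectl}" create clusterrolebinding pod-cleaner \
    --clusterrole=pod-cleaner --serviceaccount=kube-system:pod-cleaner
}
```

Running KUBECTL=echo setup_pod_cleaner_rbac prints the three argument lists for review. Calling setup_pod_cleaner_rbac without the override applies them.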

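Script automation

#!/bin/bash
# clean-evicted-pods.sh

echo "Cleaning Evicted Pods..."
# jq -r emits raw command strings; select() keeps only Pods whose reason is Evicted
kubectl get pods --all-namespaces --field-selector=status.phase=Failed -o json | \
  jq -r '.items[] | select(.status.reason=="Evicted") | "kubectl delete pod \(.metadata.name) -n \(.metadata.namespace)"' | \
  sh

echo "Done!"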

Troubleshooting Eviction Causes

# Show Pod details
kubectl describe pod <pod-name> -n <namespace>

# Show Pod events
kubectl get events --sort-by='.lastTimestamp' -n <namespace>

# Show the node's allocated resources
kubectl describe node <node-name> | grep -A 5 -B 5 "Allocated resources"

# Show cluster-wide resource usage (requires metrics-server)
kubectl top pods --all-namespaces
kubectl top nodes
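
When many Pods were evicted, tallying the status messages shows which resource ran short most often. This is a sketch under the assumption that the messages contain wording like "low on resource: memory". The helper name count_eviction_causes is hypothetical, and the message format may differ between Kubernetes versions.

```shell
# Hypothetical helper: reads Pod status messages on stdin (one per line) and
# counts how often each resource appears in text of the form
# "... low on resource: <name>" (assumed message wording).
count_eviction_causes() {
  awk 'match($0, /low on resource: [a-z-]+/) {
         # Skip the 17-character prefix "low on resource: " to keep the name.
         cause = substr($0, RSTART + 17, RLENGTH - 17)
         n[cause]++
       }
       END { for (c in n) print c, n[c] }' | sort
}
```

Usage: kubectl get pods --all-namespaces --field-selector=status.phase=Failed -o custom-columns=MESSAGE:.status.message --no-headers | count_eviction_causes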

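Best Practices

Monitor resource usage: track utilization with Prometheus.
Set up alerts: send an alert when resource utilization runs too high.
Clean up regularly: use a CronJob to delete Failed Pods periodically.
Requests and limits: set reasonable resource requests and limits for Pods.
Optimize scheduling: configure Pod priority and topology spread.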

Batch-Deleting Evicted Pods in Kubernetes - Mon, Jan 23, 2023
