I have been learning Kubernetes recently, and it is more than an order of magnitude more complex than Swarm. The official documentation is not detailed enough and the logs are not very clear, which makes troubleshooting difficult. The bar to entry for cloud native is evidently quite high.
While setting up a lab environment, I ran into two failures:

- the kubernetes-dashboard pod kept crashing with CrashLoopBackOff
- the controller-manager-master pod kept crashing with CrashLoopBackOff
A round of Googling showed that both are common ailments, yet none of the answers I found actually fixed them. It took a long time to track down the solutions, so I am writing them down here before I forget.
Issue 1: kubernetes-dashboard pod keeps crashing with "CrashLoopBackOff"
When running kubeadm init, I specified the pod IP range with --pod-network-cidr=10.9.0.0/16:
[root@master ~]# kubeadm init --kubernetes-version=1.22.2 \
> --apiserver-advertise-address=192.168.200.10 \
> --image-repository registry.aliyuncs.com/google_containers \
> --service-cidr=10.8.0.0/16 \
> --pod-network-cidr=10.9.0.0/16
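(A quick sanity check I only learned later: kubeadm records these values in the kubeadm-config ConfigMap, so you can always re-check what the cluster thinks its CIDRs are. Output abbreviated to the lines that matter; with the flags above it should print something like:)

[root@master ~]# kubectl get cm kubeadm-config -n kube-system -o yaml | grep -i subnet
      podSubnet: 10.9.0.0/16
      serviceSubnet: 10.8.0.0/16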
Then, without studying flannel's configuration file, I installed flannel with the exact command from the official documentation (which is how I fell into the pit). The installation went smoothly, with no errors:
[root@master ~]# kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds created
[root@master ~]# ip a | grep flannel
12: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    inet 10.9.0.0/32 brd 10.9.0.0 scope global flannel.1
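In hindsight, the clue was already visible here: flannel writes its effective configuration to /run/flannel/subnet.env on every node, and FLANNEL_NETWORK (taken from flannel's own config file) did not match the per-node subnet carved out of my --pod-network-cidr. I didn't capture this at the time, but on the master it would have looked roughly like this, with the two values visibly disagreeing:

[root@master ~]# cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.9.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true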
All nodes were in Ready state, and all pods in the kube-system namespace were running normally:
[root@master ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
master Ready control-plane,master 19m v1.22.2
node1 Ready <none> 5m51s v1.22.2
node2 Ready <none> 3m48s v1.22.2
[root@master ~]# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-7f6cbbb7b8-hqcxw 1/1 Running 0 47m
coredns-7f6cbbb7b8-sgmrq 1/1 Running 0 47m
etcd-master 1/1 Running 0 47m
kube-apiserver-master 1/1 Running 0 48m
kube-controller-manager-master 1/1 Running 0 47m
kube-flannel-ds-5n7l5 1/1 Running 0 37m
kube-flannel-ds-h6j5t 1/1 Running 0 33m
kube-flannel-ds-qfpn4 1/1 Running 0 31m
kube-proxy-98s2c 1/1 Running 0 47m
kube-proxy-h4bzm 1/1 Running 0 33m
kube-proxy-kk8hw 1/1 Running 0 31m
kube-scheduler-master 1/1 Running 0 47m
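Everything looks healthy. One more check worth doing at this stage (I didn't at the time) is to list the per-node pod CIDRs that the controller-manager allocates from --pod-network-cidr. The values below are what I would expect for 10.9.0.0/16, not a captured transcript:

[root@master ~]# kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR
NAME     PODCIDR
master   10.9.0.0/24
node1    10.9.1.0/24
node2    10.9.2.0/24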
But after installing kubernetes-dashboard, its pod kept restarting, with status CrashLoopBackOff:
[root@master ~]# kubectl get pods,services -n kubernetes-dashboard
NAME READY STATUS RESTARTS AGE
pod/dashboard-metrics-scraper-856586f554-zrcch 1/1 Running 0 4m22s
pod/kubernetes-dashboard-7b9b87bb74-mdrrb 0/1 CrashLoopBackOff 4 (12s ago) 4m22s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/dashboard-metrics-scraper ClusterIP 10.8.231.74 <none> 8000/TCP 4m26s
service/kubernetes-dashboard ClusterIP 10.8.153.32 <none> 443/TCP 4m26s
kubectl describe pod turned up nothing of value, but kubectl logs showed that kubernetes-dashboard could not connect to the api-server:
[root@master ~]# kubectl logs kubernetes-dashboard-7b9b87bb74-mdrrb -n kubernetes-dashboard
2021/10/10 01:28:35 Using namespace: kubernetes-dashboard
2021/10/10 01:28:35 Using in-cluster config to connect to apiserver
2021/10/10 01:28:35 Using secret token for csrf signing
2021/10/10 01:28:35 Initializing csrf token from kubernetes-dashboard-csrf secret
2021/10/10 01:28:35 Starting overwatch
panic: Get "https://10.8.0.1:443/api/v1/namespaces/kubernetes-dashboard/secrets/kubernetes-dashboard-csrf": dial tcp 10.8.0.1:443: i/o timeout
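10.8.0.1 is the first address of --service-cidr=10.8.0.0/16, i.e. the ClusterIP of the kubernetes Service that fronts the API Server. To confirm this was a cluster-networking problem rather than a dashboard bug, you can run a throwaway pod and hit that address directly (the image choice is just an example); on my broken cluster this would have timed out the same way:

[root@master ~]# kubectl run nettest --rm -it --restart=Never --image=curlimages/curl \
    --command -- curl -k -m 5 https://10.8.0.1/version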
So the network was broken. Since Swarm had previously been installed on these machines, I worried about a conflict on VXLAN port 4789 and uninstalled Swarm, but that did not help.
After much fiddling, I finally noticed that flannel's configuration file defines the pod IP range as 10.244.0.0/16, while I had specified a different range, --pod-network-cidr=10.9.0.0/16, at kubeadm init. Hence the broken network: pods could not reach the API Server.
net-conf.json: |
  {
    "Network": "10.244.0.0/16",
    "Backend": {
      "Type": "vxlan"
    }
  }
One fix is to redo kubeadm init; another is to uninstall flannel and switch to a different network add-on. The simplest fix of all, though, is to edit flannel's configuration and change the 10.244.0.0/16 above to the 10.9.0.0/16 I am actually using:
[root@master ~]# kubectl edit cm -n kube-system kube-flannel-cfg
configmap/kube-flannel-cfg edited
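After the edit, the net-conf.json section should read:

net-conf.json: |
  {
    "Network": "10.9.0.0/16",
    "Backend": {
      "Type": "vxlan"
    }
  }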
Then delete the flannel pods and let Kubernetes recreate them automatically, which makes the new configuration take effect:
[root@master ~]# kubectl delete pod -n kube-system -l app=flannel
pod "kube-flannel-ds-5n7l5" deleted
pod "kube-flannel-ds-h6j5t" deleted
pod "kube-flannel-ds-qfpn4" deleted
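(With a kubectl this recent, an equivalent and slightly tidier way to bounce the DaemonSet should be kubectl rollout restart; I used delete-and-recreate above.)

[root@master ~]# kubectl rollout restart daemonset kube-flannel-ds -n kube-system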
Next, delete the kubernetes-dashboard pod; the automatically recreated pod runs normally. Finally, configure a NodePort (one way to do that is sketched after the listing below) and the Dashboard GUI opens:
[root@master ~]# kubectl get pods -n kubernetes-dashboard
NAME READY STATUS RESTARTS AGE
dashboard-metrics-scraper-856586f554-8bg48   1/1     Running   0          7s
kubernetes-dashboard-7b9b87bb74-jsdl4        1/1     Running   0          7s
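For completeness, one way to expose the Dashboard (not necessarily the best) is to patch the Service type:

[root@master ~]# kubectl patch svc kubernetes-dashboard -n kubernetes-dashboard \
    -p '{"spec": {"type": "NodePort"}}'
service/kubernetes-dashboard patched

The assigned port (picked from 30000-32767 unless you set it explicitly) shows up in kubectl get svc -n kubernetes-dashboard, and the GUI is then reachable at https://<node-ip>:<that-port>.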
Issue 2: controller-manager-master pod keeps crashing with "CrashLoopBackOff"
After fixing kubernetes-dashboard, I found that controller-manager-master had also gone into CrashLoopBackOff. My guess is this was caused by restarting Docker when removing Swarm.
[root@master ~]# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
...(Snip)
kube-controller-manager-master 0/1 CrashLoopBackOff 15 (90s ago) 13h
...(Snip)
Checking the log revealed that a port on the loopback address was already in use:
[root@master ~]# kubectl logs kube-controller-manager-master -n kube-system
Flag --port has been deprecated, This flag has no effect now and will be removed in v1.24.
I1010 02:15:57.910606 1 serving.go:347] Generated self-signed cert in-memory
failed to create listener: failed to listen on 127.0.0.1:10257: listen tcp 127.0.0.1:10257: bind: address already in use
The port was still held by the kube-controller process from before the Docker restart; it was never released:
[root@master ~]# netstat -nap | grep 10257
tcp 0 0 127.0.0.1:10257 0.0.0.0:* LISTEN 5686/kube-controlle
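(netstat is deprecated on many newer distributions; ss from iproute2 answers the same question. Output reconstructed, not captured at the time:)

[root@master ~]# ss -ltnp | grep 10257
LISTEN  0  4096  127.0.0.1:10257  0.0.0.0:*  users:(("kube-controller",pid=5686,fd=7))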
Kill the stale process, then delete the kube-controller-manager-master pod:
[root@master ~]# kill -9 5686
[root@master ~]# kubectl delete pod kube-controller-manager-master -n kube-system
pod "kube-controller-manager-master" deleted
After the pod is recreated (the control-plane components are static pods, so the kubelet rebuilds them on its own), wait a minute or two and it returns to normal:
[root@master kubeblog]# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
...(Snip)
kube-controller-manager-master 0/1 Running 19 (5m14s ago) 59s
...(Snip)