On a TiDB cluster that had been running normally, tidb failed to start after the hosts were rebooted, reporting "Err: connection error". Investigation showed that firewalld had been enabled to start on boot, so communication on the cluster ports was blocked, which caused the startup failure.


0. ENV
TiDB v5.2.1

1. Error when restarting the cluster
After restarting the cluster, the tidb component failed to start: "failed to start: failed to start tidb: failed to start".
[root@tidb1 ~]# tiup cluster restart tidb-test
...
Starting component tidb
        Starting instance 192.168.80.141:4000

Error: failed to start: failed to start tidb: failed to start: 192.168.80.141 tidb-4000.service, please check the instance's log(/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s

Verbose debug logs has been written to /root/.tiup/logs/tiup-cluster-debug-2022-01-26-10-18-20.log.
Found cluster newer version:
        The latest version:         v1.8.2
        Local installed version:    v1.6.1
        Update current component:   tiup update cluster
        Update all components:      tiup update --all

Error: run `/root/.tiup/components/cluster/v1.6.1/tiup-cluster` (wd:/root/.tiup/data/SvaM07Y) failed: exit status 1

2. Problem analysis
1) Check the log of the failing component
Log in to the tidb host 192.168.80.141 and inspect tidb.log; it repeatedly reports "Err: connection error".
[tidb@tidb1 log]$ tail -f /tidb-deploy/tidb-4000/log/tidb.log
...
[2022/01/26 10:20:31.787 +08:00] [INFO] [grpclogger.go:69] ["Subchannel Connectivity change to TRANSIENT_FAILURE"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.787 +08:00] [INFO] [grpclogger.go:69] ["Channel Connectivity change to TRANSIENT_FAILURE"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.822 +08:00] [INFO] [grpclogger.go:69] ["Subchannel Connectivity change to CONNECTING"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.822 +08:00] [INFO] [grpclogger.go:69] ["Subchannel picks a new address \"192.168.80.139:20160\" to connect"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.822 +08:00] [INFO] [grpclogger.go:69] ["Channel Connectivity change to CONNECTING"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.823 +08:00] [WARN] [grpclogger.go:81] ["grpc: addrConn.createTransport failed to connect to {192.168.80.139:20160 <nil> 0 <nil>}. Err: connection error: desc = \"transport: Error while dialing dial tcp 172.18.33.139:20160: connect: no route to host\". Reconnecting..."] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.823 +08:00] [INFO] [grpclogger.go:69] ["Subchannel Connectivity change to TRANSIENT_FAILURE"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.823 +08:00] [INFO] [grpclogger.go:69] ["Channel Connectivity change to TRANSIENT_FAILURE"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.011 +08:00] [INFO] [grpclogger.go:69] ["Subchannel Connectivity change to CONNECTING"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.011 +08:00] [INFO] [grpclogger.go:69] ["Subchannel picks a new address \"192.168.80.138:20160\" to connect"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.012 +08:00] [INFO] [grpclogger.go:69] ["Channel Connectivity change to CONNECTING"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.012 +08:00] [WARN] [grpclogger.go:81] ["grpc: addrConn.createTransport failed to connect to {192.168.80.138:20160 <nil> 0 <nil>}. Err: connection error: desc = \"transport: Error while dialing dial tcp 172.18.33.138:20160: connect: no route to host\". Reconnecting..."] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.012 +08:00] [INFO] [grpclogger.go:69] ["Subchannel Connectivity change to TRANSIENT_FAILURE"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.012 +08:00] [INFO] [grpclogger.go:69] ["Channel Connectivity change to TRANSIENT_FAILURE"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.331 +08:00] [INFO] [grpclogger.go:69] ["Subchannel Connectivity change to CONNECTING"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.331 +08:00] [INFO] [grpclogger.go:69] ["Subchannel picks a new address \"192.168.80.140:20160\" to connect"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.332 +08:00] [INFO] [grpclogger.go:69] ["Channel Connectivity change to CONNECTING"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.893 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=192.168.80.138:20160] [forwardedHost=] [error="context deadline exceeded"]
[2022/01/26 10:20:32.893 +08:00] [INFO] [grpclogger.go:69] ["Channel Created"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.893 +08:00] [INFO] [grpclogger.go:69] ["parsed scheme: \"\""] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.893 +08:00] [INFO] [grpclogger.go:69] ["scheme \"\" not registered, fallback to default scheme"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.894 +08:00] [INFO] [grpclogger.go:69] ["ccResolverWrapper: sending update to cc: {[{192.168.80.138:20160 <nil> 0 <nil>}] <nil> <nil>}"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.894 +08:00] [INFO] [grpclogger.go:69] ["Resolver state updated: {Addresses:[{Addr:192.168.80.138:20160 ServerName: Attributes:<nil> Type:0 Metadata:<nil>}] ServiceConfig:<nil> Attributes:<nil>} (resolver returned new addresses)"] [system=grpc] [grpc_log=true]
2) Analysis
The tidb server cannot reach any TiKV node on port 20160 and fails with "no route to host". The initial judgment is that this is either a network/routing problem or a firewall problem; firewalld's default zone rejects blocked connections with ICMP host-prohibited, which clients report as "no route to host", so the firewall is a strong suspect.

3. Resolution
3.1 Network verification
Ping tests between all nodes succeed, so basic network connectivity is fine.
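Ping only exercises ICMP, so a TCP-level probe of the TiKV port narrows things down further. A minimal sketch, run from the tidb host 192.168.80.141 and assuming bash plus coreutils timeout are available; the IPs are this cluster's TiKV nodes:

# Try a raw TCP connection to each TiKV port. "No route to host" here while
# ping works is the classic signature of a firewall REJECT rule.
for ip in 192.168.80.138 192.168.80.139 192.168.80.140; do
  if timeout 3 bash -c "</dev/tcp/$ip/20160"; then
    echo "$ip:20160 reachable"
  else
    echo "$ip:20160 NOT reachable"
  fi
done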
3.2 Firewall verification
1) Check the firewall status
Checking each node shows that firewalld is running (and enabled at boot) on all of them, for example:
[root@tikv1 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2022-01-26 10:01:17 CST; 26min ago
     Docs: man:firewalld(1)
 Main PID: 9884 (firewalld)
   CGroup: /system.slice/firewalld.service
           └─9884 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid

Jan 26 10:01:16 tikv1 systemd[1]: Starting firewalld - dynamic firewall daemon...
Jan 26 10:01:17 tikv1 systemd[1]: Started firewalld - dynamic firewall daemon.
2) Stop the firewall
Stop firewalld:
[root@tikv1 ~]# systemctl stop firewalld
Disable it from starting at boot (both commands need to be repeated on every node, as sketched below):
[root@tikv1 ~]# systemctl disable firewalld
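Since every node needs the same treatment, the two commands can be pushed out in one pass. A rough sketch, assuming passwordless root SSH to all hosts; the IP list is this cluster's node list, adjust as needed:

# Stop firewalld now and keep it from coming back after the next reboot, on every node.
for host in 192.168.80.135 192.168.80.136 192.168.80.137 \
            192.168.80.138 192.168.80.139 192.168.80.140 192.168.80.141; do
  ssh root@$host "systemctl stop firewalld && systemctl disable firewalld"
done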
3) Check the firewall status again: inactive
[root@tikv1 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Wed 2022-01-26 10:28:30 CST; 55s ago
     Docs: man:firewalld(1)
  Process: 9884 ExecStart=/usr/sbin/firewalld --nofork --nopid $FIREWALLD_ARGS (code=exited, status=0/SUCCESS)
 Main PID: 9884 (code=exited, status=0/SUCCESS)

Jan 26 10:01:16 tikv1 systemd[1]: Starting firewalld - dynamic firewall daemon...
Jan 26 10:01:17 tikv1 systemd[1]: Started firewalld - dynamic firewall daemon.
Jan 26 10:28:30 tikv1 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jan 26 10:28:30 tikv1 systemd[1]: Stopped firewalld - dynamic firewall daemon.
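If policy does not allow disabling firewalld outright, an alternative is to leave it running and open the cluster ports instead. A sketch, assuming the default ports used by this deployment (per the display output later: 4000/10080 for tidb, 20160/20180 for tikv, 2379/2380 for pd, 9090/9093/9094/3000 for monitoring); run the lines matching each node's role, and note that agent ports such as node_exporter's may also be needed:

firewall-cmd --permanent --add-port=4000/tcp --add-port=10080/tcp       # tidb node
firewall-cmd --permanent --add-port=20160/tcp --add-port=20180/tcp      # tikv nodes
firewall-cmd --permanent --add-port=2379/tcp --add-port=2380/tcp        # pd nodes
firewall-cmd --permanent --add-port=9090/tcp --add-port=9093/tcp --add-port=9094/tcp --add-port=3000/tcp   # monitoring node
firewall-cmd --reload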

4. Restart the cluster again: OK
1) Restart the cluster: OK
[root@tidb1 ~]# tiup cluster restart tidb-test...(skip)
2) Check the cluster status: OK
[root@tidb1 ~]# tiup cluster display tidb-test
Starting component `cluster`: /root/.tiup/components/cluster/v1.6.1/tiup-cluster display tidb-test
Cluster type:       tidb
Cluster name:       tidb-test
Cluster version:    v5.2.1
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://192.168.80.135:2379/dashboard
ID                    Role          Host            Ports        OS/Arch       Status  Data Dir                      Deploy Dir
--                    ----          ----            -----        -------       ------  --------                      ----------
192.168.80.135:9093   alertmanager  192.168.80.135  9093/9094    linux/x86_64  Up      /tidb-data/alertmanager-9093  /tidb-deploy/alertmanager-9093
192.168.80.135:3000   grafana       192.168.80.135  3000         linux/x86_64  Up      -                             /tidb-deploy/grafana-3000
192.168.80.135:2379   pd            192.168.80.135  2379/2380    linux/x86_64  Up|UI   /tidb-data/pd-2379            /tidb-deploy/pd-2379
192.168.80.136:2379   pd            192.168.80.136  2379/2380    linux/x86_64  Up|L    /tidb-data/pd-2379            /tidb-deploy/pd-2379
192.168.80.137:2379   pd            192.168.80.137  2379/2380    linux/x86_64  Up      /tidb-data/pd-2379            /tidb-deploy/pd-2379
192.168.80.135:9090   prometheus    192.168.80.135  9090         linux/x86_64  Up      /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090
192.168.80.141:4000   tidb          192.168.80.141  4000/10080   linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
192.168.80.138:20160  tikv          192.168.80.138  20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
192.168.80.139:20160  tikv          192.168.80.139  20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
192.168.80.140:20160  tikv          192.168.80.140  20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
Total nodes: 10
Found cluster newer version:
        The latest version:         v1.8.2
        Local installed version:    v1.6.1
        Update current component:   tiup update cluster
        Update all components:      tiup update --all
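Beyond the Up status shown by tiup, a quick SQL connection confirms that the tidb server is actually serving requests again. A sketch, assuming a mysql client is installed on the workstation and valid credentials are supplied (replace the user and password as appropriate):

mysql -h 192.168.80.141 -P 4000 -u root -p -e "SELECT VERSION();"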
This post is meant for discussion; corrections and suggestions are welcome.
Author: Wang Kun (王坤), WeChat public account: rundba. Reposting is welcome; please credit the source.
For reposting via a WeChat public account, please contact WeChat: landnow.






