
0128.T TiDB cluster startup error troubleshooting notes: failed to start tidb / Err: connection error

rundba 2022-01-27

A previously healthy TiDB cluster failed to start after its hosts were rebooted, with tidb reporting "Err: connection error". Investigation showed that firewalld was enabled to start at boot, so inter-node port communication failed and the tidb startup timed out.


0. ENV

tidb v5.2.1


1. Cluster restart fails

Restarting the cluster, the tidb component fails to come up: failed to start: failed to start tidb: failed to start.

[root@tidb1 ~]# tiup cluster restart tidb-test
...
Starting component tidb
Starting instance 192.168.80.141:4000


Error: failed to start: failed to start tidb: failed to start: 192.168.80.141 tidb-4000.service, please check the instance's log(/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s


Verbose debug logs has been written to /root/.tiup/logs/tiup-cluster-debug-2022-01-26-10-18-20.log.


Found cluster newer version:
The latest version: v1.8.2
Local installed version: v1.6.1
Update current component: tiup update cluster
Update all components: tiup update --all
Error: run `/root/.tiup/components/cluster/v1.6.1/tiup-cluster` (wd:/root/.tiup/data/SvaM07Y) failed: exit status 1


2. Problem analysis

1) Check the failing component's log

Log in to the tidb host 192.168.80.141 and inspect tidb.log, which repeatedly reports "Err: connection error".

[tidb@tidb1 log]$ tail -f /tidb-deploy/tidb-4000/log/tidb.log 
...
[2022/01/26 10:20:31.787 +08:00] [INFO] [grpclogger.go:69] ["Subchannel Connectivity change to TRANSIENT_FAILURE"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.787 +08:00] [INFO] [grpclogger.go:69] ["Channel Connectivity change to TRANSIENT_FAILURE"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.822 +08:00] [INFO] [grpclogger.go:69] ["Subchannel Connectivity change to CONNECTING"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.822 +08:00] [INFO] [grpclogger.go:69] ["Subchannel picks a new address \"192.168.80.139:20160\" to connect"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.822 +08:00] [INFO] [grpclogger.go:69] ["Channel Connectivity change to CONNECTING"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.823 +08:00] [WARN] [grpclogger.go:81] ["grpc: addrConn.createTransport failed to connect to {192.168.80.139:20160 <nil> 0 <nil>}. Err: connection error: desc = \"transport: Error while dialing dial tcp 172.18.33.139:20160: connect: no route to host\". Reconnecting..."] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.823 +08:00] [INFO] [grpclogger.go:69] ["Subchannel Connectivity change to TRANSIENT_FAILURE"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:31.823 +08:00] [INFO] [grpclogger.go:69] ["Channel Connectivity change to TRANSIENT_FAILURE"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.011 +08:00] [INFO] [grpclogger.go:69] ["Subchannel Connectivity change to CONNECTING"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.011 +08:00] [INFO] [grpclogger.go:69] ["Subchannel picks a new address \"192.168.80.138:20160\" to connect"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.012 +08:00] [INFO] [grpclogger.go:69] ["Channel Connectivity change to CONNECTING"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.012 +08:00] [WARN] [grpclogger.go:81] ["grpc: addrConn.createTransport failed to connect to {192.168.80.138:20160 <nil> 0 <nil>}. Err: connection error: desc = \"transport: Error while dialing dial tcp 172.18.33.138:20160: connect: no route to host\". Reconnecting..."] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.012 +08:00] [INFO] [grpclogger.go:69] ["Subchannel Connectivity change to TRANSIENT_FAILURE"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.012 +08:00] [INFO] [grpclogger.go:69] ["Channel Connectivity change to TRANSIENT_FAILURE"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.331 +08:00] [INFO] [grpclogger.go:69] ["Subchannel Connectivity change to CONNECTING"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.331 +08:00] [INFO] [grpclogger.go:69] ["Subchannel picks a new address \"192.168.80.140:20160\" to connect"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.332 +08:00] [INFO] [grpclogger.go:69] ["Channel Connectivity change to CONNECTING"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.893 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=192.168.80.138:20160] [forwardedHost=] [error="context deadline exceeded"]
[2022/01/26 10:20:32.893 +08:00] [INFO] [grpclogger.go:69] ["Channel Created"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.893 +08:00] [INFO] [grpclogger.go:69] ["parsed scheme: \"\""] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.893 +08:00] [INFO] [grpclogger.go:69] ["scheme \"\" not registered, fallback to default scheme"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.894 +08:00] [INFO] [grpclogger.go:69] ["ccResolverWrapper: sending update to cc: {[{192.168.80.138:20160 <nil> 0 <nil>}] <nil> <nil>}"] [system=grpc] [grpc_log=true]
[2022/01/26 10:20:32.894 +08:00] [INFO] [grpclogger.go:69] ["Resolver state updated: {Addresses:[{Addr:192.168.80.138:20160 ServerName: Attributes:<nil> Type:0 Metadata:<nil>}] ServiceConfig:<nil> Attributes:<nil>} (resolver returned new addresses)"] [system=grpc] [grpc_log=true]
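The gRPC reconnect loop above is noisy; when skimming a large tidb.log it helps to filter for WARN/ERROR entries first, which is where the actionable "connection error" lines live. A small sketch (the log path is this deployment's; the `show_errors` helper name is my own):

```shell
#!/usr/bin/env bash
# Show only the last 20 WARN/ERROR entries of a tidb log file.
show_errors() {
  grep -E '\[(WARN|ERROR)\]' "$1" | tail -n 20
}

LOG=/tidb-deploy/tidb-4000/log/tidb.log
if [ -f "$LOG" ]; then
  show_errors "$LOG"
fi
```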


2) Preliminary analysis

Initial judgment: this is a network connectivity or firewall problem.


3. Resolution

3.1 Network check

Ping tests between all nodes succeed, so basic network connectivity is fine.
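ping only proves ICMP works; the "no route to host" errors above concern the TCP service ports, so it is worth probing those directly. A minimal sketch using bash's /dev/tcp pseudo-device (`check_port` is a hypothetical helper; the endpoints are the TiKV and PD addresses from the log above):

```shell
#!/usr/bin/env bash
# Probe a TCP port; prints OK if a connection can be established,
# FAIL if it is refused, filtered, or unreachable within 2 seconds.
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OK   ${host}:${port}"
  else
    echo "FAIL ${host}:${port}"
  fi
}

# TiKV and PD endpoints this tidb instance was failing to reach.
for ep in 192.168.80.138:20160 192.168.80.139:20160 \
          192.168.80.140:20160 192.168.80.135:2379; do
  check_port "${ep%%:*}" "${ep##*:}"
done
```

With firewalld blocking the ports, every endpoint reports FAIL even though ping succeeds.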


3.2 Firewall check

1) Check firewall status

On every node, firewalld is active:

[root@tikv1 ~]# systemctl status firewalld
firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2022-01-26 10:01:17 CST; 26min ago
Docs: man:firewalld(1)
Main PID: 9884 (firewalld)
CGroup: /system.slice/firewalld.service
└─9884 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid


Jan 26 10:01:16 tikv1 systemd[1]: Starting firewalld - dynamic firewall daemon...
Jan 26 10:01:17 tikv1 systemd[1]: Started firewalld - dynamic firewall daemon.


2) Stop the firewall

Stop firewalld:

[root@tikv1 ~]# systemctl stop firewalld


Disable it at boot so it does not come back after the next reboot:

[root@tikv1 ~]# systemctl disable firewalld
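Disabling firewalld outright is fine in a lab, but where the firewall must stay on, an alternative is to open only the ports the cluster components use. A hedged sketch (the port list matches this cluster's components as shown by `tiup cluster display`; the script only prints the firewall-cmd invocations so they can be reviewed before applying):

```shell
#!/usr/bin/env bash
# Ports used by this cluster: tidb 4000/10080, pd 2379/2380,
# tikv 20160/20180, prometheus 9090, grafana 3000, alertmanager 9093/9094.
TIDB_PORTS="4000 10080 2379 2380 20160 20180 9090 3000 9093 9094"

# Print the firewall-cmd commands; review, then run them as root.
gen_firewall_rules() {
  local p
  for p in $TIDB_PORTS; do
    echo "firewall-cmd --permanent --add-port=${p}/tcp"
  done
  echo "firewall-cmd --reload"
}

gen_firewall_rules
```

For example, `gen_firewall_rules | sudo sh` on each node would keep firewalld running while still allowing cluster traffic.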


3) Check firewall status again: now inactive

[root@tikv1 ~]# systemctl status firewalld
firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Wed 2022-01-26 10:28:30 CST; 55s ago
Docs: man:firewalld(1)
Process: 9884 ExecStart=/usr/sbin/firewalld --nofork --nopid $FIREWALLD_ARGS (code=exited, status=0/SUCCESS)
Main PID: 9884 (code=exited, status=0/SUCCESS)


Jan 26 10:01:16 tikv1 systemd[1]: Starting firewalld - dynamic firewall daemon...
Jan 26 10:01:17 tikv1 systemd[1]: Started firewalld - dynamic firewall daemon.
Jan 26 10:28:30 tikv1 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jan 26 10:28:30 tikv1 systemd[1]: Stopped firewalld - dynamic firewall daemon.


4. Restart the cluster again: OK

1) Restart the cluster: OK

[root@tidb1 ~]# tiup cluster restart tidb-test
...(skip)


2) Check cluster status: OK

[root@tidb1 ~]# tiup cluster display tidb-test
Starting component `cluster`: /root/.tiup/components/cluster/v1.6.1/tiup-cluster display tidb-test
Cluster type: tidb
Cluster name: tidb-test
Cluster version: v5.2.1
Deploy user: tidb
SSH type: builtin
Dashboard URL: http://192.168.80.135:2379/dashboard
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
-- ---- ---- ----- ------- ------ -------- ----------
192.168.80.135:9093 alertmanager 192.168.80.135 9093/9094 linux/x86_64 Up /tidb-data/alertmanager-9093 /tidb-deploy/alertmanager-9093
192.168.80.135:3000 grafana 192.168.80.135 3000 linux/x86_64 Up - /tidb-deploy/grafana-3000
192.168.80.135:2379 pd 192.168.80.135 2379/2380 linux/x86_64 Up|UI /tidb-data/pd-2379 /tidb-deploy/pd-2379
192.168.80.136:2379 pd 192.168.80.136 2379/2380 linux/x86_64 Up|L /tidb-data/pd-2379 /tidb-deploy/pd-2379
192.168.80.137:2379 pd 192.168.80.137 2379/2380 linux/x86_64 Up /tidb-data/pd-2379 /tidb-deploy/pd-2379
192.168.80.135:9090 prometheus 192.168.80.135 9090 linux/x86_64 Up /tidb-data/prometheus-9090 /tidb-deploy/prometheus-9090
192.168.80.141:4000 tidb 192.168.80.141 4000/10080 linux/x86_64 Up - /tidb-deploy/tidb-4000
192.168.80.138:20160 tikv 192.168.80.138 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
192.168.80.139:20160 tikv 192.168.80.139 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
192.168.80.140:20160 tikv 192.168.80.140 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
Total nodes: 10




Found cluster newer version:
The latest version: v1.8.2
Local installed version: v1.6.1
Update current component: tiup update cluster
Update all components: tiup update --all


- End -



