Problem description: YARN restarted unexpectedly and the Flink job went down (a monitoring script detected the failure and automatically recovered the job).
YARN background logs
The logs show that the ResourceManager could not connect to ZooKeeper, which triggered a ResourceManager HA election; no Active ResourceManager could be found.

After the HA state switched to standby, the ResourceManager began shutting down its services (port 8032, the RM endpoint for clients; port 8030, the RM endpoint for ApplicationMasters; port 8031, the RM endpoint for NodeManagers).
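For reference (an illustration, not part of the original article): these three endpoints correspond to the standard yarn-site.xml keys yarn.resourcemanager.address, yarn.resourcemanager.scheduler.address and yarn.resourcemanager.resource-tracker.address. A minimal sketch that prints whatever addresses the local YARN configuration resolves to:

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmEndpoints {
    public static void main(String[] args) {
        // Loads yarn-site.xml from the classpath; falls back to the defaults
        // (0.0.0.0:8032 / 8030 / 8031) when the keys are not set explicitly.
        YarnConfiguration conf = new YarnConfiguration();
        System.out.println("client -> " + conf.get(YarnConfiguration.RM_ADDRESS,
                YarnConfiguration.DEFAULT_RM_ADDRESS));
        System.out.println("AM     -> " + conf.get(YarnConfiguration.RM_SCHEDULER_ADDRESS,
                YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS));
        System.out.println("NM     -> " + conf.get(YarnConfiguration.RM_RESOURCE_TRACKER_ADDRESS,
                YarnConfiguration.DEFAULT_RM_RESOURCE_TRACKER_ADDRESS));
    }
}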

Then the Recover phase began and the RM was restored.

After the RM restarted, it began receiving container status registrations from the NodeManagers (the Flink job's containers); the timestamp 1586772031875 shows the job was created at 2020-04-13 18:00:31. The RM found that the registered containers belonged to an unknown application, one it could no longer find in its RM context, so it added the application to the finished-applications list to be cleaned up later.
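As a quick check (illustration only, not from the original logs), the millisecond timestamp can be converted to local time with java.time:

import java.time.Instant;
import java.time.ZoneId;

public class ClusterTimestamp {
    public static void main(String[] args) {
        long ts = 1586772031875L; // the timestamp seen in the RM log
        // Prints 2020-04-13T18:00:31.875+08:00[Asia/Shanghai]
        System.out.println(Instant.ofEpochMilli(ts).atZone(ZoneId.of("Asia/Shanghai")));
    }
}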

A quick look at the source code
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl
private static void handleRunningAppOnNode(RMNodeImpl rmNode,
    RMContext context, ApplicationId appId, NodeId nodeId) {
  RMApp app = context.getRMApps().get(appId);

  // if we failed getting app by appId, maybe something wrong happened, just
  // add the app to the finishedApplications list so that the app can be
  // cleaned up on the NM
  if (null == app) {
    LOG.warn("Cannot get RMApp by appId=" + appId
        + ", just added it to finishedApplications list for cleanup");
    rmNode.finishedApplications.add(appId);
    rmNode.runningApplications.remove(appId);
    return;
  }

  // Add running applications back due to Node add or Node reconnection.
  rmNode.runningApplications.add(appId);
  context.getDispatcher().getEventHandler().handle(
      new RMAppRunningOnNodeEvent(appId, nodeId));
}
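In other words, when a NodeManager reports containers for an application that the restarted RM can no longer find in its context, the RM treats that application as finished so the NM will clean up its containers locally; this matches the Flink job being gone after the restart and having to be resubmitted.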
The Flink job monitoring script detected that the job had died and resubmitted it to YARN.
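The article does not show the script itself. As a rough sketch of the idea (not the author's actual script; the job name "my-flink-job" and the resubmission step are assumptions), the liveness check could be done against YARN with YarnClient:

import java.util.EnumSet;
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FlinkJobMonitor {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();
        try {
            // Ask the RM for all applications currently in the RUNNING state.
            List<ApplicationReport> running =
                yarnClient.getApplications(EnumSet.of(YarnApplicationState.RUNNING));
            boolean alive = running.stream()
                .anyMatch(report -> "my-flink-job".equals(report.getName()));
            if (!alive) {
                // The author's script resubmits the job at this point,
                // e.g. by invoking the flink run command again.
                System.out.println("Flink job is not running, resubmitting...");
            }
        } finally {
            yarnClient.stop();
        }
    }
}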

ZK background logs
Around the same time, WARN messages appear in the ZK logs: sessions timed out, and the server then shut down.

Key point: syncing the write-ahead log (WAL) took about 22 seconds. The server closed its connection to the leader, switched to the LOOKING state, and then ran a new election using the FastLeaderElection algorithm.
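For context (an illustration, not part of the original article): a ZooKeeper client requests a session timeout that the server clamps into [minSessionTimeout, maxSessionTimeout] during the handshake, so a 22-second stall on the server side easily outlives the negotiated session and expires it. A minimal sketch, assuming a hypothetical ensemble address:

import org.apache.zookeeper.ZooKeeper;

public class ZkSessionDemo {
    public static void main(String[] args) throws Exception {
        // Request a 30 s session; the server clamps the value into
        // [minSessionTimeout, maxSessionTimeout] during connection setup.
        ZooKeeper zk = new ZooKeeper(
            "zk1:2181,zk2:2181,zk3:2181",   // hypothetical ensemble
            30_000,
            event -> System.out.println("state: " + event.getState()));
        System.out.println("negotiated session timeout = " + zk.getSessionTimeout() + " ms");
        zk.close();
    }
}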

A quick look at the source code
org.apache.zookeeper.server.SyncRequestProcessor#flush  ->  zks.getZKDatabase().commit();
org.apache.zookeeper.server.ZKDatabase#commit  ->  this.snapLog.commit();
org.apache.zookeeper.server.persistence.FileTxnSnapLog#commit  ->  txnLog.commit();
org.apache.zookeeper.server.persistence.FileTxnLog#commit
/**
 * commit the logs. make sure that everything hits the
 * disk
 */
public synchronized void commit() throws IOException {
    if (logStream != null) {
        logStream.flush();
    }
    for (FileOutputStream log : streamsToFlush) {
        log.flush();
        if (forceSync) {
            long startSyncNS = System.nanoTime();

            FileChannel channel = log.getChannel();
            channel.force(false);

            syncElapsedMS = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startSyncNS);
            if (syncElapsedMS > fsyncWarningThresholdMS) {
                if (serverStats != null) {
                    serverStats.incrementFsyncThresholdExceedCount();
                }

                LOG.warn(
                    "fsync-ing the write ahead log in {} took {}ms which will adversely effect operation latency."
                        + "File size is {} bytes. See the ZooKeeper troubleshooting guide",
                    Thread.currentThread().getName(),
                    syncElapsedMS,
                    channel.size());
            }

            ServerMetrics.getMetrics().FSYNC_TIME.add(syncElapsedMS);
        }
    }
    while (streamsToFlush.size() > 1) {
        streamsToFlush.poll().close();
    }

    // Roll the log file if we exceed the size limit
    if (txnLogSizeLimit > 0) {
        long logSize = getCurrentLogSize();
        if (logSize > txnLogSizeLimit) {
            LOG.debug("Log size limit reached: {}", logSize);
            rollLog();
        }
    }
}
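The WARN message seen in the ZK logs comes from the syncElapsedMS > fsyncWarningThresholdMS branch above; that threshold defaults to 1000 ms and, if memory serves, can be raised via the zookeeper.fsync.warningthresholdms system property, whereas here the measured fsync latency was roughly 22000 ms. The forceSync flag that gates the channel.force(false) call is, in my reading of FileTxnLog, derived from a system property roughly like this (simplified sketch, not verbatim source):

    // forceSync stays true unless the zookeeper.forceSync system property is set to "no".
    boolean forceSync = !System.getProperty("zookeeper.forceSync", "yes").equals("no");

This is exactly what the configuration change below relies on.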
Problem resolution
We changed the ZK configuration and restarted the cluster, and the problem went away (the change makes the if (forceSync) branch above evaluate to false). There is a trade-off, though: forceSync is what guarantees that data is fully flushed to disk. Setting it to no improves ZK write performance to some degree, but it also introduces a durability risk, for example if the machine loses power.
One more observation: this error never occurred before the ZK cluster was shared with HBase, so take care to deploy separate, isolated ZK ensembles for different workloads.
minSessionTimeout=30000
maxSessionTimeout=60000
skipACL=yes
forceSync=no
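(These settings go into zoo.cfg. minSessionTimeout and maxSessionTimeout are regular zoo.cfg keys; as far as I can tell, forceSync and skipACL are not, but QuorumPeerConfig re-exports unrecognized keys as zookeeper.-prefixed system properties, so passing -Dzookeeper.forceSync=no and -Dzookeeper.skipACL=yes to the server JVM should have the same effect. Worth verifying against the ZooKeeper version in use.)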
【END】
This article is reposted from 大数据真有意思.




