
Middleware: Tomcat Process Won't Stop — Analyzing and Fixing a Dubbo Graceful Shutdown Problem

Original article by Ta10nK1n9 · 云和恩墨 (Enmotech) · 2022-09-19

Problem Symptoms

After running the shutdown.sh script, the Tomcat process fails to stop and the JMX port remains occupied, so the next attempt to start Tomcat fails with a "JMX port already in use" error. Investigation showed that every affected application uses the Dubbo framework, pointing to a problem with Dubbo's graceful shutdown.

Error Logs

After shutdown.sh runs, catalina.out contains warnings like the following:

17-Jul-2020 17:47:22.348 WARNING [localhost-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [iot-app] appears to have started a thread named [DubboRegistryFailedRetryTimer-thread-1] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 sun.misc.Unsafe.park(Native Method)
 java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
 java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
 java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
 java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 java.lang.Thread.run(Thread.java:748)

17-Jul-2020 17:47:22.349 WARNING [localhost-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [iot-app] appears to have started a thread named [DubboSaveRegistryCache-thread-1] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 sun.misc.Unsafe.park(Native Method)
 java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
 java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
 java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 java.lang.Thread.run(Thread.java:748)

17-Jul-2020 17:47:22.350 WARNING [localhost-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [iot-app] appears to have started a thread named [DubboRegistryFailedRetryTimer-thread-1] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 sun.misc.Unsafe.park(Native Method)
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
 java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
 java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
 java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 java.lang.Thread.run(Thread.java:748)

Root Cause

Consider two pieces of code in com.alibaba.dubbo.remoting.transport.netty.NettyClient:

// ChannelFactory is made static to work around a DirectMemory leak when it is closed
// https://issues.jboss.org/browse/NETTY-424
private static final ChannelFactory channelFactory = new NioClientSocketChannelFactory(
        Executors.newCachedThreadPool(new NamedThreadFactory("NettyClientBoss", true)),
        Executors.newCachedThreadPool(new NamedThreadFactory("NettyClientWorker", true)),
        Constants.DEFAULT_IO_THREADS);

This creates the boss and worker thread pools for Netty. The comment refers to a Netty memory-leak bug: frequently creating and closing ChannelFactory instances leaks large amounts of direct (off-heap) memory and eventually triggers an OOM. To avoid that, Dubbo made channelFactory static and, at the same time, commented out the resource-release call in the doClose() method below (bootstrap.releaseExternalResources() ultimately calls channelFactory.releaseExternalResources()), so channelFactory is never explicitly closed.

@Override
protected void doClose() throws Throwable {
    /*try {
        bootstrap.releaseExternalResources();
    } catch (Throwable t) {
        logger.warn(t.getMessage());
    }*/
}
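The second argument passed to NamedThreadFactory in the channelFactory snippet above is a daemon flag. A minimal factory in the same spirit (a sketch for illustration, not Dubbo's actual implementation) shows what that flag controls:

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a named, daemon-aware ThreadFactory, similar in spirit to
// Dubbo's NamedThreadFactory (the real class differs in detail).
class SimpleNamedThreadFactory implements ThreadFactory {
    private final AtomicInteger count = new AtomicInteger(1);
    private final String prefix;
    private final boolean daemon;

    SimpleNamedThreadFactory(String prefix, boolean daemon) {
        this.prefix = prefix;
        this.daemon = daemon;
    }

    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r, prefix + "-thread-" + count.getAndIncrement());
        // Daemon threads do not prevent the JVM from exiting, which is why the
        // boss/worker pools above (created with daemon=true) are not the problem.
        t.setDaemon(daemon);
        return t;
    }
}
```

Because the boss and worker pools are created with daemon=true, they alone would never keep the JVM alive; the problem lies elsewhere.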

Looking at the NioClientSocketChannelFactory constructor, besides the boss and worker pools it also creates a HashedWheelTimer timer thread, and that thread is non-daemon. Because this non-daemon thread is never explicitly stopped, the JVM never exits. Tomcat can only stop threads it started itself; threads created by an application inside the servlet container cannot be stopped by Tomcat and must be managed and closed by the application. So even after the Tomcat container has been destroyed, the process lives on.
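The effect of that non-daemon thread can be reproduced with plain JDK threads (a minimal sketch, no Netty or Dubbo involved):

```java
public class NonDaemonExitDemo {
    public static void main(String[] args) {
        // Daemon threads never block JVM exit.
        Thread daemon = new Thread(NonDaemonExitDemo::sleepForever, "daemon-worker");
        daemon.setDaemon(true);
        daemon.start();

        // Threads are non-daemon by default, like Netty's "Hashed wheel timer" thread.
        // If such a thread is never stopped, the process outlives main().
        Thread timerLike = new Thread(NonDaemonExitDemo::sleepForever, "timer-like");
        timerLike.start();

        // Without this interrupt the JVM would hang here forever,
        // which is exactly the Tomcat symptom described above.
        timerLike.interrupt();
    }

    private static void sleepForever() {
        try {
            Thread.sleep(Long.MAX_VALUE);
        } catch (InterruptedException ignored) {
            // interrupted: let the thread end so the JVM can exit
        }
    }
}
```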

After the failed Tomcat shutdown, a thread dump was taken manually. It shows one remaining non-daemon thread, and it is a HashedWheelTimer thread:

"Hashed wheel timer #1" #46 prio=5 os_prio=0 tid=0x00007f06889f9000 nid=0x52d7 waiting on condition [0x00007f06810e3000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:503)
        at org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:401)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at java.lang.Thread.run(Thread.java:748)

In short: to work around the Netty memory-leak bug, Dubbo does not explicitly release the HashedWheelTimer thread during destruction, so the JVM cannot exit normally.
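Besides jstack, such leftovers can also be spotted from inside the JVM by listing live non-daemon threads; any of them (other than main) will keep the process alive. A minimal sketch:

```java
import java.util.Map;

public class NonDaemonThreadLister {
    public static void main(String[] args) {
        // Thread.getAllStackTraces() returns a snapshot of all live threads.
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            // Non-daemon threads are the ones that block JVM exit.
            if (!t.isDaemon()) {
                System.out.println(t.getName() + " state=" + t.getState());
            }
        }
    }
}
```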

Solutions

The solutions below are for reference only; design a fix appropriate to your application's actual situation.

Solution 1

When Tomcat shuts down, stop the HashedWheelTimer manually (via reflection). From then on everything proceeds according to Dubbo's own "plan", and the DubboShutdownHook runs as well.

ProtocolConfig.destroyAll() also has two points worth optimizing:

1) After the provider removes its node from ZooKeeper, consumers should in theory receive the notification and update their provider lists immediately. But unregistering from the registry and the consumers removing the provider from their lists are not an atomic operation: in a large cluster, upstream consumers may not have finished updating their lists when the provider, seeing no in-flight calls, immediately stops exposing its services, so calls from those consumers fail. Dubbo's default graceful shutdown therefore relies on upstream consumers having a retry mechanism; yet adding retries to consumers raises the avalanche risk during failures, so most distributed services are reluctant to add internal retries between services. The essence of the problem is that the provider removes its ZooKeeper node without giving consumers enough time to update their service lists. A simple fix is to sleep for a while after removing the ZooKeeper node and before destroying the protocols; during that sleep the provider can still handle requests from consumers whose provider lists have not yet been updated, greatly reducing the probability of failed calls.

2) When the Dubbo protocol is used, in DubboProtocol.destroy(), once the provider has stopped exposing services and all in-flight tasks have completed, the consumer side could in theory be closed immediately via client.close(): since all upstream calls have been processed and no new calls are arriving, downstream calls should also have finished. In practice, however, other kinds of work (for example, scheduled tasks) may still be calling downstream RPC services, so closing the client immediately could fail those calls. Therefore client.close(int timeout) should be called instead, to wait for those calls to complete.

In the end, the following two places are changed:

ProtocolConfig.destroyAll():

public static void destroyAll() {
    AbstractRegistryFactory.destroyAll();
    try {
        // Give consumers time to receive the registry notification
        // and remove this provider from their lists.
        Thread.sleep(5000);
    } catch (InterruptedException e) {
        logger.warn("Interrupted unexpectedly when waiting for registry notification during shutdown process!");
    }
    ExtensionLoader<Protocol> loader = ExtensionLoader.getExtensionLoader(Protocol.class);
    for (String protocolName : loader.getLoadedExtensions()) {
        try {
            Protocol protocol = loader.getLoadedExtension(protocolName);
            if (protocol != null) {
                protocol.destroy();
            }
        } catch (Throwable t) {
            logger.warn(t.getMessage(), t);
        }
    }
}

DubboProtocol.destroy():

for (String key : new ArrayList<String>(referenceClientMap.keySet())) {
    ExchangeClient client = referenceClientMap.remove(key);
    if (client != null) {
        try {
            if (logger.isInfoEnabled()) {
                logger.info("Close dubbo connect: " + client.getLocalAddress() + "-->" + client.getRemoteAddress());
            }
            // Wait up to the configured timeout for in-flight calls
            // to finish instead of closing immediately.
            client.close(getServerShutdownTimeout());
        } catch (Throwable t) {
            logger.warn(t.getMessage(), t);
        }
    }
}

Solution 2

Release the resources Dubbo holds first, then use reflection to release the resources held by NettyClient:

// First release the resources held by Dubbo
ProtocolConfig.destroyAll();
// Then use reflection to release the resources held by NettyClient,
// avoiding the failure to shut down gracefully
releaseNettyClientExternalResources();

private void releaseNettyClientExternalResources() {
    try {
        // channelFactory is a private static field that doClose() never
        // releases; grab it via reflection and release it explicitly.
        // This also stops the non-daemon HashedWheelTimer thread.
        Field field = NettyClient.class.getDeclaredField("channelFactory");
        field.setAccessible(true);
        ChannelFactory channelFactory = (ChannelFactory) field.get(NettyClient.class);
        channelFactory.releaseExternalResources();
        field.setAccessible(false);
        LOGGER.info("Release NettyClient's external resources");
    } catch (Exception e) {
        LOGGER.error("Release NettyClient's external resources error", e);
    }
}
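Wherever this cleanup is wired in (a servlet context listener or a JVM shutdown hook are common choices), the ordering matters: destroy Dubbo first, then release Netty's resources. A plain-JDK shutdown-hook sketch of that ordering, where destroyDubbo() and releaseNetty() are hypothetical stand-ins for ProtocolConfig.destroyAll() and releaseNettyClientExternalResources():

```java
public class DubboCleanupHookDemo {
    public static void main(String[] args) {
        // Register the cleanup to run when the JVM begins shutting down.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            destroyDubbo();  // stands in for ProtocolConfig.destroyAll()
            releaseNetty();  // stands in for releaseNettyClientExternalResources()
        }, "dubbo-cleanup-hook"));
    }

    // Hypothetical stand-ins; in a real application these would call the
    // Dubbo and reflection code shown above.
    private static void destroyDubbo() {
        System.out.println("dubbo destroyed");
    }

    private static void releaseNetty() {
        System.out.println("netty resources released");
    }
}
```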

Reference:

https://xiangxianzui.github.io/2019/08/dubbo优雅下线分析/

https://www.jianshu.com/p/7ae242c00279
