Symptoms
After the shutdown.sh script runs, the Tomcat process fails to exit and its JMX port stays occupied, so the next attempt to start Tomcat fails with a "JMX port already in use" error. Investigation showed that every affected application uses the Dubbo framework, and the problem is related to Dubbo's graceful-shutdown handling.
Error Log
After shutdown.sh runs, catalina.out contains warnings like the following:
17-Jul-2020 17:47:22.348 WARNING [localhost-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [iot-app] appears to have started a thread named [DubboRegistryFailedRetryTimer-thread-1] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 sun.misc.Unsafe.park(Native Method)
 java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
 java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
 java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
 java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 java.lang.Thread.run(Thread.java:748)
17-Jul-2020 17:47:22.349 WARNING [localhost-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [iot-app] appears to have started a thread named [DubboSaveRegistryCache-thread-1] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 sun.misc.Unsafe.park(Native Method)
 java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
 java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
 java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 java.lang.Thread.run(Thread.java:748)
17-Jul-2020 17:47:22.350 WARNING [localhost-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [iot-app] appears to have started a thread named [DubboRegistryFailedRetryTimer-thread-1] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 sun.misc.Unsafe.park(Native Method)
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
 java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
 java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
 java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 java.lang.Thread.run(Thread.java:748)
Root Cause
Look at two pieces of code in com.alibaba.dubbo.remoting.transport.netty.NettyClient:
// Closing a ChannelFactory leaks direct memory, so a single static
// instance is used to avoid repeated creation and closing.
// https://issues.jboss.org/browse/NETTY-424
private static final ChannelFactory channelFactory =
        new NioClientSocketChannelFactory(
                Executors.newCachedThreadPool(new NamedThreadFactory("NettyClientBoss", true)),
                Executors.newCachedThreadPool(new NamedThreadFactory("NettyClientWorker", true)),
                Constants.DEFAULT_IO_THREADS);
This code creates Netty's boss and worker thread pools. The comment references a Netty memory-leak bug: frequently creating and closing ChannelFactory instances leaks large amounts of direct (off-heap) memory, which can eventually cause an OOM. To avoid frequent creation and closing, Dubbo made channelFactory static and, at the same time, commented out the resource-release call in the doClose() method shown below (bootstrap.releaseExternalResources() ultimately calls channelFactory.releaseExternalResources()), so channelFactory is never explicitly closed.
@Override
protected void doClose() throws Throwable {
    /*try {
        bootstrap.releaseExternalResources();
    } catch (Throwable t) {
        logger.warn(t.getMessage());
    }*/
}
The constructor of NioClientSocketChannelFactory shows that, besides the boss and worker thread pools, it also creates a HashedWheelTimer timer thread, and that thread is non-daemon. Because this non-daemon thread is never explicitly stopped, the JVM never exits. Tomcat can only stop the threads it started itself; threads created by an application inside the servlet container cannot be stopped by Tomcat and must be managed and stopped by the application. So even though the Tomcat container is destroyed, the process stays alive.
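The effect is easy to reproduce outside Dubbo. Below is a minimal, self-contained sketch (the class and thread names are illustrative, not Dubbo code) showing that a single lingering non-daemon thread keeps the JVM alive after main() returns:

public class NonDaemonDemo {
    public static void main(String[] args) {
        Thread timer = new Thread(() -> {
            while (true) {
                try {
                    Thread.sleep(1000); // mimics a timer's tick loop
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "hashed-wheel-like-timer");
        timer.setDaemon(false); // the default for user threads; shown for emphasis
        timer.start();
        System.out.println("main() returns, but the process keeps running");
        // The JVM exits only once all non-daemon threads have terminated,
        // so this process never exits on its own.
    }
}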
After the failed shutdown, a thread dump was generated manually. It shows one remaining non-daemon thread, and it is the HashedWheelTimer thread:
"Hashed wheel timer #1" #46 prio=5 os_prio=0 tid=0x00007f06889f9000 nid=0x52d7 waiting on condition [0x00007f06810e3000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:503) at org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:401) at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) at java.lang.Thread.run(Thread.java:748) |
In short: to work around the Netty memory-leak bug, Dubbo does not explicitly release the HashedWheelTimer thread during destruction, so the JVM cannot exit normally.
Solutions
The solutions below are for reference only; work out the approach that fits your own application.
Option 1
Close the HashedWheelTimer manually (via reflection) when Tomcat shuts down. Once that is done, everything proceeds according to Dubbo's own "plan", and DubboShutdownHook runs as intended.
On top of that, ProtocolConfig.destroyAll() has two points worth optimizing:
1) After the provider removes its node from ZooKeeper, the consumer should in theory receive the notification and update its provider list immediately. But deregistering the provider from the registry and removing it from the consumer's service list are not atomic. In a large cluster, upstream consumers may not have finished updating their lists by the time the provider, seeing no in-flight invocations, immediately stops exposing the service, so calls from those consumers fail. Dubbo's default graceful-shutdown scheme therefore relies on upstream consumers having a retry mechanism; however, adding retries on the consumer side increases the risk of cascading failures during outages, so most distributed services are reluctant to add inter-service retries. The essence of the problem is that removing the ZooKeeper node does not give consumers enough time to refresh their provider lists. A simple fix is to sleep for a while between removing the node and destroying the protocols; during the sleep, the provider can still handle requests from consumers whose provider lists have not yet been refreshed, which greatly reduces the probability of failed calls.
2) If the Dubbo protocol is in use, then in DubboProtocol.destroy(), once the provider has stopped exposing services and all in-flight tasks have completed, in theory the consumer side can be closed immediately (client.close() in the code): if all upstream calls have been processed and no new calls are arriving, the downstream calls they triggered should also have finished. However, other kinds of requests in the business (scheduled jobs, for example) may also invoke downstream RPC services, and closing the client immediately could fail those calls. So call client.close(int timeout) instead, to give those calls time to complete.
In the end, the following two pieces of code are changed:
ProtocolConfig.destroyAll():
public static void destroyAll() {
    AbstractRegistryFactory.destroyAll();
    // Give consumers time to receive the registry notification and refresh
    // their provider lists before the protocols are destroyed.
    try {
        Thread.sleep(5000);
    } catch (InterruptedException e) {
        logger.warn("Interrupted unexpectedly when waiting for registry notification during shutdown process!");
    }

    ExtensionLoader<Protocol> loader = ExtensionLoader.getExtensionLoader(Protocol.class);
    for (String protocolName : loader.getLoadedExtensions()) {
        try {
            Protocol protocol = loader.getLoadedExtension(protocolName);
            if (protocol != null) {
                protocol.destroy();
            }
        } catch (Throwable t) {
            logger.warn(t.getMessage(), t);
        }
    }
}
DubboProtocol.destroy():
for (String key : new ArrayList<String>(referenceClientMap.keySet())) {
    ExchangeClient client = referenceClientMap.remove(key);
    if (client != null) {
        try {
            if (logger.isInfoEnabled()) {
                logger.info("Close dubbo connect: " + client.getLocalAddress()
                        + "-->" + client.getRemoteAddress());
            }
            // Wait up to the configured shutdown timeout instead of closing
            // immediately, so in-flight downstream calls can finish.
            client.close(ConfigUtils.getServerShutdownTimeout());
        } catch (Throwable t) {
            logger.warn(t.getMessage(), t);
        }
    }
}
Option 2
// First release the resources held by Dubbo itself
ProtocolConfig.destroyAll();
// Then release the resources held by NettyClient via reflection,
// to work around the non-graceful-shutdown problem
releaseNettyClientExternalResources();

private void releaseNettyClientExternalResources() {
    try {
        Field field = NettyClient.class.getDeclaredField("channelFactory");
        field.setAccessible(true);
        ChannelFactory channelFactory = (ChannelFactory) field.get(NettyClient.class);
        channelFactory.releaseExternalResources();
        field.setAccessible(false);
        LOGGER.info("Release NettyClient's external resources");
    } catch (Exception e) {
        LOGGER.error("Release NettyClient's external resources error", e);
    }
}
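Neither option says where this cleanup should be triggered from. One natural place in a Tomcat-hosted webapp is a ServletContextListener, since contextDestroyed() runs while Tomcat is stopping the application. A minimal sketch under that assumption (the listener class name and wiring are illustrative, not from the original post):

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

import com.alibaba.dubbo.config.ProtocolConfig;

// Hypothetical listener: register it in web.xml (or via @WebListener) so the
// cleanup runs during webapp shutdown.
public class DubboCleanupListener implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        // nothing to do at startup
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        // First let Dubbo tear down registries, protocols, and clients ...
        ProtocolConfig.destroyAll();
        // ... then release the static NettyClient channelFactory, and with it
        // the non-daemon HashedWheelTimer, using the reflection helper above.
        releaseNettyClientExternalResources();
    }

    private void releaseNettyClientExternalResources() {
        // body as shown above
    }
}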
References:
https://xiangxianzui.github.io/2019/08/dubbo优雅下线分析/
https://www.jianshu.com/p/7ae242c00279