Background:

https://issues.apache.org/jira/browse/HBASE-9393 (Region Server fails to properly close socket resulting in many CLOSE_WAIT to Data Nodes)
https://issues.apache.org/jira/browse/HDFS-7694 (FSDataInputStream should support "unbuffer")
CLOSE_WAIT

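The keepalive mechanism
The client or the server unexpectedly loses power, crashes, or has its process die and restart;
The network in between fails, and both ends keep waiting without knowing the connection is gone;
An application bug leaves connections in CLOSE_WAIT for a long time;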
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
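tcp_keepalive_time: with TCP keepalive enabled, the time between the last data exchange and the first keepalive probe, i.e. the allowed idle time. While the peer keeps answering, it is also the period between probes. Default: 7200 s (2 h).
tcp_keepalive_probes: once probing has started after tcp_keepalive_time, the number of probes that are sent without receiving an acknowledgement from the peer before the connection is dropped. Default: 9.
tcp_keepalive_intvl: once probing has started after tcp_keepalive_time, the interval between successive probes while no acknowledgement arrives from the peer. Default: 75 s.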
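Taken together, the three parameters bound how long a dead peer can go unnoticed on an idle connection: roughly tcp_keepalive_time plus tcp_keepalive_probes times tcp_keepalive_intvl. The sketch below works this out for the defaults listed above. The class and method names (KeepAliveDetection, worstCaseSeconds) are made up for illustration, and the formula is an approximation of the timer behaviour, not an exact kernel guarantee.

```java
final class KeepAliveDetection {
    /**
     * Approximate seconds from the last data exchange until the kernel
     * gives up on a peer that never answers a keepalive probe.
     */
    static long worstCaseSeconds(long timeSec, int probes, long intvlSec) {
        // Idle period before the first probe, then one interval per unanswered probe.
        return timeSec + (long) probes * intvlSec;
    }

    public static void main(String[] args) {
        // Defaults from the sysctl listing: time = 7200, probes = 9, intvl = 75.
        long seconds = worstCaseSeconds(7200, 9, 75);
        System.out.println(seconds + " s"); // 7200 + 9 * 75 = 7875 s, about 2 h 11 min
    }
}
```

With the defaults, a silently dead peer is therefore detected only after more than two hours, which is why these values are often lowered on servers that hold many long-lived connections.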
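Figure 1. Normal ACK: the connection is kept

Figure 2. The peer responds with RST: the connection is released

Figure 3. The peer does not respond: the connection is released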
Source code walkthrough:
/**
 * Enable/disable {@link SocketOptions#SO_KEEPALIVE SO_KEEPALIVE}.
 *
 * @param on whether or not to have socket keep alive turned on.
 * @exception SocketException if there is an error
 * in the underlying protocol, such as a TCP error.
 * @since 1.3
 * @see #getKeepAlive()
 */
public void setKeepAlive(boolean on) throws SocketException {
    if (isClosed())
        throw new SocketException("Socket is closed");
    getImpl().setOption(SocketOptions.SO_KEEPALIVE, Boolean.valueOf(on));
}
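From the caller's side, using this method could look like the minimal sketch below. It opens a loopback connection purely so that the example is self-contained; the class name KeepAliveDemo and its helper methods are hypothetical and not part of the JDK or of HBase/HDFS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class KeepAliveDemo {
    /** Turns SO_KEEPALIVE on and reads the flag back from the socket. */
    static boolean enableKeepAlive(Socket socket) throws IOException {
        socket.setKeepAlive(true);
        return socket.getKeepAlive();
    }

    /** Opens a loopback connection and enables keepalive on both ends. */
    static boolean keepAliveOnLoopback() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // Each end has to opt in separately; the flag is not negotiated.
            return enableKeepAlive(client) && enableKeepAlive(accepted);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("keepalive enabled on both ends: " + keepAliveOnLoopback());
    }
}
```

Note that setKeepAlive only flips the switch. The probe timing still comes from the kernel parameters described earlier.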
Relevant code in SocketOptions:
/**
 * When the keepalive option is set for a TCP socket and no data
 * has been exchanged across the socket in either direction for
 * 2 hours (NOTE: the actual value is implementation dependent),
 * TCP automatically sends a keepalive probe to the peer. This probe is a
 * TCP segment to which the peer must respond.
 * One of three responses is expected:
 * 1. The peer responds with the expected ACK. The application is not
 *    notified (since everything is OK). TCP will send another probe
 *    following another 2 hours of inactivity.
 * 2. The peer responds with an RST, which tells the local TCP that
 *    the peer host has crashed and rebooted. The socket is closed.
 * 3. There is no response from the peer. The socket is closed.
 *
 * The purpose of this option is to detect if the peer host crashes.
 *
 * Valid only for TCP socket: SocketImpl
 *
 * @see Socket#setKeepAlive
 * @see Socket#getKeepAlive
 */
@Native public final static int SO_KEEPALIVE = 0x0008;
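/* C equivalent; needs #include <sys/socket.h>. sockfd is an existing TCP socket descriptor. */
int opt = 1; /* non-zero turns keepalive on */
setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE, (void*)&opt, sizeof(opt));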
The Linux implementation of TCP keepalive can be found in net/core/sock.c and net/ipv4/tcp_timer.c.
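Key takeaways
a. In Java, keepalive is enabled by calling setKeepAlive on the socket instance (preferably on both ends). This call is only an on/off switch; the other parameters are configured at the system level through sysctl.
b. In C, keepalive is enabled by calling setsockopt on the socket.
c. Changes to the keepalive kernel parameters take effect directly on existing connections that already have keepalive enabled; no restart is needed.
d. TCP keepalive helps when large numbers of connections are never reclaimed and keep holding resources.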
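Point (a) describes setKeepAlive itself. As a hedged aside, JDK 11 and later also ship jdk.net.ExtendedSocketOptions, whose TCP_KEEPIDLE, TCP_KEEPINTERVAL and TCP_KEEPCOUNT options allow per-socket tuning on platforms that support them. The sketch below assumes such a JDK; the class name PerSocketKeepAlive, its method names, and the values 60/10/5 are illustrative only and are not recommendations from the original text.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketOption;
import java.util.Set;
import jdk.net.ExtendedSocketOptions;

final class PerSocketKeepAlive {
    /**
     * Enables keepalive and, if the platform exposes the extended options,
     * overrides the system-wide timing for this socket only.
     * Returns true when the per-socket values were applied.
     */
    static boolean tune(Socket socket, int idleSec, int intervalSec, int probes) throws IOException {
        socket.setKeepAlive(true);
        Set<SocketOption<?>> supported = socket.supportedOptions();
        if (!supported.contains(ExtendedSocketOptions.TCP_KEEPIDLE)
                || !supported.contains(ExtendedSocketOptions.TCP_KEEPINTERVAL)
                || !supported.contains(ExtendedSocketOptions.TCP_KEEPCOUNT)) {
            return false; // fall back to the sysctl values
        }
        socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSec);
        socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSec);
        socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, probes);
        return true;
    }

    /** Applies the tuning to a loopback client socket and reports SO_KEEPALIVE. */
    static boolean keepAliveOnLoopback() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            boolean tuned = tune(client, 60, 10, 5); // illustrative values
            System.out.println("per-socket timing applied: " + tuned);
            return client.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("keepalive enabled: " + keepAliveOnLoopback());
    }
}
```

Checking supportedOptions() first keeps the code portable: where the extended options are missing, the socket simply keeps using the system-wide sysctl values.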





