Elasticsearch: Troubleshooting a Replica Shard Allocation Failure

Original post by 陌殇流苏, 2024-11-26


Background

An alert reported that the ES cluster was in an unhealthy state.

Check the cluster health

GET /_cluster/health
  "status" : "yellow",
  "unassigned_shards" : 1,
  "active_shards_percent_as_number" : 99.96272828922848
#-- One replica shard has not been allocated
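
The check above can be scripted so that an alert carries the relevant numbers. The following is a minimal sketch, not the author's tooling: `summarize_health` is a hypothetical helper name, and it only reads the three fields shown in the response above from an already parsed JSON body.

```python
def summarize_health(health: dict) -> str:
    """Return a one-line summary of a parsed _cluster/health response body."""
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)
    percent = health.get("active_shards_percent_as_number", 0.0)
    if status == "green":
        return "green: all shards assigned"
    return f"{status}: {unassigned} unassigned shard(s), {percent:.2f}% active"


# Field values copied from the response shown above.
sample = {
    "status": "yellow",
    "unassigned_shards": 1,
    "active_shards_percent_as_number": 99.96272828922848,
}
print(summarize_health(sample))
```

Fetching the JSON itself (for example with `urllib.request`) is left out on purpose, since the host, port, and authentication depend on the cluster.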

Check the shard status

GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
index                       shard prirep state      unassigned.reason
qw_weworkchat_2024-11-23    0     r      UNASSIGNED ALLOCATION_FAILED
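
On a cluster with thousands of shards, the `_cat/shards` output is easier to read once it is filtered down to the unassigned rows. Below is a rough sketch that parses the whitespace-separated text format shown above. The function name `unassigned_shards` is hypothetical, and the second data row in the sample (a started primary) is added purely for illustration; it does not appear in the original output.

```python
from typing import Dict, List


def unassigned_shards(cat_output: str) -> List[Dict[str, str]]:
    """Parse `_cat/shards?v` text output and keep only UNASSIGNED rows."""
    lines = [ln for ln in cat_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # Assigned shards have an empty unassigned.reason column,
        # so zip() simply leaves that key out for those rows.
        row = dict(zip(header, line.split()))
        if row.get("state") == "UNASSIGNED":
            rows.append(row)
    return rows


cat_sample = """\
index                       shard prirep state      unassigned.reason
qw_weworkchat_2024-11-23    0     r      UNASSIGNED ALLOCATION_FAILED
qw_weworkchat_2024-11-23    0     p      STARTED
"""

for shard in unassigned_shards(cat_sample):
    print(shard["index"], shard["shard"], shard["prirep"], shard["unassigned.reason"])
```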

Check why the shard allocation failed

POST /_cluster/allocation/explain
{
  "index": "qw_weworkchat_2024-11-23",
  "shard": 0,
  "primary": false
}


{
  "index" : "qw_weworkchat_2024-11-23",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2024-11-23T06:34:15.091Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [DDynItBETCezGi61g_v7ZA]: failed recovery, failure RecoveryFailedException[[qw_weworkchat_2024-11-23][0]: Recovery failed from {1657502388003343432}{TCMBp8W_QPe8DousK_LhYw}{dF2FwQ2rRDqYgXhSdYpdMA}{10.15.128.33}{10.15.128.33:9300}{cdhilmrstw}{ml.machine_memory=8157544448, rack=cvm_4_200005, xpack.installed=true, set=200005, transform.node=true, ip=9.20.80.191, temperature=hot, ml.max_open_jobs=20, region=4} into {1657502388003343532}{DDynItBETCezGi61g_v7ZA}{eNntIvJeRhur3BT_uFd5PA}{10.15.128.40}{10.15.128.40:9300}{cdhilmrstw}{ml.machine_memory=8157544448, rack=cvm_4_200005, xpack.installed=true, set=200005, transform.node=true, ip=9.20.81.78, temperature=hot, ml.max_open_jobs=20, region=4}]; nested: RemoteTransportException[[1657502388003343432][10.15.128.33:9300][internal:index/shard/recovery/start_recovery]]; nested: ParentCircuitBreakingException[[parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [3971273648/3.6gb], which is larger than the limit of [3865470566/3.5gb] , real usage: [3971241072/3.6gb], new bytes reserved: [32576/31.8kb], usages [request=0/0b, fielddata=3600/3.5kb, in_flight_requests=32576/31.8kb, model_inference=0/0b, single_request=0/0b, accounting=100181578/95.5mb]]; ",
    "last_allocation_status" : "no_attempt"
  }

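#-- The allocation failed with a ParentCircuitBreakingException ([parent] Data too large). While the replica was recovering, the start_recovery request sent to the source node (10.15.128.33) would have raised that node's tracked memory usage to 3.6gb, above the parent circuit breaker limit of 3.5gb, so the request was rejected. This usually happens when the node's JVM heap is too small or the node is under heavy load.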
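
To see how far over the limit the node was, the byte counts can be pulled out of the exception text. This is only an illustrative sketch: `breaker_overage` is a hypothetical helper, and the regular expression is written against the wording of the message quoted above, so it may need adjusting for other Elasticsearch versions.

```python
import re

# Matches "would be [<bytes>/<human>], which is larger than the limit of [<bytes>/<human>]"
_BREAKER_RE = re.compile(
    r"would be \[(\d+)/[^\]]+\], which is larger than the limit of \[(\d+)/[^\]]+\]"
)


def breaker_overage(message: str) -> dict:
    """Extract the would-be usage and the limit (in bytes) from a breaker message."""
    match = _BREAKER_RE.search(message)
    if match is None:
        raise ValueError("no parent circuit breaker figures found")
    would_be, limit = int(match.group(1)), int(match.group(2))
    return {"would_be": would_be, "limit": limit, "over_by": would_be - limit}


# Excerpt of the "details" string from the allocation explain response above.
message = (
    "ParentCircuitBreakingException[[parent] Data too large, data for "
    "[internal:index/shard/recovery/start_recovery] would be [3971273648/3.6gb], "
    "which is larger than the limit of [3865470566/3.5gb]"
)

info = breaker_overage(message)
print(f"over the limit by {info['over_by']} bytes ({info['over_by'] / 2**20:.1f} MiB)")
```

With the figures from this incident the overage is 105803082 bytes, roughly 100 MiB, and almost all of it comes from the existing heap usage (`real usage`), not from the 31.8kb recovery request itself.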

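Solution

#-- Option 1: retry the failed allocation. The shard had already reached 5 failed attempts (the default index.allocation.max_retries), so Elasticsearch will not retry it on its own.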
POST _cluster/reroute?retry_failed=true
#-- Option 2
Locate the Elasticsearch configuration file jvm.options and increase the heap size to 4 GB:
-Xms4g
-Xmx4g
Restart the node:
systemctl restart elasticsearch
If the problem persists, check the parent circuit breaker limit:
GET _cluster/settings
Adjust the circuit breaker limit:
PUT _cluster/settings
{
  "transient": {
    "indices.breaker.total.limit": "95%"
  }
}
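
Before changing the limit, it can help to work out what a given percentage means in bytes. The sketch below assumes a 4 GiB heap (the `-Xmx4g` value from Option 2) and approximates the limit as a plain percentage of the heap; the helper name `parent_breaker_limit_bytes` is hypothetical and the rounding may differ slightly from what Elasticsearch computes internally.

```python
GIB = 1024 ** 3


def parent_breaker_limit_bytes(heap_bytes: int, percent: int) -> int:
    """Approximate the parent breaker limit as an integer percentage of the heap."""
    return heap_bytes * percent // 100


# Would-be usage reported in the ParentCircuitBreakingException above.
would_be = 3971273648

# Assumption: a 4 GiB heap, as configured by -Xms4g / -Xmx4g.
limit_95 = parent_breaker_limit_bytes(4 * GIB, 95)
headroom = limit_95 - would_be

print(f"95% of 4 GiB = {limit_95} bytes")
print(f"headroom over the failed request = {headroom / 2**20:.1f} MiB")
```

Under these assumptions the failed request would fit, but only with about 100 MiB to spare, so raising the limit is best treated as a stopgap while the heap pressure itself is investigated.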


Last modified: 2024-11-26 13:29:20
