An Elasticsearch Node Runs Out of Disk Space
Published: 2019-05-24


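We recently ran into an unusual situation: a data node in one of our Elasticsearch clusters ran out of disk space. What happens then? Data gets corrupted and the cluster goes RED. The relevant log entries are shown below. ES-Data_IN_11 was the master node at the time, ES-Data_IN_12 is the data node whose disk filled up, and the affected index is raw_v3.2017_03_22. We are still running Elasticsearch 1.7.2.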
[2017-03-22 11:57:30,503][WARN ][index.merge.scheduler ] [ES-Data_IN_12] [raw_v3.2017_03_22][1] failed to merge
[2017-03-22 11:57:30,646][WARN ][index.engine ] [ES-Data_IN_12] [raw_v3.2017_03_22][1] failed engine [merge exception]
[2017-03-22 11:57:30,663][WARN ][indices.cluster ] [ES-Data_IN_12] [[raw_v3.2017_03_22][1]] marking and sending shard failed due to [engine failure, reason [merge exception]]
[2017-03-22 11:57:31,677][WARN ][cluster.action.shard ] [ES-Data_IN_11] [raw_v3.2017_03_22][1] received shard failed for [raw_v3.2017_03_22][1], node[h9d1tKJFRtmg1aG-VMkA4A], [P], s[STARTED], indexUUID [2Z-UtAr8Qx2h6Xe9BkxvAA], reason [shard failure [engine failure, reason [merge exception]][MergeException[java.io.IOException: There is not enough space on the disk]; nested: IOException[There is not enough space on the disk]; ]]
[2017-03-22 11:57:36,883][WARN ][indices.cluster          ] [ES-Data_IN_12] [[raws_v3.2017_03_22][1]] marking and sending shard failed due to [failed to create shard]
org.elasticsearch.index.shard.IndexShardCreationException: [raw_v3.2017_03_22][1] failed to create shard
Caused by: org.apache.lucene.store.LockObtainFailedException: Can't lock shard [raw_v3.2017_03_22][1], timed out after 5000ms
[2017-03-22 11:57:37,883][WARN ][cluster.action.shard     ] [ES-Data_IN_11] [raw_v3.2017_03_22][1] received shard failed for [raw_v3.2017_03_22][1], node[h9d1tKJFRtmg1aG-VMkA4A], [P], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-03-22T11:57:31.677Z], details[shard failure [engine failure, reason [merge exception]][MergeException[java.io.IOException: There is not enough space on the disk]; nested: IOException[There is not enough space on the disk]; ]]], indexUUID [2Z-UtAr8Qx2h6Xe9BkxvAA], reason [shard failure [failed to create shard][IndexShardCreationException[[raw_v3.2017_03_22][1] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [raw_v3.2017_03_22][1], timed out after 5000ms]; ]]
[2017-03-22 11:58:36,308][WARN ][cluster.action.shard     ] [ES-Data_IN_11] [raw_v3.2017_03_22][1] received shard failed for [raw_v3.2017_03_22][1], node[h9d1tKJFRtmg1aG-VMkA4A], [P], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-03-22T11:58:20.607Z], details[shard failure [failed to create shard][IndexShardCreationException[[raw_v3.2017_03_22][1] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [raw_v3.2017_03_22][1], timed out after 5000ms]; ]]], indexUUID [2Z-UtAr8Qx2h6Xe9BkxvAA], reason [shard failure [failed to create shard][IndexShardCreationException[[raw_v3.2017_03_22][1] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [raw_v3.2017_03_22][1], timed out after 5000ms]; ]]
[2017-03-22 11:59:57,286][WARN ][cluster.action.shard     ] [ES-Data_IN_11] [raw_v3.2017_03_22][1] received shard failed for [raw_v3.2017_03_22][1], node[h9d1tKJFRtmg1aG-VMkA4A], [P], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-03-22T11:59:41.200Z], details[shard failure [failed to create shard][IndexShardCreationException[[raw_v3.2017_03_22][1] failed to create shard]; nested: LockObtainFailedException[Can't lock shard [raw_v3.2017_03_22][1], timed out after 5000ms]; ]]], indexUUID [2Z-UtAr8Qx2h6Xe9BkxvAA], reason [master [ES-Data_IN_11][1_-exjNlSsi7h_0lJYCoRA][RD0003FF7D543F][inet[/100.117.132.72:9300]]{fault_domain=2, update_domain=11, data=false, master=true} marked shard as initializing, but shard is marked as failed, resend shard failure]
[2017-03-22 12:32:00,716][WARN ][index.engine             ] [ES-Data_IN_12] [raw_v3.2017_03_22][0] failed to sync translog
[2017-03-22 12:32:00,872][WARN ][indices.cluster          ] [ES-Data_IN_12] [[raw_v3.2017_03_22][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [raw_v3.2017_03_22][0] failed to recover shard
[2017-03-22 12:33:55,854][WARN ][cluster.action.shard     ] [ES-Data_IN_11] [raw_v3.2017_03_22][0] received shard failed for [raw_v3.2017_03_22][0], node[h9d1tKJFRtmg1aG-VMkA4A], [P], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-03-22T12:32:01.410Z], details[shard failure [failed recovery][IndexShardGatewayRecoveryException[[raw_v3.2017_03_22][0] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchIllegalArgumentException[No version type match [46]]; ]]], indexUUID [2Z-UtAr8Qx2h6Xe9BkxvAA], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[raw_v3.2017_03_22][0] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchIllegalArgumentException[No version type match [46]]; ]]
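When triaging entries like these, it can help to split each one into its fields and flag the ones that mention the full disk. The following is a minimal sketch, not an official parser: the regular expression is my own reading of the layout seen in the entries above (`[timestamp][LEVEL][component] [node] message`), and the function names `parse_log_line` and `is_disk_full` are hypothetical.

```python
import re

# Layout inferred from the entries above (an assumption, not a format spec):
# [timestamp][LEVEL ][component   ] [node] message
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]"
    r"\[(?P<level>\w+)\s*\]"
    r"\[(?P<component>[^\]]+?)\s*\]\s+"
    r"\[(?P<node>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def is_disk_full(entry):
    """True if a parsed entry carries the disk-full IOException text seen above."""
    return "not enough space on the disk" in entry["message"]


if __name__ == "__main__":
    sample = (
        "[2017-03-22 11:57:30,503][WARN ][index.merge.scheduler ] "
        "[ES-Data_IN_12] [raw_v3.2017_03_22][1] failed to merge"
    )
    entry = parse_log_line(sample)
    print(entry["node"], entry["component"], is_disk_full(entry))
```

Continuation lines such as stack traces do not start with a bracketed timestamp, so `parse_log_line` returns None for them and the caller decides whether to attach them to the previous entry.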
        
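The logs above show that when ES-Data_IN_12 ran out of disk space, the first thing to fail was the background segment merge, because a merge needs extra space to write its merged output. The full disk also corrupted the translog, and Elasticsearch did not appear able to repair it automatically. According to one discussion, manually deleting the .recovering file resolves this, but I have not tried it myself. Finally, things might have gone differently had a replica been available. Our index went RED because its only replica had never been assigned to any node since the index was created and had stayed unassigned, so when the primary failed as well there was no usable copy left.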
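An unassigned replica like ours is easy to miss until the primary fails. The sketch below scans text in the column order of the `_cat/shards` API (index, shard, prirep, state, ...) and lists replicas that are UNASSIGNED. It assumes the text was fetched separately, for example from `GET /_cat/shards`; the function name `unassigned_replicas` and the sample rows (document counts, sizes, IP address) are made up for illustration.

```python
from collections import defaultdict


def unassigned_replicas(cat_shards_text):
    """Map index name -> sorted shard numbers whose replica copy is UNASSIGNED.

    Expects whitespace-separated rows whose first four columns are
    index, shard, prirep ("p" or "r"), and state.
    """
    found = defaultdict(list)
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed row
        index, shard, prirep, state = fields[:4]
        if prirep == "r" and state == "UNASSIGNED":
            found[index].append(int(shard))
    return {index: sorted(shards) for index, shards in found.items()}


if __name__ == "__main__":
    # Hypothetical sample; the numbers and the IP are not from the incident.
    sample = """\
raw_v3.2017_03_22 0 p STARTED 1200 1.1gb 10.0.0.12 ES-Data_IN_12
raw_v3.2017_03_22 0 r UNASSIGNED
raw_v3.2017_03_22 1 p STARTED 1180 1.0gb 10.0.0.12 ES-Data_IN_12
raw_v3.2017_03_22 1 r UNASSIGNED
"""
    print(unassigned_replicas(sample))
```

A non-empty result means those shards have only their primary to rely on, which is the situation that turned our index RED.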
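To sum up: Elasticsearch is a distributed indexing system built on Lucene, and it relies on the local file system to store Lucene files, so available storage is critical to it. Elasticsearch provides a setting that controls the disk usage level above which shards are no longer allocated to a node; its default is 85%.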
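The text above does not name the setting. The one that matches the description is the disk-based allocation low watermark, `cluster.routing.allocation.disk.watermark.low`, whose default is 85%. The sketch below is only an illustration of the threshold arithmetic plus the request body that could be sent to `PUT /_cluster/settings` to change it. The helper names are hypothetical, and the check runs against a local path rather than asking Elasticsearch.

```python
import shutil

LOW_WATERMARK_PERCENT = 85.0  # default cited in the post
WATERMARK_SETTING = "cluster.routing.allocation.disk.watermark.low"


def used_percent(total_bytes, free_bytes):
    """Percentage of the disk that is in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def over_watermark(total_bytes, free_bytes, watermark_percent=LOW_WATERMARK_PERCENT):
    """True if disk usage is above the watermark, i.e. no new shards would be placed here."""
    return used_percent(total_bytes, free_bytes) > watermark_percent


def watermark_settings_body(percent):
    """Body for PUT /_cluster/settings that sets the low watermark transiently."""
    return {"transient": {WATERMARK_SETTING: "%d%%" % percent}}


def check_path(path):
    """Apply the same threshold to the file system holding `path` (local check only)."""
    usage = shutil.disk_usage(path)
    return over_watermark(usage.total, usage.free)


if __name__ == "__main__":
    print(check_path("."))
    print(watermark_settings_body(80))
```

The watermark only stops new shards from being allocated to a node. It does not stop shards already on that node from growing, so disk usage still needs separate monitoring.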


Reposted from: http://owgki.baihongyu.com/
