Analysis of Performance Issues When Writing to Elasticsearch from Spark

Overview

References:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/performance.html
https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster

An excerpt from the official documentation:

Write performance A crucial aspect in improving the write performance is to determine the maximum rate of data that Elasticsearch can ingest comfortably. This depends on many variables (data size, hardware, current load, etc..) but a good rule of thumb is for a bulk request to not take longer than 1-2s to be successfully processed. Since elasticsearch-hadoop performs parallel writes, it is important to keep this in mind across all tasks, which are created by Hadoop/Spark at runtime.

There is a common rule of thumb for judging whether the write rate is appropriate: a single bulk request should not take longer than 1-2 seconds to be processed successfully. The official documentation also suggests several directions for tuning, discussed below.
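To make the discussion concrete, here is a minimal sketch of a Spark job writing to Elasticsearch through elasticsearch-hadoop. The node address, index name, and sample data are made-up placeholders, not values from the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs

// Hypothetical cluster address; adjust to your environment.
val conf = new SparkConf()
  .setAppName("spark-write-es-demo")
  .set("es.nodes", "es-node1:9200")

val sc = new SparkContext(conf)

val docs = sc.parallelize(Seq(
  Map("user" -> "alice", "action" -> "login"),
  Map("user" -> "bob",   "action" -> "search")
))

// Each Spark task buffers documents and flushes them to Elasticsearch as
// bulk requests; the tuning goal is for each bulk request to be processed
// within roughly 1-2 seconds.
docs.saveToEs("logs/doc")
```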

Decrease bulk size Remember that elasticsearch-hadoop allows one to configure the number of entries and size for a batch write to Elasticsearch per task. That is, assuming there are T tasks, with a configuration of B bytes and N number of documents (where d is the average document size), the maximum number of bulk write requests at a given point in time can be TB bytes or TN number of docs (TNd in bytes) - which ever comes first. Thus for a job with 5 tasks, using the defaults (1mb or 1000 docs) means up to 5mb/5000 docs bulk size (spread across shards). If this takes more than 1-2s to be processed, there’s no need to decrease it. If it’s less then that, you can try increasing it in small steps.

Tune the bulk size per task. elasticsearch-hadoop limits each task's batch both by bytes (B) and by document count (N), so with T concurrent tasks the cluster may receive up to T*B bytes or T*N documents at once, whichever limit is hit first. If a bulk request takes more than 1-2 seconds to process, decrease the per-task batch; if it finishes much faster, increase it in small steps.
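A sketch of adjusting the per-task batch limits via the standard elasticsearch-hadoop settings es.batch.size.bytes and es.batch.size.entries; the concrete values below are only illustrative starting points, not recommendations.

```scala
// Per-task bulk limits: a task flushes when either limit is reached.
// Defaults are 1mb and 1000 documents, so e.g. 5 tasks with the defaults
// mean up to 5mb or 5000 documents in flight, spread across the shards.
conf.set("es.batch.size.bytes", "512kb")   // shrink if a bulk takes > 1-2s
conf.set("es.batch.size.entries", "500")   // grow in small steps if it is much faster
```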

Use a maximum limit of tasks writing to Elasticsearch In case of elasticsearch-hadoop, the runtime (Hadoop or Spark) can and will create multiple tasks that will write at the same time to Elasticsearch. In some cases, this leads to a disproportionate number of tasks (sometimes one or even two orders of magnitude higher) between what the user planned to use for Elasticsearch and the actual number. This appears in particular when dealing with inputs that are highly splittable which can easily generate tens or hundreds of tasks. If the target Elasticsearch cluster has just a few nodes, likely dealing with so much data at once will be problematic.

Put a cap on the number of tasks writing to Elasticsearch. The runtime (Hadoop or Spark) can create far more concurrent writing tasks than the Elasticsearch cluster was sized for, especially with highly splittable input that easily produces tens or hundreds of tasks. If the target cluster has only a few nodes, handling that much data at once is likely to be a problem.
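One common way to enforce such a cap on the Spark side (not prescribed by the quoted documentation, just a frequently used technique) is to reduce the number of partitions right before the write, since each partition becomes one writing task. The input path and partition count below are made up for illustration.

```scala
// A highly splittable input can easily yield hundreds of partitions, i.e.
// hundreds of tasks all bulk-writing to Elasticsearch at the same time.
val parsed = sc.textFile("hdfs:///path/to/input") // hypothetical input path
  .map(line => Map("message" -> line))

// Coalesce to a parallelism the ES cluster can comfortably absorb
// (16 here is only an illustration, not a recommendation).
parsed.coalesce(16).saveToEs("logs/doc")
```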

Understand why rejections happen Under too much load, Elasticsearch starts rejecting documents - in such a case elasticsearch-hadoop waits for a while (default 10s) and then retries (by default up to 3 times). If Elasticsearch keeps rejecting documents, the job will eventually fail. In such a scenario, monitor Elasticsearch (through Marvel or other plugins) and keep an eye on bulk processing. Look at the percentage of documents being rejected; it is perfectly fine to have some documents rejected but anything higher then 10-15% on a regular basis is a good indication the cluster is overloaded.

You need to understand why rejections happen. Under too much load, Elasticsearch starts rejecting documents. In that case elasticsearch-hadoop waits (10 seconds by default) and retries (up to 3 times by default); if Elasticsearch keeps rejecting, the job eventually fails. So keep an eye on metrics such as the percentage of rejected documents: some rejections are fine, but anything above 10-15% on a regular basis indicates an overloaded cluster.
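The waiting and retry behaviour described above maps to two elasticsearch-hadoop settings; the values below simply restate the documented defaults.

```scala
conf.set("es.batch.write.retry.count", "3")   // retry a rejected bulk up to 3 times
conf.set("es.batch.write.retry.wait", "10s")  // wait 10s before each retry
```

The rejections themselves are best observed on the Elasticsearch side, e.g. through the bulk thread pool statistics, which is what the second reference above discusses.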

Keep the number of retries and waits at bay As mentioned above, retries happen when documents are being rejected. This can be for a number of reasons - a sudden indexing spike, node relocation, etc… If your job keeps being aborted as the retries fail, this is a good indication your cluster is overloaded. Increasing it will not make the problem go away, rather it will just hide it under the rug; the job will be quite slow (as likely each write will be retried several times until all documents are acknowledged).

If the write job keeps retrying (and eventually aborting), that is a strong sign the Elasticsearch cluster is overloaded. Raising the retry count does not make the problem go away; it only hides it, and the job becomes very slow because each write may be retried several times before all documents are acknowledged.

Warning
This article was last updated on February 1, 2017. Its content may be out of date; please read it with that in mind.