runzhliu

Analyzing Performance Problems When Writing to ES from Spark

runzhliu, published 2017-02-01, updated 2017-02-01, filed under Big Data and Machine Learning

Overview. References: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/performance.html https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster An excerpt from the official documentation: Write performance A crucial aspect in improving the write performance is to determine the maximum rate of data that Elasticsearch can ingest comfortably. This depends on many variables (data size, hardware, current load, etc..) but a good rule of thumb is for a bulk request to not take longer than

Spark Performance Tuning: Shuffle Tuning

runzhliu, published 2017-02-01, updated 2017-02-01, filed under Big Data and Machine Learning

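Overview. This post is compiled from: https://www.cnblogs.com/haozhengfei/p/5fc4a976a864f33587b094f36b72c7d3.html Main text: Spark's underlying shuffle transport uses netty. During network transfer netty requests off-heap memory (netty is zero-copy), so off-heap memory is used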

Spark Performance Optimization Guide: Advanced

runzhliu, published 2017-02-01, updated 2017-02-01, filed under Big Data and Machine Learning

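Overview. Sometimes we run into one of the thorniest problems in big data computing, data skew, and when that happens a Spark job performs far worse than expected. Data skew tuning means using various techniques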

Spark Optimization

runzhliu, published 2017-02-01, updated 2017-02-01, filed under Big Data and Machine Learning

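Overview. This post is reposted from: https://blog.csdn.net/Winner941112/article/details/82899277 Spark Optimization (1): Avoid duplicate RDDs. Generally, when developing a Spark job, we first, based on some data source (such as a Hive table or an HDFS file), create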

RPC in Spark

runzhliu, published 2017-02-01, updated 2017-02-01, filed under Big Data and Machine Learning

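Overview. This post is a repost of: https://zhuanlan.zhihu.com/p/28893155 Spark is a fast, general-purpose distributed computing system, and being distributed means there must be communication between nodes. This post mainly describes how the different Spark components communicate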

SQL Basics

runzhliu, published 2017-02-01, updated 2017-02-01, filed under Storage

Overview. Below are a few SQL basics. DML: DML (data manipulation language) covers the statements we use most often: SELECT, UPDATE, INSERT