
Kubernetes PySpark Compute Jobs

Overview

To submit PySpark jobs on the elastic compute platform, you first need to understand the container image used when a job is submitted.

The default image is built from the latest code on the Spark master branch. Since the upcoming Spark 3.0 release will not change the image much, the Dockerfile on the master branch can be taken as the reference.
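As a rough sketch of what a submission looks like, the following Python wrapper assembles a spark-submit command against a Kubernetes cluster. The API server address, namespace, image tag, and example script path are placeholders, not values provided by the platform.

import subprocess

# Sketch of a PySpark submission to Kubernetes via spark-submit.
# The API server URL, namespace and image tag below are placeholders.
cmd = [
    "spark-submit",
    "--master", "k8s://https://kubernetes.example.com:6443",
    "--deploy-mode", "cluster",
    "--name", "pyspark-pi",
    "--conf", "spark.executor.instances=2",
    "--conf", "spark.kubernetes.namespace=default",
    "--conf", "spark.kubernetes.container.image=spark-py:latest",
    # local:// refers to a path inside the image built from the Dockerfile below
    "local:///opt/spark/examples/src/main/python/pi.py",
]
subprocess.run(cmd, check=True)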

Spark-py

ARG base_img

FROM $base_img
WORKDIR /

# Reset to root to run installation tasks
USER 0

RUN mkdir ${SPARK_HOME}/python
# TODO: Investigate running both pip and pip3 via virtualenvs
RUN apt install -y python python-pip && \
    apt install -y python3 python3-pip && \
    # We remove ensurepip since it adds no functionality since pip is
    # installed on the image and it just takes up 1.6MB on the image
    rm -r /usr/lib/python*/ensurepip && \
    pip install --upgrade pip setuptools && \
    # You may install with python3 packages by using pip3.6
    # Removed the .cache to save space
    rm -r /root/.cache && rm -rf /var/cache/apt/*

COPY python/pyspark ${SPARK_HOME}/python/pyspark
COPY python/lib ${SPARK_HOME}/python/lib

WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]

USER ${spark_uid}

The most important part of building the image is copying in the directories highlighted in the box on the left of the figure below, namely python/pyspark and python/lib from the Spark source tree.

/kuberntetes-pyspark%E8%AE%A1%E7%AE%97%E4%BB%BB%E5%8A%A1/image_1dn6la7a6pjafg41mf4jrknhn19.png
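Those two COPY lines matter because the scripts inside the image expect ${SPARK_HOME}/python (the pyspark package) and the py4j zip under ${SPARK_HOME}/python/lib to be on the Python path. Below is a minimal sanity-check sketch; the /opt/spark default and the py4j zip name pattern are assumptions that vary with the base image and Spark version.

import glob
import os
import sys

# Sanity check that the directories copied in the Dockerfile are importable.
# SPARK_HOME defaulting to /opt/spark is an assumption about the base image.
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
sys.path.insert(0, os.path.join(spark_home, "python"))
for zip_path in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")):
    sys.path.insert(0, zip_path)

import pyspark  # fails if COPY python/pyspark or COPY python/lib was skipped
print("pyspark", pyspark.__version__, "loaded from", pyspark.__file__)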

The PySpark architecture:

/kuberntetes-pyspark%E8%AE%A1%E7%AE%97%E4%BB%BB%E5%8A%A1/image_1dn6m94lel0a1no01q8t171s42o1m.png /kuberntetes-pyspark%E8%AE%A1%E7%AE%97%E4%BB%BB%E5%8A%A1/image_1dn6mrp3p5vofdv1i591hg7vl423.png /kuberntetes-pyspark%E8%AE%A1%E7%AE%97%E4%BB%BB%E5%8A%A1/image_1dn6nq9uo1graelv1bek1vte1po330.png /kuberntetes-pyspark%E8%AE%A1%E7%AE%97%E4%BB%BB%E5%8A%A1/image_1dn6pmt0013jiqah1l3b1qr01p0t3d.png
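To make the diagrams concrete, here is a minimal PySpark job of the kind that runs on this image: the SparkSession is created in the Python driver, which controls the JVM through a Py4J gateway, while the function passed to map() executes in the Python worker processes spawned by each executor. This is an illustrative sketch, not code shipped with the platform.

from operator import add
from random import random

from pyspark.sql import SparkSession

# Estimate Pi; the driver-side code below builds the job graph via Py4J,
# and the inside() function runs in executor-side Python workers.
spark = SparkSession.builder.appName("pyspark-pi").getOrCreate()

def inside(_):
    x, y = random(), random()
    return 1 if x * x + y * y <= 1.0 else 0

n = 1000000
count = spark.sparkContext.parallelize(range(n), 4).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()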

References

  1. https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals