Elastic Compute Platform: Component Deployment Guide

Overview

This article is the usage documentation I wrote at my previous company for integrating Spark into an elastic compute platform built on Kubernetes. We were among the earliest teams in China to run Spark workloads on Kubernetes.

When deploying these components, pay attention to the image versions and to the startup Command and Args.

For now the Deployments are collected here so they can be applied directly; they will gradually be migrated to Helm. Also note that some of these Deployments depend on ConfigMaps, so don't forget to create those as well.
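As an example of such a dependency, the spark-operator Deployment later in this document mounts a ConfigMap named `template` from the `kube-system` namespace. A minimal sketch of that ConfigMap is shown below; the data key and its contents are hypothetical placeholders, not the actual template used in production:

```yaml
# Hypothetical sketch: the ConfigMap must exist before the Deployment
# that mounts it, otherwise its Pods will be stuck in ContainerCreating.
apiVersion: v1
kind: ConfigMap
metadata:
  name: template
  namespace: kube-system
data:
  # Placeholder key/content; the real template file is site-specific.
  pod-template.yaml: |
    # pod template content goes here
```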

| Component            | Deployment method | Notes                                                        |
|----------------------|-------------------|--------------------------------------------------------------|
| tf-operator          | Deployment        | Runs distributed TensorFlow                                  |
| spark-operator       | Deployment        | Runs Spark 2.3+                                              |
| mpi-operator         | Deployment        | Runs MPI parallel computing jobs                             |
| spark-history-server | Deployment        | Required by Spark; one instance can be shared by all clusters |
| spark-track-server   | Deployment        | Required by Spark; one instance per cluster                  |

tf-operator

# NOTE: extensions/v1beta1 Deployments were removed in Kubernetes 1.16; use apps/v1 on newer clusters.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    component: tf-job-operator
  name: tf-job-operator
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      name: tf-job-operator
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8443"
        prometheus.io/scrape: "true"
      labels:
        name: tf-job-operator
    spec:
      containers:
      - command:
        - /opt/tf-operator.v1
        - --alsologtostderr
        - -v=5
        - --enable-gang-scheduling
        - --gang-scheduler-name=tencent-batch
        - --json-log-format=false
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: hub.oa.com/runzhliu/tf-operator:batch
        imagePullPolicy: Always
        ports:
        - containerPort: 8443
          protocol: TCP
          name: metrics
        name: tf-job-operator
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
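With tf-operator running, distributed TensorFlow jobs are submitted as TFJob custom resources. A minimal sketch is shown below, assuming the TFJob v1 CRD is installed; the job name and image are illustrative placeholders, not from the original setup:

```yaml
# Hypothetical TFJob sketch: one parameter server, two workers.
# The container must be named "tensorflow" for the TFJob v1 API.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-example        # placeholder name
  namespace: default
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: tensorflow
            image: your-registry/your-tf-image:tag  # placeholder image
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: tensorflow
            image: your-registry/your-tf-image:tag  # placeholder image
```

Because the operator above is started with `--enable-gang-scheduling`, its pods are scheduled as a group by the configured gang scheduler (`tencent-batch` here), so the PS and workers either all start together or not at all.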

spark-operator

# NOTE: extensions/v1beta1 Deployments were removed in Kubernetes 1.16; use apps/v1 on newer clusters.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    component: spark-operator
  name: spark-sparkoperator
  namespace: kube-system
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: spark-operator
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "10254"
        prometheus.io/scrape: "true"
      labels:
        component: spark-operator
    spec:
      volumes:
        - name: template
          configMap:
            name: template
      containers:
      - args:
        - -v=5
        - -namespace=
        - -ingress-url-format=
        - -controller-threads=10
        - -resync-interval=30
        - -logtostderr
        - -enable-batch-scheduler=true
        - -enable-metrics=true
        - -metrics-labels=project
        - -metrics-port=10254
        - -metrics-endpoint=/metrics
        - -metrics-prefix=
        image: hub.oa.com/runzhliu/spark-operator:latest-template
        imagePullPolicy: Always
        name: sparkoperator
        volumeMounts:
        - name: template
          mountPath: /etc/config
        ports:
        - containerPort: 10254
          protocol: TCP
          name: metrics
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 40
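With spark-operator deployed, Spark jobs are submitted as SparkApplication custom resources. A minimal sketch is shown below; the image, jar path, Spark version, and scheduler name are illustrative placeholders rather than the values used in the original platform:

```yaml
# Hypothetical SparkApplication sketch for the classic SparkPi example.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi             # placeholder name
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: your-registry/your-spark-image:tag  # placeholder image
  mainClass: org.apache.spark.examples.SparkPi
  # Placeholder path; adjust to where the examples jar lives in your image.
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  sparkVersion: "2.4.5"      # placeholder version (any 2.3+ release)
  batchScheduler: volcano    # placeholder; set to the cluster's batch scheduler
  driver:
    cores: 1
    memory: 512m
    serviceAccount: default
  executor:
    instances: 2
    cores: 1
    memory: 512m
```

Note that `batchScheduler` only takes effect because the operator above is started with `-enable-batch-scheduler=true`.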
Warning
This article was last updated on January 9, 2024. Its content may be outdated; refer to it with caution.