Overview
This article is the usage documentation for the Spark integration on the Kubernetes-based elastic computing platform that I built at my previous company. We were among the earliest teams in China to run Spark workloads on Kubernetes.
When deploying these components, pay attention to the `Image` versions as well as the startup `Commands` and `Args`. For now the `Deployment` manifests are collected here so they can be applied directly; they will gradually be migrated to Helm later. Also note that some of these `Deployment`s may depend on a `ConfigMap`, so make sure not to leave those out.
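As an example of the `ConfigMap` dependency: the spark-operator `Deployment` further down mounts a `ConfigMap` named `template` into `/etc/config`, so it has to exist before the operator pods can start. A minimal stub sketch follows; the actual contents are platform-specific and are not shown in this document:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: template        # referenced by the spark-operator Deployment below
  namespace: kube-system
data: {}                # actual contents are platform-specific and omitted here
```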
| Component | Deployment method | Notes |
| --- | --- | --- |
| tf-operator | deployment | Runs distributed TensorFlow jobs |
| spark-operator | deployment | Runs Spark 2.3+ |
| mpi-operator | deployment | Runs MPI parallel computing jobs |
| spark-history-server | deployment | Required by Spark; a single instance can be shared by all clusters |
| spark-track-server | deployment | Required by Spark; one instance per cluster |
tf-operator
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    component: tf-job-operator
  name: tf-job-operator
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      name: tf-job-operator
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8443"
        prometheus.io/scrape: "true"
      labels:
        name: tf-job-operator
    spec:
      containers:
      - command:
        - /opt/tf-operator.v1
        - --alsologtostderr
        - -v=5
        # Gang scheduling: all pods of a job are scheduled together
        - --enable-gang-scheduling
        - --gang-scheduler-name=tencent-batch
        - --json-log-format=false
        env:
        # Downward API: expose the operator pod's own namespace and name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: hub.oa.com/runzhliu/tf-operator:batch
        imagePullPolicy: Always
        ports:
        - containerPort: 8443
          protocol: TCP
          name: metrics
        name: tf-job-operator
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
```
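With tf-operator deployed, distributed TensorFlow jobs are submitted as `TFJob` custom resources. Below is a minimal sketch, assuming the `kubeflow.org/v1` API served by this operator build; the job name and training image are hypothetical placeholders:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-mnist                  # hypothetical example job
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: tensorflow        # TFJob requires this container name
            image: <your-tf-image>  # placeholder training image
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: tensorflow
            image: <your-tf-image>  # placeholder training image
```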
spark-operator
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    component: spark-operator
  name: spark-sparkoperator
  namespace: kube-system
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: spark-operator
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "10254"
        prometheus.io/scrape: "true"
      labels:
        component: spark-operator
    spec:
      volumes:
      # The "template" ConfigMap must exist before the pods can start
      - name: template
        configMap:
          name: template
      containers:
      - args:
        - -v=5
        - -namespace=               # empty: watch all namespaces
        - -ingress-url-format=
        - -controller-threads=10
        - -resync-interval=30
        - -logtostderr
        - -enable-batch-scheduler=true
        - -enable-metrics=true
        - -metrics-labels=project
        - -metrics-port=10254
        - -metrics-endpoint=/metrics
        - -metrics-prefix=
        image: hub.oa.com/runzhliu/spark-operator:latest-template
        imagePullPolicy: Always
        name: sparkoperator
        volumeMounts:
        - name: template
          mountPath: /etc/config
        ports:
        - containerPort: 10254
          protocol: TCP
          name: metrics
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 40
```
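Once spark-operator is running, Spark jobs are described declaratively as `SparkApplication` objects. A minimal sketch follows; the API version (`v1beta2` here) depends on the operator build, and the image and jar path are placeholders:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: <your-spark-image>         # placeholder: a Spark 2.3+ image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar  # placeholder path
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 512m
    serviceAccount: default         # matches the Deployment's service account
  executor:
    instances: 2
    cores: 1
    memory: 512m
```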