
Spark and Kerberos

6 Hadoop Security Guide

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_Security_Guide/content/kerberos-overview.html

To create secure communication among its various components, HDP uses Kerberos. Kerberos is a third-party authentication mechanism, in which users and services that users wish to access rely on the Kerberos server to authenticate each to the other. This mechanism also supports encrypting all traffic between the user and the service.

The Kerberos server itself is known as the Key Distribution Center, or KDC. At a high level, it has three parts:

  • A database of users and services (known as principals) and their respective Kerberos passwords
  • An authentication server (AS) which performs the initial authentication and issues a Ticket Granting Ticket (TGT)
  • A Ticket Granting Server (TGS) that issues subsequent service tickets based on the initial TGT.

A user principal requests authentication from the AS (Authentication Server). The AS returns a TGT (Ticket Granting Ticket) that is encrypted using the user principal's Kerberos password, which is known only to the user principal and the AS. The user principal decrypts the TGT locally using its Kerberos password, and from that point forward, until the ticket expires, the user principal can use the TGT to get service tickets from the TGS (Ticket Granting Server).

Because a service principal cannot provide a password each time to decrypt the TGT, it uses a special file, called a keytab, which contains its authentication credentials.

The service tickets allow the principal to access various services. The set of hosts, users, and services over which the Kerberos server has control is called a realm.
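
This flow is visible directly with the MIT Kerberos client tools; a minimal illustration, assuming a user principal user@EXAMPLE.COM already exists in the KDC (names are illustrative):

kinit user@EXAMPLE.COM   # authenticate against the AS; the TGT is cached locally
klist                    # the cache now holds krbtgt/EXAMPLE.COM@EXAMPLE.COM, i.e. the TGT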

5 Kerberos user authentication for Spark workload

By default, Hadoop components perform no authentication among themselves, so a malicious party can impersonate a component (for example, the NameNode), join the cluster, and do damage. With Kerberos, keys can be placed in advance on trusted nodes with restricted access; when a service on such a node starts, it reads its key, authenticates against Kerberos, and only then joins the Hadoop cluster. Note that the discussion here mainly concerns service-to-service authentication and does not cover users.

Think of an amusement park: at the gate, the user's identity is checked (ID and password). If the user passes the check, the gatekeeper (the AS) hands the user an entry card (the TGT).

This card tells every attraction in the park that the user came in through the front gate rather than sneaking over the back fence, and it is also the key to gaining admission to each attraction.

The card itself is not what the user came for, though; the user came to ride the attractions. Suppose the user now spots the Ferris wheel and wants to ride it.

The Ferris wheel attendant stops the user and asks for a Ferris wheel ticket (an ST, or service ticket). The user only has the entry card (TGT), so the user swipes the TGT at the ticket machine nearby (the TGS). Based on the attraction the user is standing at, the TGS issues a Ferris wheel ticket (ST), and with that ticket in hand the user can walk straight in and ride the Ferris wheel.

Likewise, if the user wants to rest in the park's café after the ride, the same entry card (TGT) is swiped at the café's ticket machine (TGS) to get a café ticket (ST) for admission. But once the user leaves the park and tries to use the TGT to pay for a taxi home, it no longer works: the TGT expired, and was destroyed, the moment the user left the park.

A principal can be understood as the name of a user or service. It is unique across the whole cluster and consists of three parts:

username(or servicename)/instance@realm

For example: nn/zelda1@ZELDA.COM.

  • zelda1: a machine in the cluster; another example is admin/admin@ZELDA.COM, an administrator account
  • username or servicename: in this article these are services; the two HDFS services are named nn and dn, i.e. namenode and datanode
  • instance: in this article, the FQDN of the specific machine, used to guarantee global uniqueness (for example, each of several datanode nodes must authenticate independently)
  • realm: the Kerberos realm, here ZELDA.COM

For the various HDFS services, a keytab is the better fit: generate a keytab file containing one or more principal+key pairs. For example, the keytab file specified for nn in the HDFS configuration:

<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/security/nn.service.keytab</value>
</property>
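
To produce such a keytab in the first place, the service principal is created and its keys exported with the kadmin tools. A minimal sketch, assuming an MIT Kerberos KDC and the ZELDA.COM realm used in this article (paths are illustrative):

kadmin.local -q "addprinc -randkey nn/zelda1@ZELDA.COM"
kadmin.local -q "ktadd -k /etc/security/nn.service.keytab nn/zelda1@ZELDA.COM"
klist -kt /etc/security/nn.service.keytab   # verify the exported keys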

4 Submitting Spark batch applications to Kerberos-enabled HDFS with keytab

Reference core-site configuration for Hadoop:

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>

You can submit Spark batch applications from the cluster management console (on the My Applications & Notebooks page or the Spark Instance Groups page), by using ascd Spark RESTful APIs, or by using the spark-submit command in the Spark deployment directory.

Submit a Spark batch application using the following spark-submit syntax for keytab authentication:

spark-submit \
--master spark://spark_master_url \
--conf spark.yarn.keytab=path_to_keytab \
--conf spark.yarn.principal=principal@REALM.COM \
--class main-class \
application-jar hdfs://namenode:9000/path/to/input
  • spark://spark_master_url identifies the master URL of the Spark instance group to submit the Spark batch application.
  • spark.yarn.keytab=path_to_keytab specifies the full path to the file that contains the keytab for the specified principal, for example, /home/test/test.keytab. Ensure that the execution user for the Spark driver consumer in the Spark instance group has access to the keytab file.
  • spark.yarn.principal=principal@REALM.COM specifies the principal used to log in to the KDC while running on Kerberos-enabled HDFS, for example, user@EXAMPLE.COM.
  • hdfs://namenode:9000/path/to/input specifies the fully qualified URL of the HDFS Namenode. Submitting workload with keytab enables the HDFS delegation token to be refreshed and generates the Spark YARN credential file in the home directory of the submission user in HDFS. Ensure that this directory already exists in HDFS.
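
For concreteness, a filled-in submission under the same syntax; the master URL, class, and jar below are illustrative placeholders, while the keytab path and principal reuse the examples from the bullets above:

spark-submit \
--master spark://master1:7077 \
--conf spark.yarn.keytab=/home/test/test.keytab \
--conf spark.yarn.principal=user@EXAMPLE.COM \
--class com.example.WordCount \
wordcount.jar hdfs://namenode:9000/path/to/input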

3 Using Kerberos with YARN, Spark, and Hive

https://blog.csdn.net/dxl342/article/details/56006001

Spark job authentication

The goal is for spark-submit to authenticate against Kerberos when submitting a job, so that it can submit work to YARN, access HDFS, and so on (i.e., work with Kerberized HDFS).

For jobs submitted with spark-submit, there are two ways to pass Kerberos authentication:

  1. Run kinit -k -t /etc/security/xx.keytab user/host@REALM.COM first, then submit with spark-submit as usual
  2. Pass them as parameters to spark-submit: --keytab /etc/security/dtdream.zelda1.keytab --principal dtdream/zelda1@ZELDA.COM

Either way the credentials come from a keytab; the only difference is whether they are read from the current principal cache (populated by kinit) or read from the keytab by spark-submit itself.

1 Add a new principal:

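# Inside the kadmin shell: addprinc creates the principal with a random key,
# and xst exports its keys to the keytab file dtdream.spark.keytab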
addprinc -randkey dtdream/zelda1@ZELDA.COM
xst -k dtdream.spark.keytab dtdream/zelda1@ZELDA.COM

2 Copy the generated dtdream.spark.keytab file to /etc/security/

3 Run kinit, then submit the job:

kinit -kt /etc/security/dtdream.spark.keytab  dtdream/zelda1

klist  # check the principal cache

./bin/spark-submit \
--master yarn \
--class org.apache.spark.examples.SparkLR \
--name SparkLR \
lib/spark-examples-1.6.1-hadoop2.6.0.jar

Or skip kinit and specify the keytab path directly:

./bin/spark-submit \
--keytab /etc/security/dtdream.spark.keytab \
--principal dtdream/zelda1@ZELDA.COM \
--master yarn \
--class org.apache.spark.examples.SparkLR \
--name SparkLR \
lib/spark-examples-1.6.1-hadoop2.6.0.jar

The Spark SQL thriftserver is itself a Spark job submitted to YARN through spark-submit, so before starting it you must either run kinit or specify a keytab so that spark-submit can log in from the keytab on its own.
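
A minimal sketch of the keytab variant for the thriftserver, reusing the principal created above; paths are illustrative, and start-thriftserver.sh forwards these options to spark-submit:

./sbin/start-thriftserver.sh \
--master yarn \
--keytab /etc/security/dtdream.spark.keytab \
--principal dtdream/zelda1@ZELDA.COM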

spark-submit also accepts a --proxy-user parameter, which impersonates another user when submitting the job.
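
A sketch of proxy-user submission. Note that --proxy-user cannot be combined with --principal/--keytab, so kinit is used first, and the Hadoop side must also allow impersonation via the hadoop.proxyuser.<user>.hosts and hadoop.proxyuser.<user>.groups settings in core-site.xml; the user name alice is illustrative:

kinit -kt /etc/security/dtdream.spark.keytab dtdream/zelda1
./bin/spark-submit \
--master yarn \
--proxy-user alice \
--class org.apache.spark.examples.SparkLR \
lib/spark-examples-1.6.1-hadoop2.6.0.jar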

2 How Spark interacts with Hadoop systems

https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html

Running in a Secure Cluster

As covered in security, Kerberos is used in a secure Hadoop cluster to authenticate principals associated with services and clients. This allows clients to make requests of these authenticated services, and allows the services to grant rights to the authenticated principals.

Hadoop services issue hadoop tokens to grant access to the services and data. Clients must first acquire tokens for the services they will access and pass them along with their application as it is launched in the YARN cluster.

For a Spark application to interact with any of the Hadoop filesystems (for example hdfs or webhdfs), HBase, or Hive, it must acquire the relevant tokens using the Kerberos credentials of the user launching the application, that is, the principal whose identity will become that of the launched Spark application.

This is normally done at launch time: in a secure cluster Spark will automatically obtain a token for the cluster’s default Hadoop filesystem, and potentially for HBase and Hive.

An HBase token will be obtained if HBase is on the classpath, the HBase configuration declares the application is secure (i.e. hbase-site.xml sets hbase.security.authentication to kerberos), and spark.security.credentials.hbase.enabled is not set to false.

Similarly, a Hive token will be obtained if Hive is on the classpath, its configuration includes a URI of the metadata store in hive.metastore.uris, and spark.security.credentials.hive.enabled is not set to false.

If an application needs to interact with other secure Hadoop filesystems, then the tokens needed to access these clusters must be explicitly requested at launch time. This is done by listing them in the spark.yarn.access.hadoopFileSystems property.

spark.yarn.access.hadoopFileSystems hdfs://ireland.example.org:8020/,webhdfs://frankfurt.example.org:50070/
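
The same property can also be passed at submit time with --conf; a sketch reusing the example jar from earlier:

./bin/spark-submit \
--master yarn \
--conf spark.yarn.access.hadoopFileSystems=hdfs://ireland.example.org:8020/,webhdfs://frankfurt.example.org:50070/ \
--class org.apache.spark.examples.SparkLR \
lib/spark-examples-1.6.1-hadoop2.6.0.jar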

Spark supports integrating with other security-aware services through the Java Services mechanism (see java.util.ServiceLoader). To do that, implementations of org.apache.spark.deploy.yarn.security.ServiceCredentialProvider should be made available to Spark by listing their names in the corresponding file in the jar's META-INF/services directory. These plug-ins can be disabled by setting spark.security.credentials.{service}.enabled to false, where {service} is the name of the credential provider.
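
As an illustration, a hypothetical provider class com.example.MyCredentialProvider whose service name is my-service would be registered inside its jar, and could later be disabled, like this (all names here are made up for the example):

# Registration file inside the plugin jar:
#   META-INF/services/org.apache.spark.deploy.yarn.security.ServiceCredentialProvider
# containing the single line:
#   com.example.MyCredentialProvider

# Disabling the provider at submit time:
./bin/spark-submit --conf spark.security.credentials.my-service.enabled=false ...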

1 Spark's documentation on Kerberos

https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html

Troubleshooting Kerberos

Debugging Hadoop/Kerberos problems can be “difficult”. One useful technique is to enable extra logging of Kerberos operations in Hadoop by setting the HADOOP_JAAS_DEBUG environment variable.

export HADOOP_JAAS_DEBUG=true

The JDK classes can be configured to enable extra logging of their Kerberos and SPNEGO/REST authentication via the system properties sun.security.krb5.debug and sun.security.spnego.debug:

-Dsun.security.krb5.debug=true
-Dsun.security.spnego.debug=true

All these options can be enabled in the Application Master:

spark.yarn.appMasterEnv.HADOOP_JAAS_DEBUG true
spark.yarn.am.extraJavaOptions -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug=true

Finally, if the log level for org.apache.spark.deploy.yarn.Client is set to DEBUG, the log will include a list of all tokens obtained, and their expiry details.
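
With the log4j 1.x configuration that Spark 1.x/2.x ship with, that is a single line in conf/log4j.properties:

log4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG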

The gist is that Kerberos problems in Spark are very hard to debug.

Warning
This article was last updated on February 1, 2017. Its content may be out of date; refer to it with caution.