Troubleshooting Spark Track Server

Overview

Spark Track Server has been unreliable for some time: requests sometimes succeed and sometimes fail. A failed request looks like this.

/spark-track-server%E9%97%AE%E9%A2%98%E6%8E%92%E6%9F%A5/image_1ddulsjji1qpqqgo1bbk18rl1c3cm.png

Replay

First, identify the node where the problem occurs.

/spark-track-server%E9%97%AE%E9%A2%98%E6%8E%92%E6%9F%A5/image_1ddulv470rkklhvpruuuuet13.png

The error log is as follows.

/spark-track-server%E9%97%AE%E9%A2%98%E6%8E%92%E6%9F%A5/image_1ddulql0gqefhkf124e153d1ej69.png
2019-06-22T12:04:40.35760024+08:00 2019-06-22 12:04:40 INFO FabricUtils$:19 - k8s master : https://shlxjsv18me1.k8s.so.db:6443
2019-06-22T12:04:40.701046264+08:00 2019-06-22 12:04:40 INFO MainLogicServlet:22 - pod persona:b648b0a4-932d-11e9-b69f-0a58061025ed status is false
2019-06-22T12:04:42.204143069+08:00 2019-06-22 12:04:42 INFO Utils$:29 - Read timed out
2019-06-22T12:04:42.204165094+08:00 java.net.SocketTimeoutException: Read timed out
2019-06-22T12:04:42.204170352+08:00 at java.net.SocketInputStream.socketRead0(Native Method)
2019-06-22T12:04:42.204175359+08:00 at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
2019-06-22T12:04:42.204180337+08:00 at java.net.SocketInputStream.read(SocketInputStream.java:170)
2019-06-22T12:04:42.204185096+08:00 at java.net.SocketInputStream.read(SocketInputStream.java:141)
2019-06-22T12:04:42.204189885+08:00 at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
2019-06-22T12:04:42.204194544+08:00 at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
2019-06-22T12:04:42.204198867+08:00 at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
2019-06-22T12:04:42.204203285+08:00 at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
2019-06-22T12:04:42.204207766+08:00 at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
2019-06-22T12:04:42.20421656+08:00 at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1536)
2019-06-22T12:04:42.204221328+08:00 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
2019-06-22T12:04:42.204225921+08:00 at Utils$.httpGetInner(Utils.scala:18)
2019-06-22T12:04:42.204230093+08:00 at Utils$.httpGet(Utils.scala:30)
2019-06-22T12:04:42.204234216+08:00 at MainLogicServlet.processHistory(MainLogicServlet.scala:39)
2019-06-22T12:04:42.204238427+08:00 at MainLogicServlet.doGet(MainLogicServlet.scala:33)
2019-06-22T12:04:42.204242672+08:00 at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
2019-06-22T12:04:42.204246899+08:00 at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
2019-06-22T12:04:42.204251126+08:00 at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
2019-06-22T12:04:42.20425557+08:00 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)
2019-06-22T12:04:42.204260156+08:00 at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
2019-06-22T12:04:42.204264762+08:00 at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
2019-06-22T12:04:42.204269124+08:00 at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
2019-06-22T12:04:42.20427376+08:00 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
2019-06-22T12:04:42.204278184+08:00 at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
2019-06-22T12:04:42.204284783+08:00 at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
2019-06-22T12:04:42.204289268+08:00 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
2019-06-22T12:04:42.204293648+08:00 at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
2019-06-22T12:04:42.204298117+08:00 at org.eclipse.jetty.server.Server.handle(Server.java:531)
2019-06-22T12:04:42.204302842+08:00 at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
2019-06-22T12:04:42.204307277+08:00 at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
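
Most of the frames in this trace belong to the JDK and Jetty. The following is a minimal sketch for pulling the exception and the application-level frames out of such a trace. It assumes every frame line has the `<timestamp> at <frame>` layout shown above; the function name `summarize_trace` and the list of library prefixes are illustrative and are not part of Spark Track Server.

```python
import re

# Package prefixes treated as library code (an assumption for this sketch).
LIBRARY_PREFIXES = ("java.", "javax.", "sun.", "org.eclipse.jetty.")

# "at some.Class.method(File.scala:18)" anywhere in a log line.
FRAME_RE = re.compile(r"\bat\s+([\w.$]+)\(([^)]*)\)")
# A fully qualified exception name followed by ": message".
EXCEPTION_RE = re.compile(r"\b((?:[\w$]+\.)+[\w$]*(?:Exception|Error)): (.*)$")


def summarize_trace(lines):
    """Return (exception, message, application_frames) for a stack trace."""
    exception, message, app_frames = None, None, []
    for line in lines:
        frame = FRAME_RE.search(line)
        if frame:
            method = frame.group(1)
            # Keep only frames that are not JDK or Jetty code.
            if not method.startswith(LIBRARY_PREFIXES):
                app_frames.append(f"{method}({frame.group(2)})")
            continue
        found = EXCEPTION_RE.search(line)
        if found and exception is None:
            exception, message = found.group(1), found.group(2)
    return exception, message, app_frames
```

Applied to the log above, the application frames that remain are the `Utils$.httpGetInner`, `Utils$.httpGet`, `MainLogicServlet.processHistory` and `MainLogicServlet.doGet` entries, which shows that the timeout happens in an outbound HTTP GET issued while handling a history request.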

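The log only shows that Spark Track Server keeps reporting Read timed out. At first I suspected that the timeout was simply set too short. However, the live Spark Web UI almost never times out, even though the InputStream it transfers can also be very large, so the amount of data served by the History Server is unlikely to be the cause.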
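
One way to test the "timeout too short" hypothesis is to request the same URL with increasing timeouts and record how long each attempt takes. The sketch below uses only the Python standard library; `timed_get`, `probe` and the timeout values are hypothetical helpers for illustration, not the code behind `Utils$.httpGet`. Note that the `timeout` passed to `urlopen` applies to each blocking socket operation rather than to the request as a whole.

```python
import time
import urllib.request


def timed_get(url, timeout_s):
    """Fetch url once and return (ok, elapsed_seconds, detail)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            body = response.read()
            detail = f"HTTP {response.status}, {len(body)} bytes"
            ok = True
    except OSError as exc:  # covers URLError, HTTPError and socket timeouts
        detail = f"{type(exc).__name__}: {exc}"
        ok = False
    return ok, time.monotonic() - start, detail


def probe(url, timeouts_s=(2, 10, 30)):
    """Try the same URL with increasing timeouts; stop at the first success."""
    results = []
    for timeout_s in timeouts_s:
        ok, elapsed, detail = timed_get(url, timeout_s)
        results.append((timeout_s, ok, round(elapsed, 2), detail))
        if ok:
            break
    return results
```

If every attempt fails no matter how generous the timeout is, the server is probably not slow but failing to produce a response at all, which points to the server-side logs.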

So the only option left was to look at the History Server logs in detail, which revealed the following error.

/spark-track-server%E9%97%AE%E9%A2%98%E6%8E%92%E6%9F%A5/image_1ddusmdl91db28m9o5i123r19v11t.png

It turned out to be a known issue reported in the Spark JIRA: "Spark history server fails to render compressed inprogress history file in some cases".

/spark-track-server%E9%97%AE%E9%A2%98%E6%8E%92%E6%9F%A5/image_1ddv032fq84j11cm1een3sggo02a.png

Unfortunately, the fix is only available from Spark 2.2.1 onward, while our History Server is based on Spark 2.2.0, so its code clearly does not include the fix.

Fix

The solution is to upgrade the Spark History Server. Rebuilding the image on top of Spark 2.3.0 is enough.
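
After rebuilding the image, it can be useful to confirm that the running History Server really reports a version that contains the fix. The sketch below is one way to do that, under two assumptions: the fix is present from 2.2.1 inclusive, as read from the text above, and the server exposes the Spark REST endpoint `/api/v1/version`, returning a JSON object with a `spark` field. The helper names are made up for this example.

```python
import json
import urllib.request

# First release assumed to contain the fix (inclusive reading of the text above).
FIRST_FIXED = (2, 2, 1)


def parse_version(text):
    """Turn '2.3.0' or '2.3.0-SNAPSHOT' into a tuple such as (2, 3, 0)."""
    core = text.strip().split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def has_fix(version_text):
    """True if the given Spark version is at or above FIRST_FIXED."""
    return parse_version(version_text) >= FIRST_FIXED


def reported_version(base_url, timeout_s=10):
    """Ask a running History Server for its Spark version via the REST API."""
    url = f"{base_url}/api/v1/version"
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return json.load(response)["spark"]
```

For example, `has_fix(reported_version("http://history-server:18080"))` should be true after the upgrade. The host name here is a placeholder, and 18080 is the History Server's default UI port.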

Warning
This article was last updated on February 1, 2017. Its content may be out of date, so please use it with caution.