Overview
Our Spark Track Server had been flaky for a while: requests would sometimes succeed and sometimes fail, with no obvious pattern.
Replay
The first step was to find the problematic node. The error log looked like this:
2019-06-22T12:04:40.35760024+08:00 2019-06-22 12:04:40 INFO FabricUtils$:19 - k8s master : https://shlxjsv18me1.k8s.so.db:6443
2019-06-22T12:04:40.701046264+08:00 2019-06-22 12:04:40 INFO MainLogicServlet:22 - pod persona:b648b0a4-932d-11e9-b69f-0a58061025ed status is false
2019-06-22T12:04:42.204143069+08:00 2019-06-22 12:04:42 INFO Utils$:29 - Read timed out
2019-06-22T12:04:42.204165094+08:00 java.net.SocketTimeoutException: Read timed out
2019-06-22T12:04:42.204170352+08:00 at java.net.SocketInputStream.socketRead0(Native Method)
2019-06-22T12:04:42.204175359+08:00 at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
2019-06-22T12:04:42.204180337+08:00 at java.net.SocketInputStream.read(SocketInputStream.java:170)
2019-06-22T12:04:42.204185096+08:00 at java.net.SocketInputStream.read(SocketInputStream.java:141)
2019-06-22T12:04:42.204189885+08:00 at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
2019-06-22T12:04:42.204194544+08:00 at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
2019-06-22T12:04:42.204198867+08:00 at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
2019-06-22T12:04:42.204203285+08:00 at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
2019-06-22T12:04:42.204207766+08:00 at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
2019-06-22T12:04:42.20421656+08:00 at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1536)
2019-06-22T12:04:42.204221328+08:00 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
2019-06-22T12:04:42.204225921+08:00 at Utils$.httpGetInner(Utils.scala:18)
2019-06-22T12:04:42.204230093+08:00 at Utils$.httpGet(Utils.scala:30)
2019-06-22T12:04:42.204234216+08:00 at MainLogicServlet.processHistory(MainLogicServlet.scala:39)
2019-06-22T12:04:42.204238427+08:00 at MainLogicServlet.doGet(MainLogicServlet.scala:33)
2019-06-22T12:04:42.204242672+08:00 at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
2019-06-22T12:04:42.204246899+08:00 at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
2019-06-22T12:04:42.204251126+08:00 at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
2019-06-22T12:04:42.20425557+08:00 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)
2019-06-22T12:04:42.204260156+08:00 at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
2019-06-22T12:04:42.204264762+08:00 at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
2019-06-22T12:04:42.204269124+08:00 at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
2019-06-22T12:04:42.20427376+08:00 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
2019-06-22T12:04:42.204278184+08:00 at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
2019-06-22T12:04:42.204284783+08:00 at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
2019-06-22T12:04:42.204289268+08:00 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
2019-06-22T12:04:42.204293648+08:00 at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
2019-06-22T12:04:42.204298117+08:00 at org.eclipse.jetty.server.Server.handle(Server.java:531)
2019-06-22T12:04:42.204302842+08:00 at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
2019-06-22T12:04:42.204307277+08:00 at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
From these logs, all we could see was the Spark Track Server repeatedly reporting Read Timeout. At first I assumed the timeout value was simply set too short. But the live Spark Web UI almost never timed out, even though the InputStream it transfers can also be quite large, so the problem was unlikely to be related to the History Server's data volume either.
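The post doesn't show the internals of Utils.httpGet, but the stack trace (sun.net.www.http.HttpClient.parseHTTPHeader) points at a plain HttpURLConnection whose read timeout fired while waiting for the response headers. A minimal sketch that reproduces the symptom, assuming that style of client (TimeoutDemo, fetch, and the silent server are illustrative names, not the project's actual code):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URL;

public class TimeoutDemo {

    // Fetch a URL with explicit connect/read timeouts.
    // Returns a short status string instead of throwing, for easy demonstration.
    static String fetch(String url, int timeoutMs) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs); // without this, a stalled server blocks the thread forever
            try {
                // Blocks in parseHTTPHeader until the server sends headers or the timeout fires.
                conn.getInputStream().close();
                return "ok";
            } finally {
                conn.disconnect();
            }
        } catch (SocketTimeoutException e) {
            return "read timed out";   // the exact failure mode seen in the log
        } catch (IOException e) {
            return "io error: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // A server that accepts TCP connections but never writes an HTTP response,
        // mimicking the stuck History Server pod.
        try (ServerSocket silent = new ServerSocket(0)) {
            String url = "http://127.0.0.1:" + silent.getLocalPort() + "/";
            System.out.println(fetch(url, 500));
        }
    }
}
```

The point of the sketch: a short read timeout only *surfaces* the problem; the root cause is the server side never producing a response, which is why raising the timeout would not have helped.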
That left digging into the History Server's own logs, where the following error turned up: "Spark history server fails to render compressed inprogress history file in some cases". It turned out this is a known issue reported on the Spark JIRA.
Unfortunately, the fix only landed in releases after 2.2.1, while our History Server was built on Spark 2.2.0, so our code obviously did not include the fix.
Fix
The solution was simply to upgrade the Spark History Server: rebuilding the image on top of Spark 2.3.0 was enough.
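The post doesn't include the actual build file, but the rebuild can be sketched roughly as follows (base image, download URL, and paths are all assumptions for illustration, not the project's real Dockerfile):

```dockerfile
# Illustrative only: rebuild the History Server image on Spark 2.3.0,
# a release that includes the fix for compressed in-progress history files.
FROM openjdk:8-jre

# Official archive URL for the 2.3.0 binary distribution.
ADD https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz /opt/
RUN tar -xzf /opt/spark-2.3.0-bin-hadoop2.7.tgz -C /opt/ \
    && ln -s /opt/spark-2.3.0-bin-hadoop2.7 /opt/spark

# History Server reads event logs from wherever spark.history.fs.logDirectory points.
CMD ["/opt/spark/sbin/start-history-server.sh"]
```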
Warning
This article was last updated on February 1, 2017. Its content may be outdated; refer to it with caution.