Spark 2.2 Thrift Server error on DataFrame NumberFormatException when querying a Hive table
Question:
I have Hortonworks HDP 2.6.3 running Spark2 (v2.2). My test case is very simple:
- Create a Hive table with some random values. Hive is on port 10000.
- Start the Spark Thrift Server on port 10016.
- Run pyspark and query the Hive table through port 10016.
However, I cannot get the data from Spark because of a NumberFormatException.
Here is my test case:
- Create a Hive table with sample rows:
beeline> !connect jdbc:hive2://localhost:10000/default hive hive
create table test1 (id int, desc varchar(40));
insert into table test1 values (1,"aa"),(2,"bb");
- Start the Spark Thrift Server:
su spark -c '/usr/hdp/2.6.3.0-235/spark2/sbin/start-thriftserver.sh --master yarn-client --executor-memory 512m --hiveconf hive.server2.thrift.port=10016'
- Run pyspark as the spark user: su spark -c 'pyspark'
- Enter the following code:
df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="test1", user="hive", password="hive").load()
df.select("*").show()
- I got this error:
17/12/15 08:04:13 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.sql.SQLException: Cannot convert column 1 to integer: java.lang.NumberFormatException: For input string: "id"
    at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:351)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:394)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:393)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NumberFormatException: For input string: "id"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.valueOf(Integer.java:766)
    at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:346)
    ... 23 more
17/12/15 08:04:13 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): java.sql.SQLException: Cannot convert column 1 to integer: java.lang.NumberFormatException: For input string: "id"
    at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:351)
    ... (same stack trace as above)
Caused by: java.lang.NumberFormatException: For input string: "id"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.valueOf(Integer.java:766)
    at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:346)
    ... 23 more
17/12/15 08:04:14 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 336, in show
    print(self._jdf.showString(n, 20))
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o75.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): java.sql.SQLException: Cannot convert column 1 to integer: java.lang.NumberFormatException: For input string: "id"
    at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:351)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils ...
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2854)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
    at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2838)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2837)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2154)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2367)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:245)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: Cannot convert column 1 to integer: java.lang.NumberFormatException: For input string: "id"
    at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:351)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:394)
    ... (same executor-side stack trace as above)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more
Caused by: java.lang.NumberFormatException: For input string: "id"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.valueOf(Integer.java:766)
    at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:346)
    ... 23 more
- I suspected it had something to do with the id column, so I changed the query to: df.select("desc").show()
- Then I got this strange result:
+----+ |desc| +----+ |desc| |desc| +----+
- If I go back to beeline, everything works fine through port 10016:
beeline> !connect jdbc:hive2://localhost:10016/default hive hive
select * from test1;
+-----+-------+--+
| id  | desc  |
+-----+-------+--+
| 1   | aa    |
| 2   | bb    |
+-----+-------+--+
- If I change the port to 10000 in pyspark, the same problem persists.
Could you help me understand why this happens and how I can get the rows through Spark?
Update 1
I followed @Achyuth's advice in both cases below, but they still do not work.
Case 1
Beeline:
create table test4 (id String, desc String);
insert into table test4 values ("1","aa"),("2","bb");
select * from test4;
Pyspark:
>>> df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="test4",user="hive", password="hive").option("fetchsize", "10").load()
>>> df.select("*").show()
+---+----+
| id|desc|
+---+----+
| id|desc|
| id|desc|
+---+----+
For some reason, it returned the column names as the row values?!
Case 2
Beeline:
create table test5 (id int, desc varchar(40)) STORED AS ORC;
insert into table test5 values (1,"aa"),(2,"bb");
select * from test5;
Pyspark:
Still the same error: Caused by: java.lang.NumberFormatException: For input string: "id"
Update 2
I created a table and inserted values through Hive on port 10000, then queried it. Through beeline this works fine:
beeline> !connect jdbc:hive2://localhost:10000/default hive hive
Connecting to jdbc:hive2://localhost:10000/default
Connected to: Apache Hive (version 1.2.1000.2.5.3.0-37)
Driver: Hive JDBC (version 1.2.1000.2.5.3.0-37)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/default> create table test2 (id String, desc String) STORED AS ORC;
No rows affected (0.3 seconds)
0: jdbc:hive2://localhost:10000/default> insert into table test2 values ("1","aa"),("2","bb");
INFO : Session is already open
INFO : Dag name: insert into table tes..."1","aa"),("2","bb")(Stage-1)
INFO : Tez session was closed. Reopening...
INFO : Session re-established.
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1514019042819_0006)
INFO : Map 1: -/-
INFO : Map 1: 0/1
INFO : Map 1: 0(+1)/1
INFO : Map 1: 1/1
INFO : Loading data to table default.test2 from webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test2/.hive-staging_hive_2017-12-23_04-29-54_569_601147868480753216-3/-ext-10000
INFO : Table default.test2 stats: [numFiles=1, numRows=2, totalSize=317, rawDataSize=342]
No rows affected (15.414 seconds)
0: jdbc:hive2://localhost:10000/default> select * from table2;
Error: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'table2' (state=42S02,code=10001)
0: jdbc:hive2://localhost:10000/default> select * from test2;
+-----------+-------------+--+
| test2.id | test2.desc |
+-----------+-------------+--+
| 1 | aa |
| 2 | bb |
+-----------+-------------+--+
2 rows selected (0.364 seconds)
Also through beeline, I can do the same thing against the Spark Thrift Server on port 10016, and it works fine:
beeline> !connect jdbc:hive2://localhost:10016/default hive hive
Connecting to jdbc:hive2://localhost:10016/default
1: jdbc:hive2://localhost:10016/default> create table test3 (id String, desc String) STORED AS ORC;
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (1.234 seconds)
1: jdbc:hive2://localhost:10016/default> insert into table test3 values ("1","aa"),("2","bb");
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (9.111 seconds)
1: jdbc:hive2://localhost:10016/default> select * from test3;
+-----+-------+--+
| id | desc |
+-----+-------+--+
| 1 | aa |
| 2 | bb |
+-----+-------+--+
2 rows selected (3.387 seconds)
This means Spark and the Thrift Server work. But with pyspark I run into the same problem again: the result is empty:
>>> df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="test3",user="hive", password="hive").load()
>>> df.select("*").show()
+---+----+
| id|desc|
+---+----+
+---+----+
Update 3
describe extended test3;
# Detailed Table Information | CatalogTable(
Table: `default`.`test3`
Owner: hive
Created: Sat Dec 23 04:37:14 PST 2017
Last Access: Wed Dec 31 16:00:00 PST 1969
Type: MANAGED
Schema: [`id` string, `desc` string]
Properties: [totalSize=620, numFiles=2, transient_lastDdlTime=1514032656, STATS_GENERATED_VIA_STATS_TASK=true]
Storage(Location: webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test3, InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, Serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde, Properties: [serialization.format=1]))
show create table test3;
CREATE TABLE `test3`(`id` string, `desc` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'totalSize' = '620',
'numFiles' = '2',
'transient_lastDdlTime' = '1514032656',
'STATS_GENERATED_VIA_STATS_TASK' = 'true'
)
su spark -c 'hdfs dfs -cat webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test3/part-00000'
Answer:
Even though you create the Hive table with specific data types, the underlying data is stored in string format when it is inserted.
So when Spark tries to read the data, it uses the metastore to find the data types. The id column shows up as int in the Hive metastore but as a string in the underlying file, and that mismatch throws the cast exception.
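To make the mismatch visible, one option (a sketch only, reusing the pyspark shell's sqlContext and the connection options from the question) is to inspect the schema Spark derives from the Thrift Server before triggering the read:

df = sqlContext.read.format("jdbc").options(
    driver="org.apache.hive.jdbc.HiveDriver",
    url="jdbc:hive2://localhost:10016/default",
    dbtable="test1", user="hive", password="hive").load()

# The schema is taken from the table metadata, so id should be reported as
# integer and desc as string, matching the metastore definition.
df.printSchema()

# show() then fails: the integer getter receives the string value "id",
# which Integer.parseInt cannot convert (the NumberFormatException above).
df.select("*").show()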
Solution
Create the table with string columns, and reading the data from Spark will work:
create table test1 (id String, desc String);
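A minimal sketch of the read after recreating the table with string columns, assuming the same connection details used in the question:

df = sqlContext.read.format("jdbc").options(
    driver="org.apache.hive.jdbc.HiveDriver",
    url="jdbc:hive2://localhost:10016/default",
    dbtable="test1", user="hive", password="hive").load()

# With both columns declared as string, Spark's JDBC getter no longer calls
# getInt(), so the NumberFormatException from the question cannot occur.
df.printSchema()
df.show()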
If you want to preserve the data types, specify one of the file formats (such as ORC or Parquet) when creating the table, and then insert into it. You can then read the data from Spark without the exception:
create table test1 (id int, desc varchar(40)) STORED AS ORC;
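One way to read such an ORC-backed table from Spark without going through the Hive JDBC driver at all is to load the warehouse files directly. This is only an illustration of "reading the file from Spark"; the path below is the test3 location shown in Update 3, so adjust it for your own table:

# Sketch: read the ORC files that back the table straight from the warehouse path.
df = sqlContext.read.orc("webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test3")
df.printSchema()
df.show()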
Now the question is why Hive is able to read it while Spark cannot: Hive applies implicit casts when reading, whereas Spark does not.