Empty Spark dataset reading Hive table

There are two Hive tables created from the same sources with the same logic, but using slightly different queries:

Table 1 query is:

create table test.table1 stored as orc as
select
    f1,
    mc.f2 as f2,
    mc.f3 as f3,
    f4
from src.test_table lateral view explode(multiple_field) mcTable as mc
union all
select
    f1,
    f5 as f2,
    f6 as f3,
    f4
from src.test_table
where multiple_field is null or size(multiple_field) < 1
;

The query for Table 2 implements the same logic, shortened by using lateral view outer explode:

create table test.table2 stored as orc as
select
    f1,
    if(mc is null, f5, mc.f2) as f2,
    if(mc is null, f6, mc.f3) as f3,
    f4
from src.test_table lateral view outer explode(multiple_field) mcTable as mc
;

Both tables were created successfully; they contain the same number of rows and identical data (verified with the Hive Beeline client). I then try to read the tables' data with Spark:

SparkSession sparkSession = SparkSession
        .builder()
        .config("hive.execution.engine", "mr")
        .appName("OrcExportJob")
        .enableHiveSupport()
        .getOrCreate();

String hql = "select * from test.table1"; // or test.table2
Dataset<Row> sqlDF = sparkSession.sql(hql);

For test.table2 everything is fine: sqlDF contains all the data. Reading test.table1 leads to a different result: sqlDF contains no data at all (0 rows). The Spark logs show no errors, just as if the table were really empty.
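
For completeness, a minimal check along these lines (just a sketch, continuing the snippet above with the same sparkSession) illustrates the difference I see:

// Beeline reports the same non-zero row count for both tables,
// but through Spark only table2 returns any data.
Dataset<Row> df1 = sparkSession.sql("select * from test.table1");
Dataset<Row> df2 = sparkSession.sql("select * from test.table2");
System.out.println("table1 count: " + df1.count()); // prints 0
System.out.println("table2 count: " + df2.count()); // prints the full row count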

I've heard that Spark has problems reading transactional or partitioned Hive tables, but neither applies here.
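
A quick way to double-check that (again only a sketch, using the same sparkSession) is to inspect the table metadata, which shows no partition columns and no transactional table properties:

// Inspect table metadata via Spark SQL with Hive support enabled
sparkSession.sql("describe formatted test.table1").show(200, false);
sparkSession.sql("show create table test.table1").show(false);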

Digging around, I discovered that Hive stores the ORC files for the two tables differently:

/
├─ user/
│ ├─ hive/
│ │ ├─ warehouse/
│ │ │ ├─ test.db/
│ │ │ │ ├─ table1/
│ │ │ │ │ ├─ 1/
│ │ │ │ │ │ ├─ 1/
│ │ │ │ │ │ │ ├─ 000000_0
│ │ │ │ │ ├─ 2/
│ │ │ │ │ │ ├─ 000000_0
│ │ │ │ │ │ ├─ 000001_0
│ │ │ │ │ │ ├─ 000002_0
│ │ │ │ │ │ ├─ 000003_0
│ │ │ │ ├─ table2/
│ │ │ │ │ ├─ 000000_0
│ │ │ │ │ ├─ 000001_0
│ │ │ │ │ ├─ 000002_0
│ │ │ │ │ ├─ 000003_0

Could someone help me figure out why Spark doesn't see Table 1's data?

Why does Hive keep 5 files in a nested directory structure for Table 1, but only 4 files in a flat structure for Table 2?

Could this somehow affect how Spark reads the data?

P.S. Hive version is 2.3.3, Spark version is 2.4.4
