
There are two Hive tables created from the same sources with the same logic, but with slightly different queries:

Table 1 query is:

create table test.table1 stored as orc as
 select
 f1,
 mc.f2 as f2,
 mc.f3 as f3,
 f4
 from src.test_table lateral view explode(multiple_field) mcTable as mc
 union all
 select
 f1,
 f5 as f2,
 f6 as f3,
 f4
 from src.test_table
 where multiple_field is null or size(multiple_field) < 1
;

Next, the Table 2 query - the same logic, shortened using LATERAL VIEW OUTER EXPLODE:

 create table test.table2 stored as orc as
 select
 f1,
 if(mc is null, f5, mc.f2) as f2,
 if(mc is null, f6, mc.f3) as f3,
 f4
 from src.test_table lateral view outer explode(multiple_field) mcTable as mc
;

Both tables were created successfully and contain the same number of rows and identical data (checked with the Hive Beeline client). Then I try to read the tables' data with Spark:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession sparkSession = SparkSession
 .builder().config("hive.execution.engine", "mr")
 .appName("OrcExportJob")
 .enableHiveSupport()
 .getOrCreate();
String hql = "select * from test.table1"; // or test.table2
Dataset<Row> sqlDF = sparkSession.sql(hql);

With test.table2 everything is OK - sqlDF contains all the data. Reading test.table1 leads to a different result - sqlDF contains no data at all (0 rows). The Spark logs show no errors - just as if the table were really empty.

I've heard Spark has some problems reading transactional or partitioned Hive tables - but my tables are neither transactional nor partitioned.

Digging around, I found that Hive stores the ORC files for my tables in different ways:

/
├─ user/
│ ├─ hive/
│ │ ├─ warehouse/
│ │ │ ├─ test.db/
│ │ │ │ ├─ table1/
│ │ │ │ │ ├─ 1/
│ │ │ │ │ │ ├─ 1/
│ │ │ │ │ │ │ ├─ 000000_0
│ │ │ │ │ ├─ 2/
│ │ │ │ │ │ ├─ 000000_0
│ │ │ │ │ │ ├─ 000001_0
│ │ │ │ │ │ ├─ 000002_0
│ │ │ │ │ │ ├─ 000003_0
│ │ │ │ ├─ table2/
│ │ │ │ │ ├─ 000000_0
│ │ │ │ │ ├─ 000001_0
│ │ │ │ │ ├─ 000002_0
│ │ │ │ │ ├─ 000003_0

Could someone help me figure out why Spark doesn't see the Table 1 data?

Why does Hive keep 5 files in a complex directory structure for Table 1, but only 4 files in a simple structure for Table 2?

Could this somehow affect the Spark reading process?

P.S. Hive version is 2.3.3, Spark version is 2.4.4.


1 Answer


Normally, data files are located directly inside the table location, without subdirectories.

The UNION ALL is being optimized (most probably you are using Tez), and each branch of the union runs in parallel and independently, as separate mapper tasks. This requires a separate subdirectory for each branch of the UNION ALL so that the results of both branches can be written simultaneously; this is why you have two directories.

These settings allow Hive to read subdirectories:

set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;

There is an issue, SPARK-26663 (Cannot query a Hive table with subdirectories), closed as "Cannot Reproduce" because the reproduction steps were executed on MR instead of Tez.

If you need to read such tables, try using HiveContext and setting the above properties.
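
A minimal sketch of how these properties might be applied on the Spark side (assumptions not taken from the answer itself: mapreduce.input.fileinputformat.input.dir.recursive is used as the Hadoop-level counterpart of hive.input.dir.recursive, and spark.sql.hive.convertMetastoreOrc=false is added so the Hive SerDe reader is used instead of Spark's native ORC reader; whether this works depends on your Spark version, so treat it as something to try rather than a guaranteed fix):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch only: read the Hive table through the Hive SerDe path with recursive directory listing enabled.
SparkSession spark = SparkSession
 .builder()
 .appName("OrcExportJob")
 .config("spark.sql.hive.convertMetastoreOrc", "false") // assumption: bypass the native ORC reader
 .enableHiveSupport()
 .getOrCreate();

Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
hadoopConf.set("hive.mapred.supports.subdirectories", "true");
hadoopConf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true");

Dataset<Row> sqlDF = spark.sql("select * from test.table1"); // re-check whether the subdirectory data is visible now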

BTW, your second query is more efficient because it reads the source table only once and does not create subdirectories.

Also, you can try running your CREATE TABLE on MR; it will not create sub-directories (set hive.execution.engine=mr;).

Also, wrapping the UNION ALL in a subquery and adding something like DISTRIBUTE BY or ORDER BY will force an additional reduce step, see https://stackoverflow.com/a/54998031/2700344 and the sketch below.
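
For illustration, a sketch of such a rewrite of the Table 1 query (assumption: the choice of column does not matter - f1 is used here only to force the extra reduce stage, whose output is then written as plain files directly under the table directory):

create table test.table1 stored as orc as
select f1, f2, f3, f4
from
(
 select
  f1,
  mc.f2 as f2,
  mc.f3 as f3,
  f4
 from src.test_table lateral view explode(multiple_field) mcTable as mc
 union all
 select
  f1,
  f5 as f2,
  f6 as f3,
  f4
 from src.test_table
 where multiple_field is null or size(multiple_field) < 1
) u
distribute by f1
;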
