Empty Spark dataset reading Hive table

There are two Hive tables created from the same sources with the same logic, but using slightly different queries:

Table 1 query is:

create table test.table1 stored as orc as
select
    f1,
    mc.f2 as f2,
    mc.f3 as f3,
    f4
from src.test_table lateral view explode(multiple_field) mcTable as mc
union all
select
    f1,
    f5 as f2,
    f6 as f3,
    f4
from src.test_table
where multiple_field is null or size(multiple_field) < 1
;

The query for Table 2 implements the same logic, shortened by using lateral view outer explode:

create table test.table2 stored as orc as
select
    f1,
    if(mc is null, f5, mc.f2) as f2,
    if(mc is null, f6, mc.f3) as f3,
    f4
from src.test_table lateral view outer explode(multiple_field) mcTable as mc
;

Both tables were created successfully; they contain the same number of rows and identical data (verified with the Hive Beeline client). I then try to read the tables' data with Spark:

SparkSession sparkSession = SparkSession
        .builder()
        .config("hive.execution.engine", "mr")
        .appName("OrcExportJob")
        .enableHiveSupport()
        .getOrCreate();

String hql = "select * from test.table1"; // or test.table2
Dataset<Row> sqlDF = sparkSession.sql(hql);

For test.table2 everything is fine: sqlDF contains all the data. Reading test.table1 leads to a different result: sqlDF contains no data at all (0 rows). The Spark logs show no errors, just as if the table were really empty.
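
For completeness, a minimal check along these lines (just a sketch, continuing the snippet above with the same sparkSession) illustrates the difference I see:

// Beeline reports the same non-zero row count for both tables,
// but through Spark only table2 returns any data.
Dataset<Row> df1 = sparkSession.sql("select * from test.table1");
Dataset<Row> df2 = sparkSession.sql("select * from test.table2");
System.out.println("table1 count: " + df1.count()); // prints 0
System.out.println("table2 count: " + df2.count()); // prints the full row count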

I've heard that Spark has problems reading transactional or partitioned Hive tables, but neither applies here.
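
A quick way to double-check that (again only a sketch, using the same sparkSession) is to inspect the table metadata, which shows no partition columns and no transactional table properties:

// Inspect table metadata via Spark SQL with Hive support enabled
sparkSession.sql("describe formatted test.table1").show(200, false);
sparkSession.sql("show create table test.table1").show(false);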

Digging around, I discovered that Hive stores the ORC files for the two tables differently:

/
├─ user/
│ ├─ hive/
│ │ ├─ warehouse/
│ │ │ ├─ test.db/
│ │ │ │ ├─ table1/
│ │ │ │ │ ├─ 1/
│ │ │ │ │ │ ├─ 1/
│ │ │ │ │ │ │ ├─ 000000_0
│ │ │ │ │ ├─ 2/
│ │ │ │ │ │ ├─ 000000_0
│ │ │ │ │ │ ├─ 000001_0
│ │ │ │ │ │ ├─ 000002_0
│ │ │ │ │ │ ├─ 000003_0
│ │ │ │ ├─ table2/
│ │ │ │ │ ├─ 000000_0
│ │ │ │ │ ├─ 000001_0
│ │ │ │ │ ├─ 000002_0
│ │ │ │ │ ├─ 000003_0

Could someone help me figure out why Spark doesn't see Table 1's data?

Why does Hive keep 5 files in a nested directory structure for Table 1, but only 4 files in a flat structure for Table 2?

Could this somehow affect how Spark reads the data?

P.S. Hive version is 2.3.3, Spark version is 2.4.4
