187 questions
0 votes · 0 answers · 84 views
How to merge small parquet files in Hudi into larger files
I use Spark + Hudi to write data into S3. I was writing data in bulk_insert mode, which left many small parquet files in the Hudi table.
Then I tried to schedule clustering on the Hudi table:
...
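For context, inline clustering in Hudi is typically enabled through a handful of write options. A minimal sketch is below; the thresholds are illustrative starting points, not values taken from the question:

```python
# Hedged sketch: Hudi write options that enable inline clustering so that
# small parquet files produced by bulk_insert get rewritten into larger ones.
# The byte thresholds are illustrative, not from the question above.
clustering_options = {
    # Run clustering as part of the write commit, not as a separate job
    "hoodie.clustering.inline": "true",
    # Trigger a clustering plan after this many commits
    "hoodie.clustering.inline.max.commits": "4",
    # Files smaller than this (bytes) are candidates for clustering
    "hoodie.clustering.plan.strategy.small.file.limit": str(300 * 1024 * 1024),
    # Target size (bytes) of the rewritten files
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
}

# On a real cluster these would be merged into the usual Hudi options and
# passed via df.write.format("hudi").options(**clustering_options)...
for key, value in sorted(clustering_options.items()):
    print(f"{key}={value}")
```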
0 votes · 0 answers · 69 views
Flink CDC + Hudi isn't working as expected; a log message says state is cleared
I'm using Flink CDC + Apache Hudi to sync data from MySQL to AWS S3. My Flink job looks like:
parallelism = 1
env = StreamExecutionEnvironment.get_execution_environment(config)
...
0 votes · 1 answer · 65 views
How to limit single file size when using Flink batch mode to write Parquet
I was using Flink in batch mode to read data from one source and write it directly to the file system in Parquet format.
The code was like:
hudi_source_ddl = f"""
...
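One knob commonly pointed at for this is the filesystem connector's `sink.rolling-policy.file-size` option. A sketch of a sink DDL follows; the table name, path, and schema are hypothetical, and for bulk formats such as Parquet the rolling behaviour also depends on how the job checkpoints, so treat this as a starting point rather than a guaranteed hard limit:

```python
# Hedged sketch: a Flink filesystem sink DDL with a rolling-policy size limit.
# Table name, path, and schema are hypothetical placeholders.
sink_ddl = """
CREATE TABLE parquet_sink (
    id BIGINT,
    name STRING
) WITH (
    'connector' = 'filesystem',
    'path' = 's3a://my-bucket/output/',
    'format' = 'parquet',
    'sink.rolling-policy.file-size' = '128MB'
)
"""
print(sink_ddl)
```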
0 votes · 1 answer · 63 views
Unable to register database/table in AWS Glue when Hudi job is submitted from EMR Serverless
I am using EMR 6.15 and Hudi 0.14.
I submitted the following Hudi job, which should create a database and a table in AWS Glue. The IAM role assigned to EMR Serverless has all necessary permissions for S3 and ...
0
votes
1
answer
52
views
Getting NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT while using Hudi
Getting this error when I try to execute Spark SQL.
Caused by: org.apache.spark.sql.AnalysisException: [NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT] CREATE Hive TABLE (AS SELECT)
is not supported, if ...
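This error generally means the SparkSession was built without Hive support. A sketch of the session-level confs that Hudi's Spark quickstart suggests is below; the exact values are assumptions to verify against the Spark/Hudi versions in use:

```python
# Hedged sketch: Spark confs commonly needed so Spark SQL DDL works with Hudi
# without hitting NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT. Values follow
# Hudi's Spark quickstart; verify against your Spark/Hudi versions.
spark_confs = {
    # Back the session catalog with Hive so CTAS-style commands are allowed
    # (equivalent to calling .enableHiveSupport() on the session builder)
    "spark.sql.catalogImplementation": "hive",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
}
for key, value in sorted(spark_confs.items()):
    print(f"{key}={value}")
```

On a real job these would be passed as `--conf` flags to spark-submit or via `SparkSession.builder.config(...)`.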
0 votes · 1 answer · 100 views
Unable to sink Hudi table to S3 in Flink
I'm trying to use Flink CDC to capture data changes from MySQL and update the Hudi table in S3.
My PyFlink job was like:
env = StreamExecutionEnvironment.get_execution_environment(config)
env....
0 votes · 1 answer · 139 views
java.io.IOException: No FileSystem for scheme: s3 in Flink
I run Flink in Docker on my local environment, and I'm trying to write a Flink job that uses CDC to sync MySQL data to S3 (stored in Apache Hudi format). My Flink job looks like:
t_env = StreamTableEnvironment....
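Flink loads S3 filesystem support from its plugins directory, so the usual fix for "No FileSystem for scheme: s3" is to copy the bundled flink-s3-fs-hadoop jar from `opt/` into `plugins/s3-fs-hadoop/` before starting the cluster (in Docker, baked into the image). The sketch below uses a throwaway FLINK_HOME and a stand-in jar so it stays runnable; on a real install, point it at the actual distribution:

```python
# Hedged sketch: emulate the standard fix for "No FileSystem for scheme: s3"
# in Flink. A throwaway FLINK_HOME and an empty stand-in jar keep this
# runnable; on a real install use the actual distribution directory.
import glob
import os
import shutil
import tempfile

flink_home = tempfile.mkdtemp()
os.makedirs(os.path.join(flink_home, "opt"))
# Stand-in for the jar that ships with the Flink distribution (version assumed).
open(os.path.join(flink_home, "opt", "flink-s3-fs-hadoop-1.17.1.jar"), "w").close()

# Flink discovers filesystem plugins in plugins/<name>/, not on the main classpath.
plugin_dir = os.path.join(flink_home, "plugins", "s3-fs-hadoop")
os.makedirs(plugin_dir)
for jar in glob.glob(os.path.join(flink_home, "opt", "flink-s3-fs-hadoop-*.jar")):
    shutil.copy(jar, plugin_dir)

print(os.listdir(plugin_dir))
```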
0 votes · 0 answers · 57 views
Hudi Compaction is very slow via Apache Flink
I have written a pipeline that sinks data from Kafka to Hudi on S3. It is working, but compaction is very slow.
It is a batch job that runs every hour and sinks the last hour data to ...
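For a MERGE_ON_READ table written from Flink, compaction throughput is usually governed by a few Hudi Flink options. A sketch follows; the numbers are illustrative starting points, not tuned values from the question:

```python
# Hedged sketch: Flink-side Hudi options that commonly govern compaction
# behaviour for a MERGE_ON_READ table. Values are illustrative defaults,
# not tuned for the workload described above.
compaction_options = {
    # Run compaction asynchronously inside the Flink job
    "compaction.async.enabled": "true",
    # Schedule compaction after this many delta commits
    "compaction.delta_commits": "5",
    # Parallelism of the compaction operator; raising this is the usual
    # first lever when compaction lags behind ingestion
    "compaction.tasks": "4",
}
for key, value in sorted(compaction_options.items()):
    print(f"{key}={value}")
```

These would go in the WITH clause of the Hudi table DDL or the Flink job's Hudi options map.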
0 votes · 0 answers · 75 views
Sinking to Hudi Table by using Spark and Flink together into the same S3 folder
I have a use case.
Sink (S1) -> I have written a job in Spark that sinks data from OpenSearch to S3.
Sink (S2) -> I have another job that sinks data from Kafka to S3 into the ...
1 vote · 0 answers · 21 views
Why is the Spark task not serialized?
This is the exception; VarScoreData is a case class.
Code:
case class VarScoreData(part: String, day: String, tel: String,
var_array: Array[Double], score_array: Array[Double])
...
-1 votes · 1 answer · 53 views
Is key uniqueness enforced within partitions or across all partitions?
Question:
I am working with Apache Flink (Flink SQL) to manage Hudi tables, and I noticed that Hudi supports multiple index types. According to the official documentation on Index Types in Hudi, these ...
1 vote · 1 answer · 148 views
PySpark: Could not load key generator class org.apache.hudi.keygen.ComplexKeyGenerator
When PySpark is used to write data to the Hudi table with the options below:
hudi_options = { 'hoodie.datasource.write.keygenerator.class':'org.apache.hudi.keygen.ComplexKeyGenerator',...
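In my experience this error usually points at the Hudi Spark bundle being absent from the classpath (e.g. spark-submit launched without `--packages org.apache.hudi:hudi-spark3-bundle...`) rather than at the options themselves. For reference, a minimal option set for a composite key looks like the sketch below; the table and field names are hypothetical:

```python
# Hedged sketch: a minimal PySpark option set for a composite record key.
# Table and field names are hypothetical placeholders. If the key generator
# class cannot be loaded, check that the Hudi Spark bundle jar is on the
# classpath before suspecting these options.
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    # ComplexKeyGenerator expects a comma-separated list of key fields
    "hoodie.datasource.write.recordkey.field": "user_id,event_id",
    "hoodie.datasource.write.partitionpath.field": "dt",
}
for key, value in sorted(hudi_options.items()):
    print(f"{key}={value}")
```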
1 vote · 0 answers · 38 views
Error due to S3 Partitions having different Datatype in Hudi Presto Table
We have data written to S3 in Hudi format with a dt partition. Recently, we started receiving very large numbers for some columns stored as the long datatype. These numbers exceeded the maximum limit of the ...
0 votes · 1 answer · 148 views
How Does Apache Hudi Perform Snapshot Queries
It's not really clear to me how Hudi ensures efficient snapshot queries (see https://hudi.apache.org/docs/next/table_types/).
What I see in the .hoodie folder is just a timeline consisting of lots ...
0 votes · 1 answer · 169 views
Alter table name to another table name with AWS EMR and S3
Using EMR 7.2
spark-sql (default)> ALTER TABLE account RENAME TO accountinfo;
24/08/14 02:31:04 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager ...