187 questions
0 votes · 0 answers · 84 views
How to merge small parquet files in Hudi into larger files
I use Spark + Hudi to write data into S3. I was writing data in bulk_insert mode, which left many small parquet files in the Hudi table.
Then I tried to schedule clustering on the Hudi table:
...
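For context, inline clustering in Hudi is typically enabled through a handful of write options. A minimal sketch is below; the thresholds are illustrative starting points, not values taken from the question:

```python
# Hedged sketch: Hudi write options that enable inline clustering so that
# small parquet files produced by bulk_insert get rewritten into larger ones.
# The byte thresholds are illustrative, not from the question above.
clustering_options = {
    # Run clustering as part of the write commit, not as a separate job
    "hoodie.clustering.inline": "true",
    # Trigger a clustering plan after this many commits
    "hoodie.clustering.inline.max.commits": "4",
    # Files smaller than this (bytes) are candidates for clustering
    "hoodie.clustering.plan.strategy.small.file.limit": str(300 * 1024 * 1024),
    # Target size (bytes) of the rewritten files
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
}

# On a real cluster these would be merged into the usual Hudi options and
# passed via df.write.format("hudi").options(**clustering_options)...
for key, value in sorted(clustering_options.items()):
    print(f"{key}={value}")
```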
0 votes · 0 answers · 69 views
Flink CDC + Hudi isn't working as expected; a log message says state is cleared
I'm using Flink CDC + Apache Hudi to sync data from MySQL to AWS S3. My Flink job looks like:
parallelism = 1
env = StreamExecutionEnvironment.get_execution_environment(config)
...
0 votes · 1 answer · 65 views
How to limit single file size when using Flink batch mode to write Parquet
I was using Flink in batch mode to read data from one source and write it directly to the file system in Parquet format.
The code was like:
hudi_source_ddl = f"""
...
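One knob commonly pointed at for this is the filesystem connector's `sink.rolling-policy.file-size` option. A sketch of a sink DDL follows; the table name, path, and schema are hypothetical, and for bulk formats such as Parquet the rolling behaviour also depends on how the job checkpoints, so treat this as a starting point rather than a guaranteed hard limit:

```python
# Hedged sketch: a Flink filesystem sink DDL with a rolling-policy size limit.
# Table name, path, and schema are hypothetical placeholders.
sink_ddl = """
CREATE TABLE parquet_sink (
    id BIGINT,
    name STRING
) WITH (
    'connector' = 'filesystem',
    'path' = 's3a://my-bucket/output/',
    'format' = 'parquet',
    'sink.rolling-policy.file-size' = '128MB'
)
"""
print(sink_ddl)
```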
0 votes · 1 answer · 63 views
Unable to register database/table in AWS Glue when Hudi job is submitted from EMR Serverless
I am using EMR 6.15 and Hudi 0.14.
I submitted the following Hudi job, which should create a database and a table in AWS Glue. The IAM role assigned to EMR Serverless has all necessary permissions for S3 and ...
0
votes
1
answer
52
views
Getting NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT while using Hudi
Getting this error when I try to execute Spark SQL.
Caused by: org.apache.spark.sql.AnalysisException: [NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT] CREATE Hive TABLE (AS SELECT)
is not supported, if ...
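This error generally means the SparkSession was built without Hive support. A sketch of the session-level confs that Hudi's Spark quickstart suggests is below; the exact values are assumptions to verify against the Spark/Hudi versions in use:

```python
# Hedged sketch: Spark confs commonly needed so Spark SQL DDL works with Hudi
# without hitting NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT. Values follow
# Hudi's Spark quickstart; verify against your Spark/Hudi versions.
spark_confs = {
    # Back the session catalog with Hive so CTAS-style commands are allowed
    # (equivalent to calling .enableHiveSupport() on the session builder)
    "spark.sql.catalogImplementation": "hive",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
}
for key, value in sorted(spark_confs.items()):
    print(f"{key}={value}")
```

On a real job these would be passed as `--conf` flags to spark-submit or via `SparkSession.builder.config(...)`.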
0 votes · 1 answer · 100 views
Unable to sink Hudi table to S3 in Flink
I'm trying to use Flink CDC to capture data changes from MySQL and update the Hudi table in S3.
My PyFlink job was like:
env = StreamExecutionEnvironment.get_execution_environment(config)
env....
0 votes · 1 answer · 139 views
java.io.IOException: No FileSystem for scheme: s3 in Flink
I run Flink in Docker on my local environment, and I'm trying to write a Flink job that uses CDC to sync MySQL data to S3 (stored in Apache Hudi format). My Flink job looks like:
t_env = StreamTableEnvironment....
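Flink loads S3 filesystem support from its plugins directory, so the usual fix for "No FileSystem for scheme: s3" is to copy the bundled flink-s3-fs-hadoop jar from `opt/` into `plugins/s3-fs-hadoop/` before starting the cluster (in Docker, baked into the image). The sketch below uses a throwaway FLINK_HOME and a stand-in jar so it stays runnable; on a real install, point it at the actual distribution:

```python
# Hedged sketch: emulate the standard fix for "No FileSystem for scheme: s3"
# in Flink. A throwaway FLINK_HOME and an empty stand-in jar keep this
# runnable; on a real install use the actual distribution directory.
import glob
import os
import shutil
import tempfile

flink_home = tempfile.mkdtemp()
os.makedirs(os.path.join(flink_home, "opt"))
# Stand-in for the jar that ships with the Flink distribution (version assumed).
open(os.path.join(flink_home, "opt", "flink-s3-fs-hadoop-1.17.1.jar"), "w").close()

# Flink discovers filesystem plugins in plugins/<name>/, not on the main classpath.
plugin_dir = os.path.join(flink_home, "plugins", "s3-fs-hadoop")
os.makedirs(plugin_dir)
for jar in glob.glob(os.path.join(flink_home, "opt", "flink-s3-fs-hadoop-*.jar")):
    shutil.copy(jar, plugin_dir)

print(os.listdir(plugin_dir))
```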
0 votes · 0 answers · 57 views
Hudi Compaction is very slow via Apache Flink
I have written a pipeline that sinks data from Kafka to Hudi on S3. It is working, but compaction is very slow.
It is a batch job that runs every hour and sinks the last hour data to ...
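For a MERGE_ON_READ table written from Flink, compaction throughput is usually governed by a few Hudi Flink options. A sketch follows; the numbers are illustrative starting points, not tuned values from the question:

```python
# Hedged sketch: Flink-side Hudi options that commonly govern compaction
# behaviour for a MERGE_ON_READ table. Values are illustrative defaults,
# not tuned for the workload described above.
compaction_options = {
    # Run compaction asynchronously inside the Flink job
    "compaction.async.enabled": "true",
    # Schedule compaction after this many delta commits
    "compaction.delta_commits": "5",
    # Parallelism of the compaction operator; raising this is the usual
    # first lever when compaction lags behind ingestion
    "compaction.tasks": "4",
}
for key, value in sorted(compaction_options.items()):
    print(f"{key}={value}")
```

These would go in the WITH clause of the Hudi table DDL or the Flink job's Hudi options map.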
0 votes · 0 answers · 75 views
Sinking to Hudi Table by using Spark and Flink together into the same S3 folder
I have a use case.
Sink (S1) -> I have written a job in Spark that sinks data from OpenSearch to S3.
Sink (S2) -> I have another job that sinks data from Kafka to S3 into the ...
1 vote · 0 answers · 21 views
Why is the Spark task not serialized?
This is the exception; VarScoreData is a case class.
Code:
case class VarScoreData(part: String, day: String, tel: String,
var_array: Array[Double], score_array: Array[Double])
...
-1 votes · 1 answer · 53 views
Is key uniqueness enforced within partitions or across all partitions?
Question:
I am working with Apache Flink (Flink SQL) to manage Hudi tables, and I noticed that Hudi supports multiple index types. According to the official documentation on Index Types in Hudi, these ...
1 vote · 1 answer · 148 views
PySpark: Could not load key generator class org.apache.hudi.keygen.ComplexKeyGenerator
When PySpark is used to write data to the Hudi table with the options below:
hudi_options = { 'hoodie.datasource.write.keygenerator.class':'org.apache.hudi.keygen.ComplexKeyGenerator',...
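In my experience this error usually points at the Hudi Spark bundle being absent from the classpath (e.g. spark-submit launched without `--packages org.apache.hudi:hudi-spark3-bundle...`) rather than at the options themselves. For reference, a minimal option set for a composite key looks like the sketch below; the table and field names are hypothetical:

```python
# Hedged sketch: a minimal PySpark option set for a composite record key.
# Table and field names are hypothetical placeholders. If the key generator
# class cannot be loaded, check that the Hudi Spark bundle jar is on the
# classpath before suspecting these options.
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    # ComplexKeyGenerator expects a comma-separated list of key fields
    "hoodie.datasource.write.recordkey.field": "user_id,event_id",
    "hoodie.datasource.write.partitionpath.field": "dt",
}
for key, value in sorted(hudi_options.items()):
    print(f"{key}={value}")
```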
1 vote · 0 answers · 38 views
Error due to S3 Partitions having different Datatype in Hudi Presto Table
We have data written to S3 in Hudi format with a dt partition. Recently, we started receiving very large numbers for some columns stored as the long datatype. These numbers exceeded the maximum limit of the ...
0 votes · 1 answer · 148 views
How Does Apache Hudi Perform Snapshot Queries
It's not really clear to me how Hudi ensures efficient snapshot queries (see https://hudi.apache.org/docs/next/table_types/).
What I see in the .hoodie folder is just a timeline consisting of lots ...
0 votes · 1 answer · 169 views
Alter table name to another table name with AWS EMR and S3
Using EMR 7.2
spark-sql (default)> ALTER TABLE account RENAME TO accountinfo;
24/08/14 02:31:04 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager ...