Apache Hudi 1.0 Now Generally Available

Jan 18, 2025 2 min read

The Apache Software Foundation has recently announced the general availability of Apache Hudi 1.0, the transactional data lake platform with support for near real-time analytics. Initially introduced in 2017, Apache Hudi provides an open table format optimized for efficient writes in incremental data pipelines and fast query performance.

Originally developed at Uber as an incremental processing framework on Apache Hadoop and submitted to the Apache Software Foundation in 2019, Hudi is designed to bridge the gap between database-like functionality and open data lakehouse architectures. Hudi’s main strength lies in its ability to support both near real-time and batch queries simultaneously.

The latest release introduces new features aimed at transforming data lakehouses into what the project community considers a fully-fledged "Data Lakehouse Management System" (DLMS). Vinoth Chandar, creator of the Hudi Project at Uber and CEO at Onehouse, writes:

Hudi shines by providing a high-performance open table format as well as a comprehensive open-source software stack that can ingest, store, optimize and effectively self-manage a data lakehouse. This distinction between open formats and open software is often lost in translation inside the large vendor ecosystem in which Hudi operates. Still, it has been and remains a key consideration for Hudi’s users to avoid compute-lockin to any given data vendor.

Released under the Apache License 2.0, Hudi 1.0 introduces a new secondary indexing system designed to enhance query performance and reduce data scanning costs. Users can now create SQL-based indexes on secondary columns, significantly speeding up query execution. The release also includes expression-based indexing, similar to a feature in PostgreSQL, which replaces traditional partitioning strategies to enable more flexible and efficient data organization. When the preview was announced last year, Boris Litvak, principal software engineer at Snyk, wrote:

Among the big 3 ACID storage formats on Object Storage, Apache Hudi 1.0 (beta) is the first one introducing "functional indexes" over the data. We usually call it "secondary indexes" in SQL DB jargon. When will Delta.io and Apache Iceberg follow?

Source: Apache Hudi Blog
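
To make the idea of an expression-based index more concrete, here is a minimal conceptual sketch in plain Python. It is not Hudi's API or storage layout: the file names, the `day_of` expression, and the helper functions are all hypothetical. It only illustrates how an index keyed on the result of an expression (here, the UTC day derived from a timestamp) can narrow a query to a few files without the data being physically partitioned by that value.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical data files: each holds rows of (record_id, epoch_seconds).
# Rows are NOT laid out by day; file_c mixes two different days.
FILES = {
    "file_a": [(1, 1735689600), (2, 1735693200)],
    "file_b": [(3, 1735776000)],
    "file_c": [(4, 1735700000), (5, 1735862400)],
}


def day_of(epoch_seconds):
    """The indexed expression: map a timestamp to its UTC calendar day."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d")


def build_expression_index(files, expr):
    """Map each value of expr(timestamp) to the set of files containing it."""
    index = defaultdict(set)
    for name, rows in files.items():
        for _, ts in rows:
            index[expr(ts)].add(name)
    return index


def query(files, index, expr, wanted):
    """Return (files actually scanned, matching record ids) for expr == wanted."""
    candidates = sorted(index.get(wanted, ()))
    matches = [
        record_id
        for name in candidates
        for record_id, ts in files[name]
        if expr(ts) == wanted
    ]
    return candidates, matches


index = build_expression_index(FILES, day_of)
scanned, ids = query(FILES, index, day_of, "2025-01-02")
print(scanned, ids)
```

In this toy setup, a query for a single day reads only the files the index points to, while the rows themselves stay wherever they were written.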

The release introduces support for partial updates, which improves storage and compute efficiency by allowing updates to specific fields instead of entire rows. Additionally, non-blocking concurrency control enables multiple streaming jobs to write to the same dataset without causing bottlenecks or failures. Discussing the database architecture, Chandar adds:

Regarding full-fledged DLMS functionality, the closest experience Hudi 1.0 offers is through Apache Spark. Users can deploy a Spark server (or Spark Connect) with Hudi 1.0 installed, submit SQL/jobs, orchestrate table services via SQL commands, and enjoy new secondary index functionality to speed up queries like a DBMS.
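
The partial updates mentioned above can be illustrated with a small sketch. This is a simplified model in plain Python, not Hudi's merge logic or file format; the record keys, field names, and the `apply_partial_updates` helper are assumptions made for illustration. The point is that each logged update carries only the fields that changed, and a reader merges them over the base row.

```python
def apply_partial_updates(base_table, update_log):
    """Merge field-level updates over base rows, in log order.

    base_table: dict mapping record key -> full row (dict of fields)
    update_log: list of (record key, dict of changed fields only)
    """
    # Copy rows so the base table is left untouched.
    merged = {key: dict(row) for key, row in base_table.items()}
    for key, changed_fields in update_log:
        # Only the fields present in the update are overwritten.
        merged.setdefault(key, {}).update(changed_fields)
    return merged


base_table = {
    "rider-1": {"name": "Ana", "city": "Lisbon", "fare": 12.5},
    "rider-2": {"name": "Raj", "city": "Pune", "fare": 8.0},
}

# Each entry stores just the changed columns instead of a full row.
update_log = [
    ("rider-1", {"fare": 14.0}),
    ("rider-2", {"city": "Mumbai"}),
    ("rider-1", {"city": "Porto"}),
]

current = apply_partial_updates(base_table, update_log)
print(current["rider-1"])
```

Because the log entries hold one field each rather than three, less data is written per update, which is the storage and compute saving the release describes.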

Hudi 1.0 introduces enhancements to the storage engine, including the adoption of a log-structured merge (LSM) tree for efficient timeline management. This supports long-term data retention and ensures high-performance query planning, even for datasets containing billions of records. Bhavani Sudha Saktheeswaran, software engineer at Onehouse and Apache Hudi PMC, comments:

Whether you're building an open data platform, streaming into the data lakehouse, moving away from data warehouses, or optimizing for high-performance queries, Hudi 1.0.0 makes it easier than ever to work with lakehouses.
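
For readers unfamiliar with log-structured merge trees, the following sketch shows the general compaction pattern that the storage engine change refers to. It is a generic, in-memory LSM illustration and makes no claim about how Hudi actually lays out its timeline; the `LsmTimeline` class, the `fanout` parameter, and the integer instant times are hypothetical. New batches of instants land as small sorted runs, and whenever a level holds `fanout` runs they are merged into one larger run on the next level, so the number of runs to read stays small as history grows.

```python
import heapq


class LsmTimeline:
    """Toy LSM layout: levels[i] is a list of sorted runs of instant times."""

    def __init__(self, fanout=2):
        self.fanout = fanout
        self.levels = [[]]

    def append(self, batch):
        """Write a new batch as a sorted run on level 0, then compact."""
        self.levels[0].append(sorted(batch))
        self._compact()

    def _compact(self):
        level = 0
        while level < len(self.levels):
            if len(self.levels[level]) >= self.fanout:
                # Merge all runs on this level into one run on the next level.
                merged = list(heapq.merge(*self.levels[level]))
                self.levels[level] = []
                if level + 1 == len(self.levels):
                    self.levels.append([])
                self.levels[level + 1].append(merged)
            level += 1

    def run_count(self):
        """Number of sorted runs a reader would need to open."""
        return sum(len(runs) for runs in self.levels)

    def instants(self):
        """All instants in sorted order, merged across every run."""
        runs = [run for runs in self.levels for run in runs]
        return list(heapq.merge(*runs))


timeline = LsmTimeline(fanout=2)
for batch in ([2, 1], [3, 4], [5], [6]):
    timeline.append(batch)

print(timeline.run_count(), timeline.instants())
```

With a fanout of 2, the four small batches above end up merged into a single sorted run, which is why lookups over a long history do not need to open one file per commit.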

Saktheeswaran and Saketh Chintapalli, software engineer at Uber, presented a session on incremental data processing with Apache Hudi at QCon San Francisco. The session recording is available on InfoQ.

About the Author

Renato Losio
