
Databricks Open Sources Delta Lake to Make Data Lakes More Reliable


May 20, 2019 · 1 min read


Databricks recently announced the open sourcing of Delta Lake, its proprietary storage layer that brings ACID transactions to Apache Spark and big data workloads. Databricks is the company founded by the creators of Apache Spark, and Delta Lake is already in use at companies such as McGraw Hill, McAfee, Upwork, and Booz Allen Hamilton.

Delta Lake addresses the heterogeneous data problem that data lakes often have. Ingesting data from multiple pipelines means engineers must enforce data integrity manually across all the data sources. Delta Lake brings ACID transactions to the data lake, applying the strongest isolation level, serializability.
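As a minimal sketch of what this looks like in practice (assuming PySpark with the Delta Lake package on the classpath; the "/tmp/events" path and the example data are hypothetical), a write in the delta format is committed atomically through the table's transaction log:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Hypothetical example data; "/tmp/events" is a made-up path.
df = spark.range(0, 100).withColumnRenamed("id", "event_id")

# format("delta") routes the write through Delta Lake's _delta_log
# transaction log, so the commit is atomic: concurrent readers see
# either the whole write or none of it.
df.write.format("delta").mode("overwrite").save("/tmp/events")
```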

Delta Lake provides time travel, the ability to fetch any past version of the data, a feature quite useful for GDPR compliance and other audit-related requests. File metadata is stored using the exact same process as the data itself, enabling the same level of processing and feature richness.
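A short sketch of time travel against the hypothetical table above: Delta Lake's read options pin a query to an earlier commit, either by version number or by timestamp:

```python
# Load the table exactly as it existed at commit 0.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")

# Or select the snapshot that was current at a given point in time.
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2019-05-01")
            .load("/tmp/events"))
```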

Delta Lake also provides schema enforcement: data types and the presence of fields can be checked and enforced, ensuring the data is kept clean. Schema changes, on the other hand, don't require DDL and can be applied automatically.
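The sketch below, reusing the hypothetical table from earlier, shows both sides of this: by default a mismatched schema is rejected at write time, while opting in with the mergeSchema option evolves the table without any ALTER TABLE DDL:

```python
from pyspark.sql.functions import lit

# A dataframe whose schema does not match the existing table's.
extra = spark.range(5).withColumn("country", lit("US"))

# By default the append fails with an AnalysisException, keeping the
# table's schema clean:
# extra.write.format("delta").mode("append").save("/tmp/events")

# Opting in with mergeSchema evolves the table schema automatically:
(extra.write.format("delta").mode("append")
      .option("mergeSchema", "true").save("/tmp/events"))
```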

Delta Lake is deployed on top of an existing data lake. It is compatible with both batch and streaming data, and can be plugged into an existing Spark job as a new data source. Data is stored in the familiar Apache Parquet format.
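Because batch and streaming share the same format string, a Delta table can act as both a streaming source and a streaming sink. The sketch below assumes the same hypothetical paths (Structured Streaming requires a checkpoint location for the sink):

```python
# Read the Delta table as an unbounded stream of appended rows...
events = spark.readStream.format("delta").load("/tmp/events")

# ...and write the stream back out to another Delta table.
query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/events_out/_checkpoints")
         .start("/tmp/events_out"))
```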

Delta Lake is also compatible with MLflow, Databricks' open source machine learning platform launched last year. The code is available on GitHub.
