Databricks Open Sources Delta Lake to Make Data Lakes More Reliable
May 20, 2019
Databricks recently announced that it is open sourcing Delta Lake, its proprietary storage layer, to bring ACID transactions to Apache Spark and big data workloads. Databricks was founded by the creators of Apache Spark, and Delta Lake is already in use at several companies, including McGraw Hill, McAfee, Upwork and Booz Allen Hamilton.
Delta Lake addresses the heterogeneous data problem that data lakes often have. Ingesting data from multiple pipelines means that engineers need to enforce data integrity manually across all the data sources. Delta Lake brings ACID transactions to the data lake with the strongest isolation level, serializability.
Delta Lake provides time travel, the ability to fetch any earlier version of a file, a feature that is useful for GDPR and other audit-related requests. File metadata is stored using the exact same process as data, enabling the same level of processing and feature richness.
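As a rough illustration of time travel, the sketch below reads an older snapshot of a table through the Spark DataFrame reader. It assumes a `SparkSession` that already has the Delta Lake package configured; the function names and the `spark` and `path` arguments are hypothetical, chosen only for this example. `versionAsOf` and `timestampAsOf` are the reader options Delta Lake documents for this purpose.

```python
def read_table_version(spark, path, version):
    """Return the table stored at `path` as of a given commit version.

    `spark` is assumed to be a SparkSession with Delta Lake available.
    """
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)  # integer commit version, e.g. 0
        .load(path)
    )


def read_table_at(spark, path, timestamp):
    """Return the table as of a timestamp string, e.g. '2019-05-01 00:00:00'."""
    return (
        spark.read.format("delta")
        .option("timestampAsOf", timestamp)
        .load(path)
    )
```

For an audit request, one might compare `read_table_version(spark, path, 0)` against the current table to see what changed since the first commit.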
Delta Lake provides schema enforcement capabilities. Data types and the presence of fields can be checked and enforced, making sure that the data is kept clean. Schema changes, on the other hand, don’t require DDL and can be applied automatically.
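The two behaviours can be sketched with the Spark DataFrame writer. This is a minimal example under stated assumptions: `df` is a Spark DataFrame, `path` points at an existing Delta table, and both function names are hypothetical. A plain append is checked against the table schema and is expected to fail on a mismatch, while the `mergeSchema` writer option asks Delta Lake to evolve the schema instead.

```python
def append_strict(df, path):
    """Append rows to the Delta table at `path`.

    With no extra options, Delta Lake checks df's schema against the
    table and rejects the write if the two do not match.
    """
    df.write.format("delta").mode("append").save(path)


def append_with_schema_evolution(df, path):
    """Append rows and let new columns in df be added to the table schema."""
    (
        df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")  # opt in to automatic schema changes
        .save(path)
    )
```

Keeping evolution as an explicit opt-in per write means an unexpected column from an upstream pipeline surfaces as an error rather than silently changing the table.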
Delta Lake is deployed on top of the existing data lake. It is compatible with both batch and streaming data and can be plugged into an existing Spark job as a new data source. Data is stored in the familiar Apache Parquet format.
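In practice, plugging Delta Lake into an existing job usually comes down to changing the format string passed to the Spark reader or writer. The sketch below assumes a `SparkSession` with the Delta Lake package available; the helper names and arguments are hypothetical and exist only to show the batch and streaming entry points side by side.

```python
def write_batch(df, path):
    """Write a DataFrame as a Delta table; data files are stored as Parquet."""
    df.write.format("delta").save(path)


def read_batch(spark, path):
    """Read the current snapshot of a Delta table as a batch DataFrame."""
    return spark.read.format("delta").load(path)


def read_stream(spark, path):
    """Read the same Delta table as a streaming source."""
    return spark.readStream.format("delta").load(path)
```

A job that previously called `format("parquet")` could, under these assumptions, switch to `format("delta")` without restructuring the rest of the pipeline.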
Delta Lake is also compatible with MLflow, Databricks' newest open source platform, which was launched last year. The code is available on GitHub.