Draft:Apache DataFusion

From Wikipedia, the free encyclopedia
Open-source analytical query engine

Apache DataFusion
Developer Apache Software Foundation
Written in Rust
Type Query engine
License Apache License
Website datafusion.apache.org

Apache DataFusion is an open-source, extensible analytical query engine written in Rust, built on Apache Arrow's columnar memory format.[1] [2] It provides SQL and DataFrame interfaces for analytical query execution and is designed to be used as a library by developers building databases, query engines, and analytical tools, rather than as a standalone database server.[1] [2] The project originated in 2017, was donated to the Apache Arrow project in 2019, and became a top-level project of the Apache Software Foundation in 2024.[3] [4] As of March 2026, DataFusion had exceeded one million monthly downloads on crates.io.[5]

History

DataFusion was originally authored by Andy Grove, starting in 2017. It was donated to the Apache Arrow project in February 2019.[3] In 2024, a paper describing DataFusion was accepted to the industry track of the ACM SIGMOD conference.[6] [1] In April 2024, the project graduated from Apache Arrow and became a top-level Apache project.[4]

Features

DataFusion is a fast, extensible query engine for building data systems. It provides a SQL interface and a DataFrame API for constructing queries programmatically, a query planner and rule-based optimizer, and a multithreaded vectorized execution engine that processes data in columnar batches rather than row by row.[1] [2]
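The difference between batch-at-a-time and row-at-a-time processing can be illustrated with a minimal, single-threaded sketch that uses only the Rust standard library. It does not use DataFusion's API: the `Batch` type, its `price` and `quantity` columns, and the function names are hypothetical, and DataFusion itself operates on Apache Arrow arrays rather than plain vectors.

```rust
/// A hypothetical, simplified record batch: each column is stored as its own
/// contiguous vector. (Illustration only; not a DataFusion or Arrow type.)
pub struct Batch {
    pub price: Vec<f64>,
    pub quantity: Vec<i64>,
}

/// Split two equal-length columns into batches of at most `batch_size` rows,
/// roughly as a scan operator might hand data to the rest of a plan.
pub fn into_batches(price: &[f64], quantity: &[i64], batch_size: usize) -> Vec<Batch> {
    assert_eq!(price.len(), quantity.len(), "columns must have equal length");
    assert!(batch_size > 0, "batch size must be positive");
    price
        .chunks(batch_size)
        .zip(quantity.chunks(batch_size))
        .map(|(p, q)| Batch {
            price: p.to_vec(),
            quantity: q.to_vec(),
        })
        .collect()
}

/// Evaluate `SUM(price * quantity) WHERE quantity > min_quantity` for one batch.
/// The loop walks two dense columns instead of visiting row objects one by one.
pub fn batch_revenue(batch: &Batch, min_quantity: i64) -> f64 {
    let mut sum = 0.0;
    for (p, q) in batch.price.iter().zip(batch.quantity.iter()) {
        if *q > min_quantity {
            sum += *p * (*q as f64);
        }
    }
    sum
}

/// Combine the per-batch partial sums into the final aggregate.
pub fn total_revenue(batches: &[Batch], min_quantity: i64) -> f64 {
    batches.iter().map(|b| batch_revenue(b, min_quantity)).sum()
}
```

Because each batch yields an independent partial sum, an engine can in principle hand different batches to different threads and merge the results afterwards; this sketch omits that step.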

The engine reads common analytical file formats natively, including Apache Parquet, CSV, JSON, Avro, and Arrow IPC, and uses Apache Arrow's columnar memory format throughout execution, avoiding serialization overhead between stages.[1]

DataFusion is designed for in-process embedding: it runs within the host application's process rather than as a separate server, using threads for parallel query execution. Its extension points allow downstream systems to add user-defined functions, custom data sources, custom query languages, and new optimizer rules, enabling developers to build specialized database systems on top of DataFusion's planning and execution components without reimplementing them.[1] [2]
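The general idea of an extension point for user-defined functions can be sketched with a trait and a registry, using only the Rust standard library. This is an assumption-laden illustration rather than DataFusion's actual interface: `ScalarFunction`, `Registry`, and `Double` are hypothetical names, and DataFusion's real extension traits differ and operate on Arrow arrays.

```rust
use std::collections::HashMap;

/// Hypothetical extension trait: a scalar function applied to a whole column.
/// (Illustration only; not DataFusion's user-defined function interface.)
pub trait ScalarFunction {
    /// Name under which the function is looked up.
    fn name(&self) -> &str;
    /// Compute one output value per input value.
    fn invoke(&self, column: &[f64]) -> Vec<f64>;
}

/// Hypothetical registry owned by the host engine; downstream code adds
/// functions to it without modifying the engine itself.
pub struct Registry {
    functions: HashMap<String, Box<dyn ScalarFunction>>,
}

impl Registry {
    pub fn new() -> Self {
        Registry {
            functions: HashMap::new(),
        }
    }

    /// Register a function under its own name, replacing any previous entry.
    pub fn register(&mut self, function: Box<dyn ScalarFunction>) {
        let name = function.name().to_string();
        self.functions.insert(name, function);
    }

    /// Look up a function by name and apply it; `None` if it is not registered.
    pub fn call(&self, name: &str, column: &[f64]) -> Option<Vec<f64>> {
        self.functions.get(name).map(|f| f.invoke(column))
    }
}

/// Example user-defined function that doubles every value in a column.
pub struct Double;

impl ScalarFunction for Double {
    fn name(&self) -> &str {
        "double"
    }

    fn invoke(&self, column: &[f64]) -> Vec<f64> {
        column.iter().map(|v| v * 2.0).collect()
    }
}
```

In this sketch a host application would call `registry.register(Box::new(Double))` once at startup and then resolve `"double"` by name while evaluating a query; the same pattern of trait objects behind a lookup table generalizes to other kinds of plug-in components.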

DataFusion is frequently compared with other columnar analytical systems including DuckDB, Polars, and Velox, but these systems differ significantly in scope and intended use.[7]

Adoption and reception

DataFusion has been adopted across a range of analytics and database products. Cloudflare used DataFusion in its Log Explorer product to execute SQL queries over log data stored in Cloudflare R2.[8] Palantir Lightweight Pipelines are powered by DataFusion.[9] [10] InfluxDB 3.0 uses DataFusion as part of the FDAP stack: Apache Arrow Flight, DataFusion, Arrow, and Parquet.[11] Other users described in public sources include EDB Postgres AI,[12] Cube,[13] Spice AI,[14] Pydantic Logfire,[15] and Kamu.[16]

In 2024, CRN included Apache DataFusion in its list of "The 10 Coolest Open-Source Software Tools Of 2024".[17]

Language support

DataFusion itself is written in Rust. The project also has official Python bindings and community-maintained bindings and tooling for other languages and runtimes.[18] [19]

Ecosystem projects

Several projects in the broader Apache ecosystem and the community-maintained datafusion-contrib organization extend DataFusion's capabilities.[19]

References

  1. Lamb, Andrew; Shen, Yijie; Heres, Daniel; Chakraborty, Jayjeet; Kabak, Mehmet Ozan; Hsieh, Liang-Chi; Sun, Chao (2024). "Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine". Proceedings of the 2024 International Conference on Management of Data. doi:10.1145/3626246.3653368.
  2. "Introduction". Apache DataFusion. Apache Software Foundation. Retrieved 2026-03-22.
  3. "DataFusion: A Rust-native Query Engine for Apache Arrow". Apache DataFusion Blog. Apache Software Foundation. 2019-02-04. Retrieved 2026-03-22.
  4. "Apache Software Foundation Announces New Top-Level Project Apache DataFusion". The ASF Blog. Apache Software Foundation. 2024-06-11. Retrieved 2026-03-22.
  5. "datafusion". crates.io. Retrieved 2026-03-26.
  6. "SIGMOD 2024 Industrial Track: Accepted Papers". SIGMOD 2024. Retrieved 2026-03-22.
  7. Pedreira, Pedro; Erling, Orri; Karanasos, Konstantinos; Schneider, Scott; McKinney, Wes; Valluri, Satya R.; Zait, Mohamed; Nadeau, Jacques (2023). "The Composable Data Management System Manifesto". Proceedings of the VLDB Endowment. 16 (10). doi:10.14778/3603581.3603604.
  8. "Cloudflare Log Explorer is now GA, providing native observability and forensics". The Cloudflare Blog. Cloudflare. 2025-06-18. Retrieved 2026-03-22.
  9. "Announcements: July 2025". Palantir Foundry Documentation. Palantir Technologies. 2025-07-29. Retrieved 2026-03-22.
  10. "Announcements: February 2024". Palantir Foundry Documentation. Palantir Technologies. February 2024. Retrieved 2026-03-22.
  11. "Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0". InfluxData. 2023-10-25. Retrieved 2026-03-22.
  12. "Enterprise DB begins rolling AI features into PostgreSQL". SiliconANGLE. 2024-05-23. Retrieved 2026-03-22.
  13. "Query pushdown in Cube's semantic layer". Cube. 2024-06-03. Retrieved 2026-03-22.
  14. "How we use Apache DataFusion at Spice AI". Spice AI. 2026-01-17. Retrieved 2026-03-22.
  15. "We're changing database". GitHub. 2024-08-29. Retrieved 2026-03-22.
  16. "100X faster ingestion, and FlightSQL support for connecting BI tools". Kamu Data. 2023-09-26. Retrieved 2026-03-22.
  17. "The 10 Coolest Open-Source Software Tools Of 2024". CRN. 2024-11-21. Retrieved 2026-03-22.
  18. "datafusion-contrib". GitHub. Retrieved 2026-03-22.