| Documentation | Tutorials | Chat with us on slack! |
|---|---|---|
Fugue is a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark, Dask and Ray without rewrites.
The most common use cases are:
For a more comprehensive overview of Fugue, read this article.
Fugue can be installed through pip or conda. For example:
```bash
pip install fugue
```
It also has the following extras:
For example a common use case is:
```bash
pip install fugue[duckdb,spark]
```
Note that installing extras may not be necessary. For example, if you have already installed Spark or DuckDB independently, Fugue will automatically enable support for them.
The best way to get started with Fugue is to work through the 10 minute tutorials:
The tutorials can also be run in an interactive notebook environment through binder or Docker:
Note that it runs slowly on binder because the machine on binder isn't powerful enough for a distributed framework such as Spark. Parallel executions can become sequential, so some of the performance comparison examples will not give accurate numbers.
Alternatively, you should get decent performance by running this Docker image on your own machine:
```bash
docker run -p 8888:8888 fugueproject/tutorials:latest
```
For the API docs, click here
The simplest way to use Fugue is the transform() function. It lets users parallelize the execution of a single function by bringing it to Spark, Dask, or Ray. In the example below, the map_letter_to_food() function takes in a mapping and applies it to a column. This is just pandas and Python so far (without Fugue).
```python
import pandas as pd
from typing import Dict

input_df = pd.DataFrame({"id": [0, 1, 2], "value": ["A", "B", "C"]})
map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

def map_letter_to_food(df: pd.DataFrame, mapping: Dict[str, str]) -> pd.DataFrame:
    df["value"] = df["value"].map(mapping)
    return df
```
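Since nothing Fugue-specific is involved yet, the function can be sanity-checked locally as plain pandas before it is brought to a distributed engine (the definitions are repeated so the snippet stands alone):

```python
import pandas as pd
from typing import Dict

input_df = pd.DataFrame({"id": [0, 1, 2], "value": ["A", "B", "C"]})
map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

def map_letter_to_food(df: pd.DataFrame, mapping: Dict[str, str]) -> pd.DataFrame:
    df["value"] = df["value"].map(mapping)
    return df

# run locally on a copy so input_df stays unchanged
out = map_letter_to_food(input_df.copy(), map_dict)
print(out["value"].tolist())  # ['Apple', 'Banana', 'Carrot']
```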
Now, the map_letter_to_food() function is brought to the Spark execution engine by invoking the transform function of Fugue. The output schema, params and engine are passed to the transform() call. The schema is needed because it's a requirement on Spark. A schema of "*" below means all input columns are in the output.
```python
from pyspark.sql import SparkSession
from fugue import transform

spark = SparkSession.builder.getOrCreate()
df = transform(
    input_df,
    map_letter_to_food,
    schema="*",
    params=dict(mapping=map_dict),
    engine=spark,
)
df.show()
```
```
+---+------+
| id| value|
+---+------+
|  0| Apple|
|  1|Banana|
|  2|Carrot|
+---+------+
```
PySpark equivalent of Fugue transform
```python
import pandas as pd
from typing import Iterator, Union
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import StructType

spark_session = SparkSession.builder.getOrCreate()

def mapping_wrapper(dfs: Iterator[pd.DataFrame], mapping):
    for df in dfs:
        yield map_letter_to_food(df, mapping)

def run_map_letter_to_food(input_df: Union[DataFrame, pd.DataFrame], mapping):
    # convert a pandas DataFrame to Spark if needed; Spark DataFrames
    # are immutable, so no copy is required
    if isinstance(input_df, pd.DataFrame):
        sdf = spark_session.createDataFrame(input_df.copy())
    else:
        sdf = input_df
    schema = StructType(list(sdf.schema.fields))
    return sdf.mapInPandas(lambda dfs: mapping_wrapper(dfs, mapping),
                           schema=schema)

result = run_map_letter_to_food(input_df, map_dict)
result.show()
```
This syntax is simpler, cleaner, and more maintainable than the PySpark equivalent. At the same time, no edits were made to the original pandas-based function to bring it to Spark. It is still usable on pandas DataFrames. Because the Spark execution engine was used, the returned df is now a Spark DataFrame. Fugue transform() also supports Dask, Ray and pandas as execution engines.
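To see why a plain pandas function distributes so cleanly, here is a pandas-only sketch of the partition-wise model: an engine splits the data into chunks, applies the function to each chunk independently, and concatenates the results. The two-chunk split below is purely illustrative, not Fugue's actual partitioning logic:

```python
import pandas as pd
from typing import Dict

input_df = pd.DataFrame({"id": [0, 1, 2], "value": ["A", "B", "C"]})
map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

def map_letter_to_food(df: pd.DataFrame, mapping: Dict[str, str]) -> pd.DataFrame:
    df["value"] = df["value"].map(mapping)
    return df

# split into "partitions", apply the function to each independently,
# then recombine -- conceptually what an engine does across workers
partitions = [input_df.iloc[:2].copy(), input_df.iloc[2:].copy()]
result = pd.concat([map_letter_to_food(p, map_dict) for p in partitions])
print(result["value"].tolist())  # ['Apple', 'Banana', 'Carrot']
```

Because the function never sees more than one chunk at a time, it works unchanged whether the chunks live on one machine or many.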
FugueSQL is a SQL-based language capable of expressing end-to-end workflows. The map_letter_to_food() function above is used in the SQL expression below. This is how to use a Python-defined transformer along with the standard SQL SELECT statement.
```python
import json
from fugue_sql import fsql

query = """
SELECT id, value FROM input_df
TRANSFORM USING map_letter_to_food(mapping={{mapping}}) SCHEMA *
PRINT
"""
map_dict_str = json.dumps(map_dict)
fsql(query, mapping=map_dict_str).run()
```
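The {{mapping}} template parameter above is filled with a JSON string; json.dumps produces that string from the plain Python dict:

```python
import json

map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}
map_dict_str = json.dumps(map_dict)
print(map_dict_str)  # {"A": "Apple", "B": "Banana", "C": "Carrot"}
```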
For FugueSQL, we can change the engine by passing it to the run() method: `fsql(query, mapping=map_dict_str).run(spark)`.
There is an accompanying notebook extension for FugueSQL that lets users use the %%fsql cell magic. The extension also provides syntax highlighting for FugueSQL cells. It works for both classic notebook and Jupyter Lab. More details can be found in the installation instructions.
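With the extension installed, a cell can run FugueSQL directly; a sketch, assuming the same query as above and an engine name after the magic:

```
%%fsql spark
SELECT id, value FROM input_df
TRANSFORM USING map_letter_to_food(mapping={{mapping}}) SCHEMA *
PRINT
```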
By being an abstraction layer, Fugue can be used with a lot of other open-source projects seamlessly.
Fugue can use the following projects as backends:
Fugue is available as a backend or can integrate with the following projects:
View some of our latest conferences presentations and content. For a more complete list, check the Resources page in the tutorials.
Feel free to message us on Slack. We also have [contributing instructions](CONTRIBUTING.md).