dask-memusage

A low-impact profiler to figure out how much memory each task in Dask is using

Popularity: 0.9 (Stable)

Activity: 0.0 (Stable)

Stars: 24

Watchers: 2

Forks: 1

Last commit: almost 3 years ago

Description

If you're using Dask with tasks that use a lot of memory, RAM is your bottleneck for parallelism. That means you want to know how much memory each task uses:

1. So you can set the highest parallelism level (processes or threads) for each machine, given the available RAM.
2. So you know where to focus memory optimization efforts.

dask-memusage is an MIT-licensed statistical memory profiler for Dask's Distributed scheduler that can help you with both these problems.

dask-memusage polls your processes for memory usage and records the minimum and maximum usage in a CSV.

Programming language: Python

License: MIT License

Tags: Profiler, Science And Data Analysis, Scientific, Distributed Computing

Latest version: v1.1

dask-memusage alternatives and similar packages

Based on the "Science and Data Analysis" category.

Pandas (popularity 9.9, activity 9.9, code quality L2)
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

NumPy (popularity 9.8, activity 10.0, code quality L1)
The fundamental package for scientific computing with Python.

SciPy (popularity 9.4, activity 10.0, code quality L2)
SciPy library main repository

SymPy (popularity 9.4, activity 9.9, code quality L2)
A computer algebra system written in pure Python

NetworkX (popularity 9.3, activity 9.6, code quality L3)
Network Analysis in Python

Dask (popularity 9.2, activity 9.4, code quality L2)
Parallel computing with task scheduling

statsmodels (popularity 9.2, activity 9.5, code quality L3)
Statsmodels: statistical modeling and econometrics in Python

PyGWalker (popularity 9.1, activity 5.9)
PyGWalker: Turn your dataframe into an interactive UI for visual analysis

PyMC (popularity 8.9, activity 9.3, code quality L4)
Bayesian Modeling and Probabilistic Programming in Python

Numba (popularity 8.8, activity 9.8, code quality L3)
NumPy aware dynamic Python compiler using LLVM

astropy (popularity 8.4, activity 9.9, code quality L2)
Astronomy and astrophysics core library

Biopython (popularity 8.3, activity 9.1, code quality L2)
Official git repository for Biopython (originally converted from CVS)

orange (popularity 8.2, activity 9.6, code quality L2)
Orange: Interactive data analysis

RDKit (popularity 7.5, activity 9.6, code quality L1)
The official sources for the RDKit library

Statsforecast (popularity 7.5, activity 7.4)
Lightning ⚡️ fast forecasting with statistical and econometric models.

Interactive Parallel Computing with IPython (popularity 7.3, activity 8.0, code quality L3)
IPython Parallel: Interactive Parallel Computing in Python

blaze (popularity 7.1, activity 0.0, code quality L4)
NumPy and Pandas interface to Big Data

Fugue (popularity 5.8, activity 5.5)
A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

Cubes (popularity 5.8, activity 0.0, code quality L3)
[NOT MAINTAINED] Light-weight Python OLAP framework for multi-dimensional data analysis

Open Mining (popularity 5.7, activity 0.0, code quality L3)
DISCONTINUED. Business Intelligence (BI) in Python, OLAP

bcbio-nextgen (popularity 5.4, activity 6.2, code quality L3)
Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

NIPY (popularity 5.4, activity 6.7, code quality L3)
Workflows and interfaces for neuroimaging packages

bcolz (popularity 4.7, activity 0.0)
DISCONTINUED. A columnar data container that can be compressed.

bccb (popularity 4.5, activity 4.4, code quality L4)
Incubator for useful bioinformatics code, primarily in Python and R

Neupy (popularity 4.4, activity 0.0, code quality L5)
NeuPy is a Tensorflow based python library for prototyping and building neural networks

Bubbles (popularity 3.7, activity 0.0, code quality L5)
[NOT MAINTAINED] Bubbles – Python ETL framework

PyDy (popularity 3.6, activity 9.0, code quality L3)
Multibody dynamics tool kit.

harold (popularity 2.5, activity 1.8, code quality L2)
An open-source systems and controls toolbox for Python3

signac (popularity 2.5, activity 8.4)
Manage large and heterogeneous data spaces on the file system.

PatZilla (popularity 2.3, activity 1.8)
PatZilla is a modular patent information research platform and data integration toolkit with a modern user interface and access to multiple data sources.

LynxKite (popularity 2.2, activity 7.1)
The complete graph data science platform

Kotori (popularity 2.1, activity 2.0)
A flexible data historian based on InfluxDB, Grafana, MQTT, and more. Free, open, simple.

Terkin (popularity 1.8, activity 0.0)
Datalogger for MicroPython and CPython.

ElasticBatch (popularity 0.9, activity 0.0)
Elasticsearch tool for easily collecting and batch inserting Python data and pandas DataFrames

cclib (popularity 0.9)
A library for parsing and interpreting the results of computational chemistry packages.

Open Babel (no scores listed)
A chemical toolbox designed to speak the many languages of chemical data.

* Code Quality Rankings and insights are calculated and provided by Lumnify.
They vary from L1 to L5 with "L5" being the highest.

README

dask-memusage

If you're using Dask with tasks that use a lot of memory, RAM is your bottleneck for parallelism. That means you want to know how much memory each task uses:

So you can set the highest parallelism level (processes or threads) for each machine, given the available RAM.
In order to know where to focus memory optimization efforts.

dask-memusage is an MIT-licensed statistical memory profiler for Dask's Distributed scheduler that can help you with both these problems.
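
As a rough illustration of the first goal, the sizing arithmetic can be sketched in a few lines of Python. The function name `max_workers`, the `headroom_fraction` parameter, and the 16,000 MB machine are assumptions made up for this sketch and are not part of dask-memusage. It also assumes every worker may hit its peak at the same time, which is a deliberately pessimistic simplification.

```python
def max_workers(available_ram_mb, peak_task_mb, headroom_fraction=0.1):
    """Estimate how many single-threaded workers fit in RAM.

    Assumes each worker can reach peak_task_mb simultaneously and that
    headroom_fraction of the RAM is kept free for everything else.
    """
    if peak_task_mb <= 0:
        raise ValueError("peak_task_mb must be positive")
    usable_mb = available_ram_mb * (1 - headroom_fraction)
    return int(usable_mb // peak_task_mb)


# Hypothetical machine with 16,000 MB of RAM and a task peaking near 97 MB.
print(max_workers(16_000, 97.015625))
```

A real deployment would probably want a larger safety margin, since the measured peak comes from sampling and can miss short spikes.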

dask-memusage polls your processes for memory usage and records the minimum and maximum usage in a CSV:

task_key,min_memory_mb,max_memory_mb
"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 0)",44.84765625,96.98046875
"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 1)",47.015625,97.015625
"('sum-part-e15703211a549e75b11c63e0054b53e5', 0)",0,0
"('sum-part-e15703211a549e75b11c63e0054b53e5', 1)",0,0
sum-aggregate-apply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,0,0
apply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,47.265625,47.265625
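
To find the tasks worth optimizing, the CSV can be ranked by peak usage with Python's standard library. This is a minimal sketch that assumes the column names shown above. The helper name `peak_memory_by_task` is hypothetical, and the sample rows are inlined so the snippet runs without a real file.

```python
import csv
import io

# Sample rows copied from the output above; a real run would instead use
# open("/tmp/memusage.csv", newline="").
SAMPLE_CSV = '''task_key,min_memory_mb,max_memory_mb
"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 0)",44.84765625,96.98046875
"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 1)",47.015625,97.015625
"('sum-part-e15703211a549e75b11c63e0054b53e5', 0)",0,0
"('sum-part-e15703211a549e75b11c63e0054b53e5', 1)",0,0
sum-aggregate-apply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,0,0
apply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,47.265625,47.265625
'''


def peak_memory_by_task(csv_file):
    """Return (task_key, max_memory_mb) pairs, highest peak first."""
    rows = csv.DictReader(csv_file)
    peaks = [(row["task_key"], float(row["max_memory_mb"])) for row in rows]
    return sorted(peaks, key=lambda pair: pair[1], reverse=True)


ranked = peak_memory_by_task(io.StringIO(SAMPLE_CSV))
for task_key, peak_mb in ranked[:3]:
    print(f"{peak_mb:8.2f} MB  {task_key}")
```

Rows reporting 0 for both columns are best read as "no measurement" rather than "no memory used".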

Usage

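Important: Make sure each worker has only a single thread! Memory is measured per worker process, so if several tasks run concurrently in one process their usage cannot be told apart and the results will be wrong.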

Installation

On the machine where you are running the Distributed scheduler, run:

$ pip install dask_memusage

Or if you're using Conda:

$ conda install -c conda-forge dask-memusage

API usage

# Add to your Scheduler object, which is e.g. your LocalCluster's scheduler
# attribute:
from dask_memusage import install
install(scheduler, "/tmp/memusage.csv")

CLI usage

$ dask-scheduler --preload dask_memusage --memusage-csv /tmp/memusage.csv

Limitations

Again, make sure you only have one thread per worker process.
This is statistical profiling: memory is sampled every 10 ms. Tasks that finish faster than that may not be sampled at all, so their numbers won't be accurate.
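
The sampling limitation is easier to see with a toy poller. The sketch below is not dask-memusage's implementation. It is a generic background-thread sampler, with the hypothetical names `PollingSampler` and `traced_mb`, and it uses the standard library's `tracemalloc` as a stand-in for process memory so that it runs without extra dependencies.

```python
import threading
import time
import tracemalloc


class PollingSampler:
    """Record the min/max of a value by polling it at a fixed interval."""

    def __init__(self, read_value, interval_s=0.01):
        self._read_value = read_value
        self._interval_s = interval_s
        self._samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait acts as an interruptible sleep: it returns False on
        # timeout (take a sample) and True once stop has been requested.
        while not self._stop.wait(self._interval_s):
            self._samples.append(self._read_value())

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()

    def min_max(self):
        """Return (min, max) of the samples, or None if none were taken."""
        if not self._samples:
            return None
        return min(self._samples), max(self._samples)


def traced_mb():
    """Python-level allocations currently traced by tracemalloc, in MiB."""
    current_bytes, _peak_bytes = tracemalloc.get_traced_memory()
    return current_bytes / 2**20


tracemalloc.start()
with PollingSampler(traced_mb) as long_task:
    buffer = bytearray(50 * 2**20)  # hold about 50 MiB
    time.sleep(0.2)                 # long enough to be polled many times
    del buffer
    time.sleep(0.05)
with PollingSampler(traced_mb) as short_task:
    pass                            # usually over before the first poll
tracemalloc.stop()

print("long task (min, max) MiB:", long_task.min_max())
print("short task:", short_task.min_max())
```

The long block is polled many times, so its maximum reflects the 50 MiB buffer. The short block normally ends before the first 10 ms poll and yields no samples at all, which is the same effect that makes sub-interval tasks unreliable in a sampling profiler.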

Help

Need help? File a ticket at https://github.com/itamarst/dask-memusage/issues/new

*Note that all licence references and agreements mentioned in the dask-memusage README section above are relevant to that project's source code only.

Awesome Python is part of the LibHunt network. (CC) BY-SA