krzjoa/awesome-python-data-science

Awesome Python Data Science


Probably the best curated list of data science software in Python

Contents

Machine Learning

General Purpose Machine Learning

Gradient Boosting

Ensemble Methods

Imbalanced Datasets

Random Forests

Kernel Methods

Deep Learning

PyTorch

TensorFlow

JAX

  • JAX - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.
  • FLAX - A neural network library for JAX that is designed for flexibility.
  • Optax - A gradient processing and optimization library for JAX.

Others

Automated Machine Learning

Natural Language Processing

Computer Audition

  • torchaudio - An audio library for PyTorch. PyTorch based/compatible
  • librosa - Python library for audio and music analysis.
  • Yaafe - Audio features extraction.
  • aubio - A library for audio and music analysis.
  • Essentia - Library for audio and music analysis, description, and synthesis.
  • LibXtract - A simple, portable, lightweight library of audio feature extraction functions.
  • Marsyas - Music Analysis, Retrieval, and Synthesis for Audio Signals.
  • muda - A library for augmenting annotated audio data.
  • madmom - Python audio and music signal processing library.

Computer Vision

Time Series

  • sktime - A unified framework for machine learning with time series. sklearn
  • skforecast - Time series forecasting with machine learning models.
  • darts - A Python library for easy manipulation and forecasting of time series.
  • statsforecast - Lightning fast forecasting with statistical and econometric models.
  • mlforecast - Scalable machine learning-based time series forecasting.
  • neuralforecast - Scalable neural network-based time series forecasting.
  • tslearn - Machine learning toolkit dedicated to time-series data. sklearn
  • tick - Module for statistical learning, with a particular emphasis on time-dependent modeling. sklearn
  • greykite - A flexible, intuitive, and fast forecasting library.
  • Prophet - Automatic Forecasting Procedure.
  • PyFlux - Open source time series library for Python.
  • bayesloop - Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.
  • luminol - Anomaly Detection and Correlation library.
  • dateutil - Powerful extensions to the standard datetime module.
  • maya - Makes it easy to parse date strings and convert between time zones.
  • Chaos Genius - ML-powered analytics engine for outlier/anomaly detection and root cause analysis.

Reinforcement Learning

  • Gymnasium - An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym).
  • PettingZoo - An API standard for multi-agent reinforcement learning environments, with popular reference environments and related utilities.
  • MAgent2 - An engine for high performance multi-agent environments with very large numbers of agents, along with a set of reference environments.
  • Stable Baselines3 - A set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.
  • Shimmy - An API conversion tool for popular external reinforcement learning environments.
  • EnvPool - C++-based high-performance parallel environment execution engine (vectorized env) for general RL environments.
  • RLlib - Scalable Reinforcement Learning.
  • Tianshou - An elegant PyTorch deep reinforcement learning library. PyTorch based/compatible
  • Acme - A library of reinforcement learning components and agents.
  • Catalyst-RL - PyTorch framework for RL research. PyTorch based/compatible
  • d3rlpy - An offline deep reinforcement learning library.
  • DI-engine - OpenDILab Decision AI Engine. PyTorch based/compatible
  • TF-Agents - A library for Reinforcement Learning in TensorFlow. TensorFlow
  • TensorForce - A TensorFlow library for applied reinforcement learning. TensorFlow
  • TRFL - TensorFlow Reinforcement Learning. TensorFlow
  • Dopamine - A research framework for fast prototyping of reinforcement learning algorithms.
  • keras-rl - Deep Reinforcement Learning for Keras. Keras compatible
  • garage - A toolkit for reproducible reinforcement learning research.
  • Horizon - A platform for Applied Reinforcement Learning.
  • rlpyt - Reinforcement Learning in PyTorch. PyTorch based/compatible
  • cleanrl - High-quality single-file implementations of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG).
  • Machin - A reinforcement learning library designed for PyTorch. PyTorch based/compatible
  • SKRL - Modular reinforcement learning library (on PyTorch and JAX) with support for NVIDIA Isaac Gym, Isaac Orbit and Omniverse Isaac Gym. PyTorch based/compatible
  • Imitation - Clean PyTorch implementations of imitation and reward learning algorithms. PyTorch based/compatible

Graph Machine Learning

Learning-to-Rank & Recommender Systems

Probabilistic Graphical Models

Probabilistic Methods

Model Explanation

  • dalex - moDel Agnostic Language for Exploration and explanation. sklearn R inspired/ported lib
  • Shapley - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
  • Alibi - Algorithms for monitoring and explaining machine learning models.
  • anchor - Code for "High-Precision Model-Agnostic Explanations" paper.
  • aequitas - Bias and Fairness Audit Toolkit.
  • Contrastive Explanation - Contrastive Explanation (Foil Trees). sklearn
  • yellowbrick - Visual analysis and diagnostic tools to facilitate machine learning model selection. sklearn
  • scikit-plot - An intuitive library to add plotting functionality to scikit-learn objects. sklearn
  • shap - A unified approach to explain the output of any machine learning model. sklearn
  • ELI5 - A library for debugging/inspecting machine learning classifiers and explaining their predictions.
  • Lime - Explaining the predictions of any machine learning classifier. sklearn
  • FairML - A Python toolbox for auditing machine learning models for bias. sklearn
  • L2X - Code for replicating the experiments in the paper Learning to Explain: An Information-Theoretic Perspective on Model Interpretation.
  • PDPbox - Partial dependence plot toolbox.
  • PyCEbox - Python Individual Conditional Expectation Plot Toolbox.
  • Skater - Python Library for Model Interpretation.
  • model-analysis - Model analysis tools for TensorFlow. TensorFlow
  • themis-ml - A library that implements fairness-aware machine learning algorithms. sklearn
  • treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions. sklearn
  • AI Explainability 360 - Interpretability and explainability of data and machine learning models.
  • Auralisation - Auralisation of learned features in CNN (for audio).
  • CapsNet-Visualization - A visualization of the CapsNet layers to better understand how it works.
  • lucid - A collection of infrastructure and tools for research in neural network interpretability.
  • Netron - Visualizer for deep learning and machine learning models (no Python code, but visualizes models from most Python Deep Learning frameworks).
  • FlashLight - Visualization Tool for your NeuralNetwork.
  • tensorboard-pytorch - Tensorboard for PyTorch (and chainer, mxnet, numpy, ...).

Genetic Programming

Optimization

  • Optuna - A hyperparameter optimization framework.
  • pymoo - Multi-objective Optimization in Python.
  • pycma - Python implementation of CMA-ES.
  • Spearmint - Bayesian optimization.
  • BoTorch - Bayesian optimization in PyTorch. PyTorch based/compatible
  • scikit-opt - Heuristic Algorithms for optimization.
  • sklearn-genetic-opt - Hyperparameters tuning and feature selection using evolutionary algorithms. sklearn
  • SMAC3 - Sequential Model-based Algorithm Configuration.
  • Optunity - A library containing various optimizers for hyperparameter tuning.
  • hyperopt - Distributed Asynchronous Hyperparameter Optimization in Python.
  • hyperopt-sklearn - Hyper-parameter optimization for sklearn. sklearn
  • sklearn-deap - Use evolutionary algorithms instead of gridsearch in scikit-learn. sklearn
  • sigopt_sklearn - SigOpt wrappers for scikit-learn methods. sklearn
  • Bayesian Optimization - A Python implementation of global optimization with Gaussian processes.
  • SafeOpt - Safe Bayesian Optimization.
  • scikit-optimize - Sequential model-based optimization with a scipy.optimize interface.
  • Solid - A comprehensive gradient-free optimization framework written in Python.
  • PySwarms - A research toolkit for particle swarm optimization in Python.
  • Platypus - A Free and Open Source Python Library for Multiobjective Optimization.
  • GPflowOpt - Bayesian Optimization using GPflow. TensorFlow
  • POT - Python Optimal Transport library.
  • Talos - Hyperparameter Optimization for Keras Models.
  • nlopt - Library for nonlinear optimization (global and local, constrained or unconstrained).
  • OR-Tools - An open-source software suite for optimization by Google; provides a unified programming interface to a half dozen solvers: SCIP, GLPK, GLOP, CP-SAT, CPLEX, and Gurobi.

Feature Engineering

General

  • Featuretools - Automated feature engineering.
  • Feature Engine - Feature engineering package with sklearn-like functionality. sklearn
  • OpenFE - Automated feature generation with expert-level performance.
  • skl-groups - A scikit-learn addon to operate on set/"group"-based features. sklearn
  • Feature Forge - A set of tools for creating and testing machine learning features. sklearn
  • few - A feature engineering wrapper for sklearn. sklearn
  • scikit-mdr - A sklearn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction. sklearn
  • tsfresh - Automatic extraction of relevant features from time series. sklearn
  • dirty_cat - Machine learning on dirty tabular data (especially: string-based variables for classification and regression). sklearn
  • NitroFE - Moving window features. sklearn
  • sk-transformer - A collection of various pandas & scikit-learn compatible transformers for all kinds of preprocessing and feature engineering steps. pandas compatible

Feature Selection

  • scikit-feature - Feature selection repository in Python.
  • boruta_py - Implementations of the Boruta all-relevant feature selection method. sklearn
  • BoostARoota - A fast xgboost feature selection algorithm. sklearn
  • scikit-rebate - A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning. sklearn
  • zoofs - A feature selection library based on evolutionary algorithms.

Visualization

General Purposes

  • Matplotlib - Plotting with Python.
  • seaborn - Statistical data visualization using matplotlib.
  • prettyplotlib - Painlessly create beautiful matplotlib plots.
  • python-ternary - Ternary plotting library for Python with matplotlib.
  • missingno - Missing data visualization module for Python.
  • chartify - Python library that makes it easy for data scientists to create charts.
  • physt - Improved histograms.

Interactive plots

  • animatplot - A Python package for animating plots built on matplotlib.
  • plotly - A Python library that makes interactive and publication-quality graphs.
  • Bokeh - Interactive Web Plotting for Python.
  • Altair - Declarative statistical visualization library for Python. Many data transformations can be done within the code that creates the graph.
  • bqplot - Plotting library for IPython/Jupyter notebooks.
  • pyecharts - A Python interactive charting library built on Echarts, a charting and visualization library.

Map

  • folium - Makes it easy to visualize data on an interactive OpenStreetMap map.
  • geemap - Python package for interactive mapping with Google Earth Engine (GEE).

Automatic Plotting

  • HoloViews - Stop plotting your data - annotate your data and let it visualize itself.
  • AutoViz - Visualize data automatically with one line of code (ideal for machine learning).
  • SweetViz - Visualize and compare datasets, target values and associations, with one line of code.

NLP

  • pyLDAvis - Interactive topic model visualization.

Deployment

  • fastapi - A modern, fast (high-performance) web framework for building APIs with Python.
  • streamlit - Makes it easy to deploy machine learning models.
  • streamsync - No-code in the front, Python in the back. An open-source framework for creating data apps.
  • gradio - Create UIs for your machine learning model in Python in 3 minutes.
  • Vizro - A toolkit for creating modular data visualization applications.
  • datapane - A collection of APIs to turn scripts and notebooks into interactive reports.
  • binder - Enables sharing and executing Jupyter Notebooks.

Statistics

  • pandas_summary - Extension to pandas dataframes describe function. pandas compatible
  • Pandas Profiling - Create HTML profiling reports from pandas DataFrame objects. pandas compatible
  • statsmodels - Statistical modeling and econometrics in Python.
  • stockstats - Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline stock statistics/indicators support.
  • weightedcalcs - A pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.
  • scikit-posthocs - Pairwise Multiple Comparisons Post-hoc Tests.
  • Alphalens - Performance analysis of predictive (alpha) stock factors.

Data Manipulation

Data Frames

  • pandas - Powerful Python data analysis toolkit.
  • polars - A fast multi-threaded, hybrid-out-of-core DataFrame library.
  • Arctic - High-performance datastore for time series and tick data.
  • datatable - Data.table for Python. R inspired/ported lib
  • pandas_profiling - Create HTML profiling reports from pandas DataFrame objects.
  • cuDF - GPU DataFrame Library. pandas compatible GPU accelerated
  • blaze - NumPy and pandas interface to Big Data. pandas compatible
  • pandasql - Allows you to query pandas DataFrames using SQL syntax. pandas compatible
  • pandas-gbq - pandas Google Big Query. pandas compatible
  • xpandas - Universal 1d/2d data containers with Transformers functionality for data analysis by The Alan Turing Institute.
  • pysparkling - A pure Python implementation of Apache Spark's RDD and DStream interfaces. Apache Spark based
  • modin - Speed up your pandas workflows by changing a single line of code. pandas compatible
  • swifter - A package that efficiently applies any function to a pandas dataframe or series in the fastest available manner.
  • pandas-log - A package that allows providing feedback about basic pandas operations and finds both business logic and performance issues.
  • vaex - Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second.
  • xarray - Xarray combines the best features of NumPy and pandas for multidimensional data selection by supplementing numerical axis labels with named dimensions for more intuitive, concise, and less error-prone indexing routines.

Pipelines

Data-centric AI

  • cleanlab - The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
  • snorkel - A system for quickly generating training data with weak supervision.
  • dataprep - Collect, clean, and visualize your data in Python with a few lines of code.

Synthetic Data

Distributed Computing

  • Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. TensorFlow
  • PySpark - Exposes the Spark programming model to Python. Apache Spark based
  • Veles - Distributed machine learning platform.
  • Jubatus - Framework and Library for Distributed Online Machine Learning.
  • DMTK - Microsoft Distributed Machine Learning Toolkit.
  • PaddlePaddle - PArallel Distributed Deep LEarning.
  • dask-ml - Distributed and parallel machine learning. sklearn
  • Distributed - Distributed computation in Python.

Experimentation

  • mlflow - Open source platform for the machine learning lifecycle.
  • Neptune - A lightweight ML experiment tracking, results visualization, and management tool.
  • dvc - Data Version Control | Git for Data & Models | ML Experiments Management.
  • envd - 🏕️ Machine learning development environment for data science and AI/ML engineering teams.
  • Sacred - A tool to help you configure, organize, log, and reproduce experiments.
  • Ax - Adaptive Experimentation Platform. PyTorch based/compatible

Data Validation

  • great_expectations - Always know what to expect from your data.
  • pandera - A lightweight, flexible, and expressive statistical data testing library.
  • deepchecks - Validation & testing of ML models and data during model development, deployment, and production. sklearn
  • evidently - Evaluate and monitor ML models from validation to production.
  • TensorFlow Data Validation - Library for exploring and validating machine learning data.
  • DataComPy - A library to compare Pandas, Polars, and Spark data frames. It provides stats and lets users adjust for match accuracy.

Evaluation

  • recmetrics - Library of useful metrics and plots for evaluating recommender systems.
  • Metrics - Machine learning evaluation metrics.
  • sklearn-evaluation - Model evaluation made easy: plots, tables, and markdown reports. sklearn
  • AI Fairness 360 - Fairness metrics for datasets and ML models, explanations, and algorithms to mitigate bias in datasets and models.

Computations

  • numpy - The fundamental package needed for scientific computing with Python.
  • Dask - Parallel computing with task scheduling. pandas compatible
  • bottleneck - Fast NumPy array functions written in C.
  • CuPy - NumPy-like API accelerated with CUDA.
  • scikit-tensor - Python library for multilinear algebra and tensor factorizations.
  • numdifftools - Solve automatic numerical differentiation problems in one or more variables.
  • quaternion - Add built-in support for quaternions to numpy.
  • adaptive - Tools for adaptive and parallel sampling of mathematical functions.
  • NumExpr - A fast numerical expression evaluator for NumPy that comes with an integrated computing virtual machine to speed calculations up by avoiding memory allocation for intermediate results.

Web Scraping

  • BeautifulSoup - The easiest library for beginners to scrape static websites.
  • Scrapy - Fast and extensible scraping library. Lets you write rules and build a customized scraper without touching the core.
  • Selenium - Use the Selenium Python API to access all functionality of Selenium WebDriver in an intuitive way, like a real user.
  • Pattern - High-level scraping for well-established websites such as Google, Twitter, and Wikipedia. Also has NLP, machine learning algorithms, and visualization.
  • twitterscraper - Efficient library to scrape Twitter.

Spatial Analysis

Quantum Computing

  • qiskit - Qiskit is an open-source SDK for working with quantum computers at the level of circuits, algorithms, and application modules.
  • cirq - A Python framework for creating, editing, and invoking Noisy Intermediate Scale Quantum (NISQ) circuits.
  • PennyLane - Quantum machine learning, automatic differentiation, and optimization of hybrid quantum-classical computations.
  • QML - A Python Toolkit for Quantum Machine Learning.

Conversion

  • sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript, and others.
  • ONNX - Open Neural Network Exchange.
  • MMdnn - A set of tools to help users inter-operate among different deep learning frameworks.
  • treelite - Universal model exchange and serialization format for decision tree forests.

Contributing

Contributions are welcome! 😎
Read the contribution guideline.

License

This work is licensed under the Creative Commons Attribution 4.0 International License - CC BY 4.0
