ElasticBatch is a tool for collecting and batch inserting Python data and pandas DataFrames into Elasticsearch, allowing users to focus on other aspects of data processing.
Elasticsearch buffer for collecting and batch inserting Python data and pandas DataFrames
ElasticBatch makes it easy to efficiently insert batches of data in the form of Python dictionaries or pandas DataFrames into Elasticsearch. An efficient pattern when processing data bound for Elasticsearch is to collect data records ("documents") in a buffer to be bulk-inserted in batches. ElasticBatch provides this functionality to ease the overhead and reduce the code involved in inserting large batches or streams of data into Elasticsearch.
ElasticBatch has been tested with Elasticsearch 7.x, but should work with earlier versions.
ElasticBatch implements the following features (see Usage for examples and more details) that allow a user to:

* Add documents to a buffer as Python dictionaries or pandas DataFrames
* Automatically flush and bulk-insert buffer contents into Elasticsearch once the buffer reaches a specified size
* Use a context manager to flush remaining buffer contents on exiting scope and optionally dump the buffer to disk on failure
* Track the elapsed time of the oldest document in the buffer
* Automatically add Elasticsearch metadata fields (e.g., _index, _id) to each document via user-supplied functions

This package is hosted on PyPI and can be installed via pip:
$ pip install elasticbatch[pandas]
$ pip install elasticbatch
The only dependency of the latter is elasticsearch, whereas the former will also install pandas as a dependency.

To instead install from source:
$ git clone https://github.com/dkaslovsky/ElasticBatch.git
$ cd ElasticBatch
$ pip install ".[pandas]"
To install from source without the pandas dependency, replace the last line above with
$ pip install .
Start by importing the ElasticBuffer class:
>>> from elasticbatch import ElasticBuffer
ElasticBuffer uses sensible defaults when initialized without parameters:
>>> esbuf = ElasticBuffer()
Alternatively, one can pass any of the following parameters:
* size: (int) number of documents the buffer can hold before flushing to Elasticsearch; defaults to 5000.
* client_kwargs: (dict) configuration passed to the underlying elasticsearch.Elasticsearch client; see the Elasticsearch documentation for all available options.
* bulk_kwargs: (dict) configuration passed to the underlying call to elasticsearch.helpers.bulk for bulk insertion; see the Elasticsearch documentation for all available options.
* verbose_errs: (bool) whether verbose (True, default) or truncated (False) exceptions are raised; see Exception Handling for more details.
* dump_dir: (str) directory to write buffer contents to when exiting a context due to a raised exception; defaults to None for not writing to file.
* **metadata_funcs: (callable) functions to apply to each document for adding Elasticsearch metadata; see Automatic Elasticsearch Metadata Fields for more details.

Once initialized, ElasticBuffer exposes two methods, add and flush.
Use add to add documents to the buffer, noting that all documents in the buffer will be flushed and inserted into Elasticsearch once the number of documents exceeds the buffer's size:
>>> docs = [
...     {'_index': 'my-index', 'a': 1, 'b': 2.1, 'c': 'xyz'},
...     {'_index': 'my-index', 'a': 3, 'b': 4.1, 'c': 'xyy'},
...     {'_index': 'my-other-index', 'a': 5, 'b': 6.1, 'c': 'zzz'},
...     {'_index': 'my-other-index', 'a': 7, 'b': 8.1, 'c': 'zyx'},
... ]
>>> esbuf.add(docs)
Note that all metadata fields required for indexing into Elasticsearch (e.g., _index above) must either be included in each document or added programmatically via callable kwarg parameters supplied to the ElasticBuffer instance (see below).
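The buffer-and-flush pattern that ElasticBuffer manages can be sketched in plain Python. This is a simplified stand-in, not ElasticBatch's actual implementation; the on_flush callable below is a hypothetical placeholder for the bulk insert into Elasticsearch:

```python
class SimpleBuffer:
    """Minimal sketch of a size-bounded document buffer."""

    def __init__(self, size=5000, on_flush=print):
        self.size = size
        self.on_flush = on_flush  # stand-in for a bulk insert into Elasticsearch
        self._docs = []

    def add(self, docs):
        self._docs.extend(docs)
        # flush once the number of buffered documents exceeds the buffer's size
        if len(self._docs) > self.size:
            self.flush()

    def flush(self):
        if self._docs:
            self.on_flush(self._docs)
            self._docs = []
```

The real ElasticBuffer adds error handling, metadata functions, and the bulk-insertion call on top of this basic shape.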
To manually force a buffer flush and insert all documents into Elasticsearch, use the flush method, which does not accept any arguments:
>>> esbuf.flush()
A third method, show(), exists mostly for debugging purposes and prints all documents currently in the buffer as newline-delimited JSON.
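The newline-delimited JSON format that show() prints is the same as serializing each document with the standard json module; a sketch (not the package's own code):

```python
import json

def to_ndjson(docs):
    """Serialize a list of dicts as newline-delimited JSON."""
    return "\n".join(json.dumps(doc) for doc in docs)

docs = [
    {"_index": "my-index", "a": 1},
    {"_index": "my-index", "a": 3},
]
print(to_ndjson(docs))
```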
One can directly insert a pandas DataFrame into the buffer and each row will be treated as a document:
>>> import pandas as pd
>>> df = pd.DataFrame(docs)
>>> print(df)
           _index  a    b    c
0        my-index  1  2.1  xyz
1        my-index  3  4.1  xyy
2  my-other-index  5  6.1  zzz
3  my-other-index  7  8.1  zyx
>>> esbuf.add(df)
The DataFrame's index (referring to df.index and not the column named _index) is ignored unless it is named, in which case it is added as an ordinary field (column).
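The row-to-document conversion, including the treatment of a named index, can be approximated with plain pandas. This is a sketch of the behavior described above, not ElasticBatch's internals:

```python
import pandas as pd

def df_to_docs(df):
    """Convert DataFrame rows to documents; a named index becomes an ordinary field."""
    if df.index.name is not None:
        df = df.reset_index()  # promote the named index to a regular column
    return df.to_dict(orient="records")

df = pd.DataFrame({"a": [1, 3], "b": [2.1, 4.1]})
print(df_to_docs(df))   # unnamed index is ignored
df.index.name = "row_id"
print(df_to_docs(df))   # named index appears as a field in each document
```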
ElasticBuffer can also be used as a context manager, offering the advantages of automatically flushing the remaining buffer contents when exiting scope as well as optionally dumping the buffer contents to a file before exiting due to an unhandled exception.
>>> with ElasticBuffer(size=100, dump_dir='/tmp') as esbuf:
...     for doc in document_stream:
...         doc = process_document(doc)  # some user-defined, application-specific processing function
...         esbuf.add(doc)
When using ElasticBuffer in a service consuming messages from some external source, it can be important to track how long messages have been waiting in the buffer to be flushed. In particular, a user may wish to flush, say, every hour to account for the situation where only a trickle of data is coming in and the buffer is not filling up. ElasticBuffer provides the elapsed time (in seconds) that its oldest message has been in the buffer:
>>> esbuf.oldest_elapsed_time
5.687833070755005 # the oldest message was inserted ~5.69 seconds ago
This information can be used to periodically check the elapsed time of the oldest message and force a flush if it exceeds a desired threshold.
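The oldest-elapsed-time bookkeeping can be sketched with a monotonic clock; a simplified illustration, not the package's implementation:

```python
import time

class OldestTimer:
    """Tracks how long the oldest unflushed document has been waiting."""

    def __init__(self):
        self._oldest_added_at = None

    def on_add(self):
        # only the first add after a flush sets the oldest timestamp
        if self._oldest_added_at is None:
            self._oldest_added_at = time.monotonic()

    def on_flush(self):
        self._oldest_added_at = None

    @property
    def oldest_elapsed_time(self):
        if self._oldest_added_at is None:
            return 0.0
        return time.monotonic() - self._oldest_added_at
```

A service loop could then poll this value and call flush whenever it exceeds the desired threshold.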
An ElasticBuffer instance can be initialized with kwargs corresponding to callable functions to add Elasticsearch metadata fields to each document added to the buffer:
>>> def my_index_func(doc): return 'my-index'
>>> def my_id_func(doc): return sum(doc.values())
>>> esbuf = ElasticBuffer(_index=my_index_func, _id=my_id_func)
>>> docs = [
...     {'a': 1, 'b': 2},
...     {'a': 8, 'b': 9},
... ]
>>> esbuf.add(docs)
>>> esbuf.show()
{"a": 1, "b": 2, "_index": "my-index", "_id": 3}
{"a": 8, "b": 9, "_index": "my-index", "_id": 17}
Callable kwargs add key/value pairs to each document, where the key corresponds to the name of the kwarg and the value is the function's return value. Each function must accept one argument (the document as a dict) and return one value. This also works for DataFrames, as they are transformed to documents (dicts) before applying the supplied metadata functions.
The key/value pairs are added to the top-level of each document. Note that the user need not add documents with data nested under a _source key, as metadata fields can be handled at the same level as the data fields. For further details, see the underlying Elasticsearch client bulk insert documentation on handling of metadata fields in flat dicts.
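The application of callable kwargs can be sketched as a plain dict merge; an illustration of the behavior described above, not the package's code:

```python
def apply_metadata(doc, **metadata_funcs):
    """Return a copy of doc with each kwarg's return value added under the kwarg's name."""
    return {**doc, **{name: func(doc) for name, func in metadata_funcs.items()}}

doc = {"a": 1, "b": 2}
print(apply_metadata(doc, _index=lambda d: "my-index", _id=lambda d: sum(d.values())))
# {'a': 1, 'b': 2, '_index': 'my-index', '_id': 3}
```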
For exception handling, ElasticBatch provides the base exception ElasticBatchError:
>>> from elasticbatch import ElasticBatchError
as well as the more specific ElasticBufferFlushError raised on errors flushing to Elasticsearch:
>>> from elasticbatch.exceptions import ElasticBufferFlushError
Elasticsearch exception messages can contain a copy of every document related to a failed bulk insertion request. As such messages can be very large, the verbose_errs flag can be used to optionally truncate the error message. When ElasticBuffer is initialized with verbose_errs=True, the entirety of the error message is returned. When verbose_errs=False, a shorter, descriptive message is returned. In both cases, the full, potentially verbose, exception is available via the err property on the raised ElasticBufferFlushError.
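The verbose/truncated behavior can be illustrated with a simplified stand-in exception; this is hypothetical code, not the package's actual class:

```python
class FlushError(Exception):
    """Stand-in for ElasticBufferFlushError: optionally truncates the message
    while always preserving the full underlying error on the err attribute."""

    def __init__(self, msg, err, verbose=True):
        self.err = err  # the full underlying exception is always preserved
        if verbose:
            msg = f"{msg}: {err}"
        else:
            # omit the potentially huge underlying message, keep only its type
            msg = f"{msg}: {type(err).__name__} (set verbose_errs=True for details)"
        super().__init__(msg)
```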
To run tests:
$ python -m unittest discover -v
The awesome green package is also highly recommended for running tests and reporting test coverage:
$ green -vvr