API#
Create Bags#
from_sequence(seq[, partition_size, npartitions])
Create a dask Bag from Python sequence.
from_delayed(values)
Create bag from many dask Delayed objects.
from_url(urls)
Create a dask Bag from a URL.
range(n, npartitions)
Numbers from zero to n
read_text(urlpath[, blocksize, compression, ...])
Read lines from text files
read_avro(urlpath[, blocksize, ...])
Read set of avro files
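As a minimal sketch (assuming ``dask`` is installed), bags can be built from in-memory sequences or integer ranges:

```python
import dask.bag as db

# Build a bag from an in-memory sequence, split into two partitions
b = db.from_sequence(range(10), npartitions=2)
nums = b.compute()  # materialize back to a plain Python list

# range is a shortcut for a bag of the integers 0..n-1
r = db.range(5, npartitions=1)
small = r.compute()
```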
From dataframe#
DataFrame.to_bag([index, format])
Create a Dask Bag from a DataFrame
Series.to_bag([index, format])
Create a Dask Bag from a Series
Top-level functions#
concat(bags)
Concatenate many bags together, unioning all elements.
map(func, *args, **kwargs)
Apply a function elementwise across one or more bags.
map_partitions(func, *args, **kwargs)
Apply a function to every partition across one or more bags.
to_textfiles(b, path[, name_function, ...])
Write dask Bag to disk, one filename per partition, one line per element.
zip(*bags)
Partition-wise bag zip
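A small sketch of the top-level functions (assumes ``dask`` is installed); note that ``zip`` is partition-wise, so the input bags should have the same number of partitions:

```python
import dask.bag as db

a = db.from_sequence([1, 2], npartitions=1)
b = db.from_sequence([3, 4], npartitions=1)

# concat unions all elements of the input bags
combined = db.concat([a, b]).compute()

# zip pairs up corresponding elements partition by partition
pairs = db.zip(a, b).compute()
```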
Random Sampling#
random.choices(population[, k, split_every])
Return a k sized list of elements chosen with replacement.
random.sample(population, k[, split_every])
Chooses k unique random elements from a bag.
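A minimal sketch of the sampling helpers (assuming ``dask`` is installed): ``sample`` draws without replacement, ``choices`` with replacement:

```python
import dask.bag as db
from dask.bag import random as bag_random

b = db.from_sequence(range(100), npartitions=4)

# sample draws k unique elements without replacement
picked = bag_random.sample(b, k=5).compute()

# choices draws k elements with replacement (duplicates possible)
chosen = bag_random.choices(b, k=5).compute()
```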
Turn Bags into other things#
Bag.to_textfiles(path[, name_function, ...])
Write dask Bag to disk, one filename per partition, one line per element.
Bag.to_dataframe([meta, columns, optimize_graph])
Create Dask Dataframe from a Dask Bag.
Bag.to_delayed([optimize_graph])
Convert into a list of dask.delayed objects, one per partition.
Bag.to_avro(filename, schema[, ...])
Write bag to set of avro files
Bag Methods#
Bag(dsk, name, npartitions)
Parallel collection of Python objects
Bag.accumulate(binop[, initial])
Repeatedly apply binary function to a sequence, accumulating results.
Bag.all([split_every])
Are all elements truthy?
Bag.any([split_every])
Are any of the elements truthy?
Bag.compute(**kwargs)
Compute this dask collection
Bag.count([split_every])
Count the number of elements.
Bag.distinct([key])
Distinct elements of collection
Bag.filter(predicate)
Filter elements in collection by a predicate function.
Bag.flatten()
Concatenate nested lists into one long list.
Bag.fold(binop[, combine, initial, ...])
Parallelizable reduction
Bag.foldby(key, binop[, initial, combine, ...])
Combined reduction and groupby.
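``foldby`` fuses a groupby with a reduction into a single pass; a small sketch (assuming ``dask`` is installed):

```python
import dask.bag as db

b = db.from_sequence([1, 2, 3, 4, 5, 6], npartitions=2)

# Group by parity, then sum within each group; binop folds within
# partitions and, by default, also combines the per-partition results
sums = dict(
    b.foldby(key=lambda x: x % 2,
             binop=lambda total, x: total + x,
             initial=0).compute()
)
```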
Bag.frequencies([split_every, sort])
Count number of occurrences of each distinct element.
Bag.groupby(grouper[, method, npartitions, ...])
Group collection by key function
Bag.join(other, on_self[, on_other])
Joins collection with another collection.
Bag.map(func, *args, **kwargs)
Apply a function elementwise across one or more bags.
Bag.map_partitions(func, *args, **kwargs)
Apply a function to every partition across one or more bags.
Bag.max([split_every])
Maximum element
Bag.mean()
Arithmetic mean
Bag.min([split_every])
Minimum element
Bag.persist(**kwargs)
Persist this dask collection into memory
Bag.pluck(key[, default])
Select item from all tuples/dicts in collection.
Bag.product(other)
Cartesian product between two bags.
Bag.reduction(perpartition, aggregate[, ...])
Reduce collection with reduction operators.
Bag.random_sample(prob[, random_state])
Return elements from bag with probability of prob.
Bag.remove(predicate)
Remove elements in collection that match predicate.
Bag.repartition([npartitions, partition_size])
Repartition Bag across new divisions.
Bag.starmap(func, **kwargs)
Apply a function using argument tuples from the given bag.
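``starmap`` unpacks each element of the bag as positional arguments, mirroring ``itertools.starmap``; for example (assuming ``dask`` is installed):

```python
import dask.bag as db

b = db.from_sequence([(1, 2), (3, 4), (5, 6)], npartitions=1)

# Each tuple is unpacked into positional arguments of the function
totals = b.starmap(lambda x, y: x + y).compute()
```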
Bag.std([ddof])
Standard deviation
Bag.sum([split_every])
Sum all elements
Bag.take(k[, npartitions, compute, warn])
Take the first k elements.
Bag.to_avro(filename, schema[, ...])
Write bag to set of avro files
Bag.to_dataframe([meta, columns, optimize_graph])
Create Dask Dataframe from a Dask Bag.
Bag.to_delayed([optimize_graph])
Convert into a list of dask.delayed objects, one per partition.
Bag.to_textfiles(path[, name_function, ...])
Write dask Bag to disk, one filename per partition, one line per element.
Bag.topk(k[, key, split_every])
K largest elements in collection
Bag.var([ddof])
Variance
Bag.visualize([filename, format, optimize_graph])
Render the computation of this object's task graph using graphviz.
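A short sketch tying a few of these methods together (assumes ``dask`` is installed):

```python
import dask.bag as db

b = db.from_sequence([1, 2, 1, 3, 2, 1], npartitions=2)

# frequencies yields (element, count) pairs
counts = dict(b.frequencies().compute())

# fold is a parallelizable reduction over all elements
total = b.fold(lambda acc, x: acc + x, initial=0).compute()

# topk returns the k largest elements, in descending order
largest = b.topk(2).compute()
```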
Item Methods#
Item(dsk, key[, layer])
Item.apply(func)
Item.compute(**kwargs)
Compute this dask collection
Item.from_delayed(value)
Create bag item from a dask.delayed value.
Item.persist(**kwargs)
Persist this dask collection into memory
Item.to_delayed([optimize_graph])
Convert into a dask.delayed object.
Item.visualize([filename, format, ...])
Render the computation of this object's task graph using graphviz.
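Reductions on a Bag return an ``Item``, a lazy single-valued collection; a minimal sketch (assuming ``dask`` is installed):

```python
import dask.bag as db

b = db.from_sequence([1, 2, 3], npartitions=1)

# sum returns an Item rather than a concrete number
item = b.sum()
value = item.compute()

# apply transforms the wrapped value, still lazily
doubled = item.apply(lambda x: x * 2).compute()
```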