304 questions
1
vote
1
answer
64
views
How to nest dask.delayed functions within other dask.delayed functions
I am trying to learn dask, and have created the following toy example of a delayed pipeline.
+-----+ +-----+ +-----+
| baz +--+ bar +--+ foo |
+-----+ +-----+ +-----+
So baz has a dependency on ...
0
votes
0
answers
74
views
Importing SQL Table from Snowflake into Jupyter using Dask
I have an SQL table in Snowflake with 100K rows and 15 columns. I want to import this table into my Jupyter notebook using Dask for further analysis. Primarily doing this as a form of practice since I am new ...
2
votes
0
answers
40
views
How to speed up a dask delayed compute of a large dictionary?
I need to run a random forest classifier that I've put into a function ~ 10,000 times - because I sample randomly each time. I am trying to use dask delayed on a slurm-scheduled HPC cluster. My script ...
0
votes
1
answer
116
views
How can I keep Dask workers busy when processing large datasets to prevent them from running out of tasks?
I'm trying to process a large dataset (around 1 million tasks) using Dask distributed computing in Python. (I am getting data from a database to process it, and I am retrieving around 1M rows.) Here I ...
0
votes
1
answer
50
views
How to use a dask cluster as a scheduler for dask.compute
I have a class that has something like the following context manager to create a dask client & cluster:
class some_class():
    def __init__(self, engine_kwargs: dict = None):
        self....
0
votes
0
answers
73
views
Xarray out-of-memory computations operations
Context: I have 4 xarray datasets that are 8 GB, 45 GB, 8 GB and 20 GB (80 GB total). Each has one 3D variable with axes time, y, x. I want to combine them and save the output to disk.
Operation on ...
0
votes
1
answer
72
views
Compute can't handle <NA> values in column that has a float dtype in a dask dataframe
Every time I try to compute the dataframe it fails giving me the following or similar error messages:
Exception: 'ValueError("could not convert string to float: \'<NA>\'")'
Right now, ...
0
votes
1
answer
86
views
How to prevent from_delayed in Dask from creating one partition per input?
My code is meant to match names of two large datasets. The function I use creates a delayed list of matched names.
After applying from_delayed the number of partitions increases and is equal to the ...
1
vote
1
answer
66
views
Looking to process 1d linear interpolation on a 3D gridded dataset
This is a follow-up question to an earlier question: Implementing 1D interpolation on a 3D Array in Numpy or Xarray
Tsoil is a 3D xarray dataset with the following dimensions:
<xarray.DataArray '...
0
votes
0
answers
74
views
In Dask, the from_delayed method returns a scalar instead of a dataframe
I have the following problem: I have a list of delayed objects after applying the following code (see below):
When I apply
ddf = dd.from_delayed(lazy_results_names)
instead of a dask dataframe I ...
0
votes
0
answers
118
views
Is there a way to analyze why a dask worker was killed?
I have ~30GB of uncompressed spatial data; it contains id, tags, and coordinates as three columns in a parquet file with row group size 64MB.
I used dask read_parquet with block_size 32MiB and got 118 ...
0
votes
2
answers
281
views
Reading multiple csv.gz files into a dask dataframe
I have multiple .csv.gz files which I'm trying to read into a dask dataframe. I was able to achieve this using this code:
file_paths = glob.glob(file_pattern)
@delayed
def read_csv(file_paths):
...
0
votes
0
answers
146
views
How to tackle Dask unmanaged memory in Windows OS when using delayed functions?
I have the below traditional Python function, without any array-type flavour, but which I need to run many times. Hence, I used Dask-parallelization using dask.delayed. However, I can see a gradual ...
0
votes
1
answer
266
views
compute() command doesn't work on dask series in python
I'm trying to compute pairwise ratios on large-scale data where each column is a separate sample, like this (this is a small example):
0 1 2
0 34.04 56.55 49....
1
vote
0
answers
385
views
Disable the "nanny" when running Dask SSHCluster
Consider an SSHCluster with multiple hosts.
cluster = SSHCluster(["localhost", "hostname"],
connect_options={"known_hosts": None},
...