dask

importing dask_cuda results in parse_memory_limit error

Question: I'm trying to import dask_cuda as in the examples: from dask_cuda import LocalCUDACluster from dask.distributed import Client But I receive the following error: ImportError Traceback (most recent call last) Input In [3], in <cell line: 1>() ----> 1 from dask_cuda import LocalCUDACluster File ~/miniconda3/lib/python3.8/site-packages/dask_cuda/__init__.py:5, in <module> 2 import dask.dataframe.shuffle …

Total answers: 1
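An `ImportError` like this from `dask_cuda`'s own `__init__` usually points to a version mismatch between dask-cuda and dask/distributed (internal helpers such as `parse_memory_limit` have moved between distributed releases). A hedged first diagnostic step, using only the standard library, is to compare the installed versions side by side; the package names below are the real PyPI names, but the helper function itself is just a sketch:

```python
from importlib.metadata import version, PackageNotFoundError

def installed_versions(packages=("dask", "distributed", "dask-cuda")):
    """Return {package: version string or None} for a quick compatibility check."""
    found = {}
    for pkg in packages:
        try:
            found[pkg] = version(pkg)
        except PackageNotFoundError:
            # Package not installed in this environment
            found[pkg] = None
    return found

print(installed_versions())
```

If the reported versions are far apart, upgrading all three together (e.g. from the same conda channel or pip index snapshot) is the usual fix.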

Defining `__iter__` method for a dask actor?

Question: Is it possible for a dask actor to have an __iter__ method as defined by a class? Consider this example adapted from the docs: class Counter: """A simple class to manage an incrementing counter""" def __init__(self): self.n = 0 def increment(self): self.n += 1 return self.n def …

Total answers: 1
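The class itself can of course define `__iter__` — the catch is on the proxy side: a dask actor handle forwards ordinary method calls (returning `ActorFuture`s), but dunder methods like `__iter__` are generally not forwarded, so an explicitly named method is the safer pattern. A plain-Python sketch of the docs' `Counter`, extended with both styles (the `values` method name is my own choice, not from the question):

```python
class Counter:
    """The Counter from the dask docs, extended with iteration support."""
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

    def __iter__(self):
        # Works locally, but on an actor proxy this dunder is typically
        # not exposed -- the proxy only wraps regular attribute calls.
        return iter(range(1, self.n + 1))

    def values(self):
        # Explicit method: callable through an actor proxy as a normal
        # remote method, returning the same sequence as __iter__.
        return list(range(1, self.n + 1))

c = Counter()
c.increment()
c.increment()
print(list(c))      # [1, 2]
print(c.values())   # [1, 2]
```

With a real actor (`client.submit(Counter, actor=True)`), you would call `counter.values().result()` rather than iterating the proxy directly.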

AWS instance/controller node randomly unable to find files on FSX that is there

Question: This is a sporadic issue that I could not find a way to reliably reproduce. The gist of the issue is that the instance/controller node will randomly fail to find files that have already been created on Amazon FSx. A sample script can …

Total answers: 1
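When a shared filesystem occasionally reports a just-written file as missing, a common mitigation (independent of the root cause, which the truncated question does not reveal) is to poll with a timeout before treating the file as absent. A minimal stdlib sketch of that pattern — the function name and defaults are my own:

```python
import os
import time

def wait_for_file(path, timeout=30.0, interval=0.5):
    """Poll until `path` exists; return True if found within `timeout` seconds.

    A workaround for shared filesystems where metadata can lag briefly
    behind a write from another node.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(interval)
    # One last check after the deadline passes
    return os.path.exists(path)
```

This only papers over the symptom; if the files are written by another node, also confirm the writer flushes and closes them before the controller is told they are ready.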

Creating and Merging Multiple Datasets Does Not Fit Into Memory, Use Dask?

Question: I'm not quite sure how to ask this question, but I need some clarification on how to make use of Dask's ability to "handle datasets that don't fit into memory", because I'm a little confused about how it works from the CREATION …

Total answers: 1
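The core idea behind "datasets that don't fit into memory" is that the data is never materialized whole: it is streamed through in bounded-size pieces, and only the running result stays in memory. dask.dataframe does this per partition; the same idea can be sketched with nothing but the stdlib (the file layout and column name here are invented for the demo):

```python
import csv
import os
import tempfile

def sum_column_chunked(path, column, chunksize=1000):
    """Stream a CSV and sum one column, holding at most `chunksize`
    values in memory at a time -- the out-of-core pattern that
    dask.dataframe applies per partition."""
    total = 0.0
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(float(row[column]))
            if len(chunk) >= chunksize:
                total += sum(chunk)
                chunk = []  # release the chunk before reading more
        total += sum(chunk)  # leftover rows smaller than one chunk
    return total

# Tiny demo with a temporary file
fd, tmp = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["value"])
    w.writerows([[i] for i in range(1, 6)])
print(sum_column_chunked(tmp, "value", chunksize=2))  # 15.0
os.remove(tmp)
```

In dask the equivalent is `dd.read_csv(...)["value"].sum().compute()`: the graph describes the chunked computation, and no partition is loaded until needed.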

randomly accessing a row of Dask dataframe is taking a long time

Question: I have a Dask dataframe of 100 million rows of data. I am trying to iterate over this dataframe without loading it entirely into RAM. As an experiment, I am trying to access the row with index equal to 1. %time dask_df.loc[1].compute() The time …

Total answers: 4
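The usual reason `.loc[1]` is slow is unknown divisions: without sorted division boundaries (e.g. from `set_index`), dask cannot tell which partition holds a given index and must search all of them. When divisions are known, the lookup is a single binary search over the boundaries. A stdlib sketch of that lookup (the function is illustrative, not dask's internal code; `divisions` mirrors the shape of `df.divisions`, with npartitions + 1 entries):

```python
from bisect import bisect_right

def partition_for(divisions, idx):
    """Return the partition number that holds index `idx`, given sorted
    division boundaries. This O(log n) lookup is only possible when
    divisions are known; otherwise .loc must scan every partition."""
    i = bisect_right(divisions, idx) - 1
    # Clamp so the last boundary maps into the final partition
    return min(max(i, 0), len(divisions) - 2)

print(partition_for([0, 100, 200, 300], 150))  # 1
```

So for fast random access, set a sorted index first (`dask_df = dask_df.set_index("id")`) so that divisions are computed, or convert the partition of interest to pandas.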

Dealing with huge pandas data frames

Question: I have a huge database (of 500GB or so) and was able to put it in pandas. The database contains something like 39705210 observations. As you can imagine, Python has a hard time even opening it. Now, I am trying to use Dask in order to export it to …

Total answers: 1
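If the frame barely fits (or doesn't fit) in memory, the export can be done in bounded-memory pieces instead of one giant object. A hedged sketch using pandas' own `chunksize` reader — the manual counterpart of what dask does per partition; file names and the chunk size are placeholders:

```python
import pandas as pd

def export_in_chunks(src, dst, chunksize=1_000_000):
    """Stream a huge CSV through pandas in fixed-size chunks, appending
    each chunk to the output so memory use stays bounded."""
    first = True
    for chunk in pd.read_csv(src, chunksize=chunksize):
        chunk.to_csv(
            dst,
            mode="w" if first else "a",  # overwrite once, then append
            header=first,                # write the header only once
            index=False,
        )
        first = False
```

The dask equivalent is `dd.read_csv(src).to_csv(...)`, which additionally parallelizes the chunks; either way, the key is never holding all 39,705,210 rows at once.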

How does dask know variable states before it runs map_partitions?

Question: In the dask code below I set x to 1 and 2 right before executing two map_partitions calls. The result seems fine, however I don't fully understand it. If dask waits to run the two map_partitions only when it finds the compute(), and at the …

Total answers: 1
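The short answer is that dask captures the *value* of any argument you pass to `map_partitions` at graph-construction time, not at `compute()` time: each call bakes the then-current `x` into its task. Plain Python shows the same capture-by-value vs capture-by-closure distinction without dask (this is an analogy, not dask's implementation):

```python
# Each appended task binds the value of x at append time via a default
# argument -- analogous to passing x into map_partitions, where it is
# embedded into the task graph immediately.
tasks = []
x = 1
tasks.append(lambda v=x: v * 10)   # v bound to 1 here
x = 2
tasks.append(lambda v=x: v * 10)   # v bound to 2 here

# A closure, by contrast, looks x up only when finally called,
# so it sees whatever x is at "compute" time.
late = lambda: x * 10

print([t() for t in tasks])  # [10, 20]
print(late())                # 20
```

So the two `map_partitions` results differ because the graphs already contain 1 and 2 respectively; `compute()` merely executes what was recorded.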

Replacing existing column in dask map_partitions gives SettingWithCopyWarning

Question: I'm replacing column id2 in a dask dataframe using map_partitions. The result is that the values are replaced, but with a pandas warning. What is this warning, and how do I apply the .loc suggestion in the example below? pdf = pd.DataFrame({ 'dummy2': [10, 10, 10, 20, …

Total answers: 1
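`SettingWithCopyWarning` fires when pandas suspects you are assigning into a view of another frame — which can happen inside `map_partitions`, since the partition may be a slice. The usual fix is to take an explicit copy and assign the column with `.loc`. A hedged sketch of such a partition function (the `id2` name is from the question; the doubling transformation is a placeholder, since the real one is truncated):

```python
import warnings

import pandas as pd

def replace_id2(pdf: pd.DataFrame) -> pd.DataFrame:
    """Partition function for map_partitions that follows the warning's
    .loc suggestion: operate on an explicit copy, then assign the whole
    column with .loc so pandas knows this is not a chained view."""
    pdf = pdf.copy()
    pdf.loc[:, "id2"] = pdf["id2"] * 2  # placeholder transformation
    return pdf

demo = pd.DataFrame({"id2": [1, 2, 3]})
with warnings.catch_warnings():
    warnings.simplefilter("error")  # fail loudly if any warning fires
    out = replace_id2(demo)
print(out["id2"].tolist())  # [2, 4, 6]
```

In dask this would be used as `ddf.map_partitions(replace_id2)`; returning the copied frame keeps each partition independent of the input slice.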

order of metadata in dask groupby apply

Question: In dask I am getting the error: "ValueError: The columns in the computed data do not match the columns in the provided metadata Order of columns does not match" This does not make sense to me, as I do provide metadata that is correct. It is not …

Total answers: 1
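Dask's meta check compares column *order*, not just names, so a group function that builds its result columns in a different order than the declared `meta` trips this `ValueError` even when every column is present. A defensive pattern is to reindex the result to the meta's order before returning it. A sketch with invented column names (`count`, `total`, `x` are placeholders, not from the truncated question):

```python
import pandas as pd

META_COLUMNS = ["count", "total"]  # the column order promised in `meta`

def per_group(g: pd.DataFrame) -> pd.DataFrame:
    """Group function for groupby(...).apply(...): however the frame is
    built internally, selecting META_COLUMNS at the end guarantees the
    output column order matches the declared meta exactly."""
    out = pd.DataFrame({"total": [g["x"].sum()], "count": [len(g)]})
    return out[META_COLUMNS]

demo = pd.DataFrame({"x": [1, 2, 3]})
print(list(per_group(demo).columns))  # ['count', 'total']
```

With this in place, `meta` can be declared as e.g. `{"count": "int64", "total": "int64"}` in the same order, and the computed data will always line up.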

does dask compute store results?

Question: Consider the following code import dask import dask.dataframe as dd import pandas as pd data_dict = {'data1':[1,2,3,4,5,6,7,8,9,10]} df_pd = pd.DataFrame(data_dict) df_dask = dd.from_pandas(df_pd,npartitions=2) df_dask['data1x2'] = df_dask['data1'].apply(lambda x:2*x,meta=('data1x2','int64')).compute() print('-'*80) print(df_dask['data1x2']) print('-'*80) print(df_dask['data1x2'].compute()) print('-'*80) What I can't figure out is: why is there a difference between the output of the first …

Total answers: 2
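The distinction behind this question is that a dask object is a recipe, not a result: `compute()` runs the graph and returns a concrete pandas object, but the lazy object itself stores nothing and will re-run the graph on the next `compute()`. A stdlib analogy (not dask's implementation) makes the behavior visible:

```python
class Deferred:
    """Toy stand-in for a lazy dask collection: holds a recipe, and
    compute() executes it fresh each time rather than caching."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0  # track how often the recipe actually runs

    def compute(self):
        self.calls += 1
        return self.fn()

d = Deferred(lambda: [2 * x for x in range(1, 6)])
first = d.compute()
second = d.compute()
print(first == second, d.calls)  # True 2  (same answer, computed twice)
```

So printing the lazy column shows a graph description, while printing `...compute()` shows materialized values; to keep a result around (or in the cluster's memory), assign the computed object to a variable, or use `persist()`.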