dask

Parse error when importing csv dataframe with dask and pandas

Parse error when importing csv dataframe with dask and pandas Question: I am trying to import a very large .csv file as: import dask.dataframe as dd import pandas as pd #TO DO dd_subf1_small = dd.read_csv('subf1_small.csv', dtype={'Unnamed: 0': 'float64', 'oecd_subfield': 'object', 'paperid': 'object'}, sep=None, engine='python').persist() but I am getting the following error: ParserError Traceback (most recent call …
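
A common workaround (a sketch, not the asker's confirmed fix) is to drop the delimiter sniffing: sep=None with engine='python' makes pandas guess the separator, which can fail partway through a large file. Passing an explicit separator, and optionally skipping malformed rows, often gets past the ParserError:

import dask.dataframe as dd

# Sketch: assumes the file is actually comma-delimited; on_bad_lines is
# forwarded to pandas.read_csv (pandas >= 1.3) and skips unparsable rows.
dd_subf1_small = dd.read_csv(
    'subf1_small.csv',
    dtype={'Unnamed: 0': 'float64', 'oecd_subfield': 'object', 'paperid': 'object'},
    sep=',',
    on_bad_lines='skip',
).persist()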

Total answers: 1

dask-ml preprocessing raises AttributeError

dask-ml preprocessing raises AttributeError Question: I use Dask dataframe and dask-ml to manipulate my data. When I use the dask-ml MinMaxScaler, I get this error. Is there a way to prevent this error and make it work? import dask.dataframe as dd from dask_ml.preprocessing import MinMaxScaler df = dd.read_csv('path to csv', parse_dates=['CREATED_AT'], dtype={'ODI_UPDATED_AT': 'object'}) scaler …
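
One likely cause (an assumption, since the excerpt cuts off before the traceback) is that the scaler is being fit on non-numeric columns such as the parsed CREATED_AT datetimes. A minimal sketch that scales only the numeric columns:

import dask.dataframe as dd
from dask_ml.preprocessing import MinMaxScaler

df = dd.read_csv('path to csv', parse_dates=['CREATED_AT'],
                 dtype={'ODI_UPDATED_AT': 'object'})
numeric = df.select_dtypes(include='number')  # drop datetime/object columns
scaler = MinMaxScaler()
scaled = scaler.fit_transform(numeric)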

Total answers: 1

Disable pure function assumption in dask distributed

Disable pure function assumption in dask distributed Question: The Dask distributed library documentation says: By default, distributed assumes that all functions are pure. […] The scheduler avoids redundant computations. If the result is already in memory from a previous call then that old result will be used rather than recomputing it. When benchmarking function runtimes, …
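
The distributed API does expose a switch for this: client.submit and client.map accept pure=False, which gives each call a unique key so nothing is served from cache. A minimal sketch:

from dask.distributed import Client

client = Client()

def f(x):
    return x * 2

# With pure=False the second call is actually recomputed instead of
# reusing the in-memory result of the first identical call.
fut1 = client.submit(f, 1, pure=False)
fut2 = client.submit(f, 1, pure=False)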

Total answers: 1

DASK: merge throws error when one side's key is NA whereas pd.merge works

DASK: merge throws error when one side's key is NA whereas pd.merge works Question: I have these sample dataframes: tdf1 = pd.DataFrame([{"id": 1, "val": 4}, {"id": 2, "val": 5}, {"id": 3, "val": 6}, {"id": pd.NA, "val": 7}, {"id": 4, "val": 8}]) tdf2 = pd.DataFrame([{"some_id": 1, "name": "Josh"}, {"some_id": 3, "name": "Jake"}]) pd.merge(tdf1, tdf2, how="left", left_on="id", …
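
One workaround (a sketch, not necessarily the accepted answer): rows whose join key is NA can never match anything in a left merge, so they can be split off, the merge run on the clean rows, and the NA rows concatenated back:

import pandas as pd
import dask.dataframe as dd

ddf1 = dd.from_pandas(tdf1, npartitions=1)  # tdf1/tdf2 as in the question
ddf2 = dd.from_pandas(tdf2, npartitions=1)

with_key = ddf1[ddf1["id"].notnull()]
no_key = ddf1[ddf1["id"].isnull()]  # kept aside; name column stays NaN
merged = dd.merge(with_key, ddf2, how="left", left_on="id", right_on="some_id")
result = dd.concat([merged, no_key])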

Total answers: 1

Merging many to many Dask

Merging many to many Dask Question: Say I have the following databases (suppose they are Dask dataframes): df A = 1 1 2 2 2 2 3 4 5 5 5 5 5 5 df B = 1 2 2 3 3 3 4 5 5 5 and I would like to merge the …
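
For reference, dask.dataframe.merge handles many-to-many joins the same way pandas does: a key appearing m times on one side and n times on the other produces m * n rows. A minimal sketch with the values from the question (the column name "key" is made up for illustration):

import pandas as pd
import dask.dataframe as dd

A = dd.from_pandas(pd.DataFrame({"key": [1, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5]}),
                   npartitions=2)
B = dd.from_pandas(pd.DataFrame({"key": [1, 2, 2, 3, 3, 3, 4, 5, 5, 5]}),
                   npartitions=2)
merged = dd.merge(A, B, on="key", how="inner")  # e.g. key 5: 6 * 3 = 18 rows
print(merged.compute())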

Total answers: 1

How to store data from dask.distributed on disk?

How to store data from dask.distributed on disk? Question: I'm trying to scale my computations from local Dask Arrays to Dask Distributed. Unfortunately, I am new to distributed computing, so I could not adapt the answer here for my purpose. My main problem is saving data from distributed computations back to an in-memory Zarr array …
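
A sketch of the usual pattern, assuming the goal is persisting a distributed array to disk: dask.array.to_zarr writes each chunk directly from the workers, sidestepping the in-memory Zarr store that exists only on the client (requires the zarr package and a path all workers can reach):

import dask.array as da

x = da.random.random((20000, 20000), chunks=(2000, 2000))
result = (x + x.T).sum(axis=0)
da.to_zarr(result, 'result.zarr')  # each worker writes its own chunks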

Total answers: 1

How to control python dask's number of threads per worker in linux?

How to control python dask's number of threads per worker in linux? Question: I tried to use a dask LocalCluster in a multiprocess, single-thread-per-process setup on Linux, but have failed so far: from dask.distributed import LocalCluster, Client, progress def do_work(): while True: pass return if __name__ == '__main__': cluster = LocalCluster(n_workers=2, processes=True, threads_per_worker=1) client …
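
For reference, the constructor arguments shown are the documented way to get one thread per worker process; a minimal sketch of a working setup (the __main__ guard is required because processes=True spawns subprocesses):

from dask.distributed import LocalCluster, Client

if __name__ == '__main__':
    cluster = LocalCluster(n_workers=2, processes=True, threads_per_worker=1)
    client = Client(cluster)
    print(client)  # should report 2 workers with 1 thread each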

Total answers: 1

Fill NaNs with per-column max in dask dataframe

Fill NaNs with per-column max in dask dataframe Question: I need to impute the maximum value of each column wherever a cell in the dataframe is np.nan. Unfortunately, this strategy is not supported by SimpleImputer according to the documentation: https://ml.dask.org/modules/generated/dask_ml.impute.SimpleImputer.html https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html So I'm trying to do this manually with fillna. This is my attempt: df …
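
A minimal sketch of the fillna route: compute each column's maximum into a concrete dict first, then let fillna broadcast it per column (the sample data is made up for illustration):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [None, 5.0, 6.0]})
df = dd.from_pandas(pdf, npartitions=2)

col_max = df.max().compute().to_dict()  # {'a': 3.0, 'b': 6.0}
filled = df.fillna(col_max)             # each NaN becomes its column's max
print(filled.compute())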

Total answers: 1

Create a Dataframe in Dask

Create a Dataframe in Dask Question: I'm just starting to use Dask as a possible replacement (?) for pandas. The first thing that hit me is that I can't seem to find a way to create a dataframe from a couple of lists/arrays. In regular pandas I just do: pd.DataFrame({'a': a, 'b': b, …}) but I can't find an equivalent way …
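
One straightforward route (a sketch, not the only option): build the frame in pandas exactly as usual, then hand it to Dask with from_pandas:

import pandas as pd
import dask.dataframe as dd

a = [1, 2, 3, 4]
b = ['w', 'x', 'y', 'z']
df = dd.from_pandas(pd.DataFrame({'a': a, 'b': b}), npartitions=2)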

Total answers: 1

Create multiple CSVs in Dask with different names

Create multiple CSVs in Dask with different names Question: I am using dask.dataframe.to_csv to write CSVs. I was expecting 13 CSVs, but it writes only 2 (overwriting the old ones, which I do not want): for i in range(13): df_temp=df1.groupby("temp").get_group(unique_cases[i]) df_temp.to_csv(path_store_csv+"*.csv") I also tried this but it did not work: for i …
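
The overwriting happens because every iteration writes to the same glob pattern. A sketch of one fix, assuming df1, unique_cases and path_store_csv are as in the question: give each group its own filename, with single_file=True so each case produces exactly one CSV:

for i in range(13):
    df_temp = df1.groupby("temp").get_group(unique_cases[i])
    # distinct name per group, so iterations no longer overwrite each other
    df_temp.to_csv(path_store_csv + f"case_{i}.csv", single_file=True)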

Total answers: 1