dask-dataframe

order of metadata in dask groupby apply

order of metadata in dask groupby apply Question: In dask I am getting the error: "ValueError: The columns in the computed data do not match the columns in the provided metadata Order of columns does not match" This does not make sense to me as I do provide metadata that is correct. It is not …

Total answers: 1

does dask compute store results?

does dask compute store results? Question: Consider the following code import dask import dask.dataframe as dd import pandas as pd data_dict = {'data1':[1,2,3,4,5,6,7,8,9,10]} df_pd = pd.DataFrame(data_dict) df_dask = dd.from_pandas(df_pd,npartitions=2) df_dask['data1x2'] = df_dask['data1'].apply(lambda x:2*x,meta=('data1x2','int64')).compute() print('-'*80) print(df_dask['data1x2']) print('-'*80) print(df_dask['data1x2'].compute()) print('-'*80) What I can’t figure out is: why is there a difference between the output of the first …

Total answers: 2

Operating large .csv file with pandas/dask Python

Operating large .csv file with pandas/dask Python Question: I’ve got a large .csv file (5GB) from UK land registry. I need to find all real estate that has been bought/sold two or more times. Each row of the table looks like this: {F887F88E-7D15-4415-804E-52EAC2F10958},"70000","1995-07-07 00:00","MK15 9HP","D","N","F","31","","ALDRICH DRIVE","WILLEN","MILTON KEYNES","MILTON KEYNES","MILTON KEYNES","A","A" I’ve never used pandas or any …

Total answers: 1

Dask – length mismatch when querying

Dask – length mismatch when querying Question: I am trying to import a lot of CSVs into a single dataframe and would like to filter the data after a specific date. It’s throwing the below error; not sure what’s wrong. Is it because there is a mismatch in columns? If yes, is there a way to read …

Total answers: 1

Get column value after searching for row in dask

Get column value after searching for row in dask Question: I have a pandas dataframe that I converted to a dask dataframe using the from_pandas function of dask. It has 3 columns namely col1, col2 and col3. Now I am searching for a specific row using daskdf[(daskdf.col1 == v1) & (daskdf.col2 == v2)] where v1 …

Total answers: 1

Dask Dataframe nunique operation: Worker running out of memory (MRE)

Dask Dataframe nunique operation: Worker running out of memory (MRE) Question: tl;dr I want to dd.read_parquet('*.parq')['column'].nunique().compute() but I get WARNING – Worker exceeded 95% memory budget. Restarting a couple of times before the workers get killed altogether. Long version I have a dataset with 10 billion rows, ~20 columns, and a single machine with around …

Total answers: 1

Ways of Creating List from Dask dataframe column

Ways of Creating List from Dask dataframe column Question: I want to create a list/set from a Dask Dataframe column. Basically, I want to use this list to filter rows in another dataframe by matching values with a column in this dataframe. I have tried using list(df[column]) and set(df[column]) but it takes a lot of time and …

Total answers: 2

Dask crashing when saving to file?

Dask crashing when saving to file? Question: I’m trying to one-hot encode a dataset, then group by a specific column so I can get one row for each item in that column with an aggregated view of which one-hot columns are true for that row. It seems to be working on small data and …

Total answers: 1

How to create unique index in Dask DataFrame?

How to create unique index in Dask DataFrame? Question: Imagine I have a Dask DataFrame from read_csv or created another way. How can I make a unique index for the dask dataframe? Note: reset_index builds a monotonically ascending index in each partition. That means (0,1,2,3,4,5,… ) for Partition 1, (0,1,2,3,4,5,… ) for Partition 2, (0,1,2,3,4,5,… …

Total answers: 2