Dask – length mismatch when querying

Question:

I am trying to import a lot of CSVs into a single dataframe and would like to filter the data after a specific date.

It's throwing the error below and I'm not sure what's wrong.

Is it because there is a mismatch in columns? If so, is there a way to read all the CSVs and union them so that the dataframe has all the column names and doesn't raise the error below?

import pandas as pd
import dask.dataframe as dd

df = dd.read_csv('XXXXXXX*.csv', assume_missing=True)
df['time'] = df['time'].map(lambda x: pd.to_datetime(x, errors='coerce'))
filter_t = df[df['time'] >= '2020-11-21 21:22:19']
filter_t.head(npartitions=-1)

(error traceback attached as an image)

Asked By: HKE


Answers:

It's not clear from the question, but if there's a mismatch in columns, then using
dd.read_csv directly is not appropriate. One option is to write a custom delayed wrapper that enforces a specific column structure. Roughly, it would look like this:

# this is the list of columns that the final dataframe should contain
list_all_columns = ['a', 'b', 'c']

from glob import glob

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask import delayed

@delayed
def load_csv(f):
    df = pd.read_csv(f)
    # add any missing columns so every partition has the same schema
    for c in list_all_columns:
        if c not in df.columns:
            df[c] = np.nan
    # keep a consistent column order across partitions
    return df[list_all_columns]

ddf = dd.from_delayed([load_csv(f) for f in glob('x*csv')])

# the rest of your workflow continues        
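For example, the date filter from the question could then be applied to the combined dataframe (a minimal sketch; it assumes the combined CSVs contain a 'time' column, as in the question):

# parse the time column and filter, mirroring the question's workflow
ddf['time'] = dd.to_datetime(ddf['time'], errors='coerce')
filter_t = ddf[ddf['time'] >= '2020-11-21 21:22:19']
filter_t.head(npartitions=-1)
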
Answered By: SultanOrazbayev