TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Question:
I have a big dataframe and I try to split it into chunks, process each one, and then concatenate them back together.
I use
df2 = pd.read_csv('et_users.csv', header=None, names=names2, chunksize=100000)
for chunk in df2:
    chunk['ID'] = chunk.ID.map(rep.set_index('member_id')['panel_mm_id'])
df2 = pd.concat(chunk, ignore_index=True)
But it returns an error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
How can I fix that?
Answers:
IIUC you want the following:
df2 = pd.read_csv('et_users.csv', header=None, names=names2, chunksize=100000)
chunks = []
for chunk in df2:
    chunk['ID'] = chunk.ID.map(rep.set_index('member_id')['panel_mm_id'])
    chunks.append(chunk)
df2 = pd.concat(chunks, ignore_index=True)
You need to append each chunk to a list and then use concat to concatenate them all. Also, I think the ignore_index may not be necessary, but I may be wrong.
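For reference, here is a self-contained sketch of that list-then-concat pattern. The in-memory CSV and the `rep` mapping table are made up to stand in for the question's files; everything else mirrors the answer above.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for 'et_users.csv'.
csv_data = io.StringIO("1,a\n2,b\n3,c\n4,d\n")
names2 = ['ID', 'val']

# Hypothetical mapping table standing in for `rep`.
rep = pd.DataFrame({'member_id': [1, 2, 3, 4],
                    'panel_mm_id': [10, 20, 30, 40]})

reader = pd.read_csv(csv_data, header=None, names=names2, chunksize=2)
chunks = []
for chunk in reader:
    # Remap each chunk's ID column, then collect the chunk in a list.
    chunk['ID'] = chunk.ID.map(rep.set_index('member_id')['panel_mm_id'])
    chunks.append(chunk)

# Concatenate the LIST of chunks, not a single chunk.
df2 = pd.concat(chunks, ignore_index=True)
print(df2)
```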
I was getting the same issue, and just realised that we have to pass the (multiple!) dataframes as a LIST in the first argument instead of as multiple arguments!
Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html
a = pd.DataFrame()
b = pd.DataFrame()
c = pd.concat(a, b)    # errors out:
# TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
c = pd.concat([a, b])  # works
If the processing doesn’t require ALL the data to be present, then there is no reason to keep saving all the chunks to an external list and process everything only after the chunking loop is over: that defeats the whole purpose of chunking. We use chunksize because we want to do the processing at each chunk and free up the memory for the next chunk.
In terms of OP’s code, they need to create another empty dataframe and concat the chunks into there.
df3 = pd.DataFrame()  # create empty df for collecting chunks
df2 = pd.read_csv('et_users.csv', header=None, names=names2, chunksize=100000)
for chunk in df2:
    chunk['ID'] = chunk.ID.map(rep.set_index('member_id')['panel_mm_id'])
    df3 = pd.concat([df3, chunk], ignore_index=True)
print(df3)
However, I’d like to reiterate that chunking was invented precisely to avoid building up all the rows of the entire CSV into a single DataFrame, as that is what causes out-of-memory errors when dealing with large CSVs. We don’t want to just shift the error down the road from the pd.read_csv() line to the pd.concat() line. We need to craft ways to finish the bulk of our data processing inside the chunking loop. In my own use case I’m eliminating most of the rows with a df query and concatenating only the few required rows, so the final df is much smaller than the original CSV.
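A minimal sketch of that filter-inside-the-loop idea (the column names and filter condition here are invented for illustration): only the rows that pass the query survive each chunk, so memory stays bounded by the filtered result rather than the full CSV.

```python
import io
import pandas as pd

# Hypothetical CSV with a 'score' column; we keep only rows with score >= 50.
csv_data = io.StringIO("id,score\n1,10\n2,80\n3,55\n4,20\n5,90\n")

kept = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Filter inside the loop so only the needed rows are retained per chunk.
    kept.append(chunk.query('score >= 50'))

# The final frame holds just the filtered rows, much smaller than the source.
small_df = pd.concat(kept, ignore_index=True)
print(small_df)
```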
The last line must be in the following format:
df2=pd.concat([df1,df2,df3,df4,...], ignore_index=True)
The point is that the dataframes to be concatenated need to be passed as a list/tuple.
As others have said, you need to pass it in as a list. Also, it may help to make sure it’s a DataFrame before using concat.
i.e.
chunks = pd.DataFrame(chunks)
df2 = pd.concat([chunks], ignore_index=True)
finalexcelsheet = pd.DataFrame()
for file in filenames:
    df = pd.read_excel(file, sheet_name='DL PRB')
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead.
    finalexcelsheet = pd.concat([finalexcelsheet, df], ignore_index=True)
# finalexcelsheet now contains the data from all the files.
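The same multi-file loop can also collect the frames in a list and concatenate once at the end, which avoids re-copying the accumulated frame on every iteration. A minimal sketch, with in-memory CSV buffers standing in for the Excel files (the filenames and sheet name in the answer above are specific to that setup):

```python
import io
import pandas as pd

# Hypothetical in-memory "files" standing in for the Excel workbooks;
# each holds one file's worth of rows with the same columns.
files = [io.StringIO("a,b\n1,2\n"), io.StringIO("a,b\n3,4\n")]

frames = []  # collect each file's frame, concatenate once at the end
for f in files:
    frames.append(pd.read_csv(f))

final = pd.concat(frames, ignore_index=True)
print(final)
```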