Initializing an empty DataFrame and appending rows
Question:
Unlike creating an empty dataframe and populating rows later, I have many, many dataframes that need to be concatenated.
If there were only two dataframes, I could do this:
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df1.append(df2, ignore_index=True)
Imagine I have millions of dataframes that need to be appended/concatenated, one each time I read a new file into a DataFrame object.
But when I tried to initialize an empty dataframe and then add the new dataframes in a loop:
import os
import pandas as pd
alldf = pd.DataFrame(columns=list('AB'))
for filename in os.listdir(indir):
    df = pd.read_csv(indir+filename, delimiter=' ')
    alldf.append(df, ignore_index=True)
This would return an empty alldf with only the header row, e.g.
alldf = pd.DataFrame(columns=list('AB'))
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
for df in [df1, df2]:
    alldf.append(df, ignore_index=True)
Answers:
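From @DSM's comment: DataFrame.append returns a new dataframe rather than modifying alldf in place, so the loop above discards every result. Collecting the dataframes in a list and concatenating once works: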
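import os
import pandas as pd
dfs = []
for filename in os.listdir(indir):
    # read each space-delimited file and keep the dataframe in a list
    df = pd.read_csv(os.path.join(indir, filename), delimiter=' ')
    dfs.append(df)
# concatenate once, after the loop
alldf = pd.concat(dfs)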
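For a version that runs without any files on disk, here is a minimal sketch using the two small dataframes from the question. It assumes a recent pandas (DataFrame.append was deprecated in 1.4 and removed in 2.0, so pd.concat is the way to do this there). The ignore_index=True argument is an addition here, to renumber the rows 0..3; the snippet above does not pass it.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# Collect the dataframes in a plain Python list first.
dfs = []
for df in [df1, df2]:
    dfs.append(df)  # list.append mutates the list in place

# Concatenate once at the end instead of growing a dataframe in the loop.
alldf = pd.concat(dfs, ignore_index=True)
print(alldf)
```

Concatenating once also avoids copying the accumulated data on every iteration, which matters when there are many files.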
pd.concat() over a list of dataframes is probably the way to go, especially for clean CSVs. But if you suspect your CSVs are dirty, or that read_csv() could infer different types for the same column in different files, you may want to explicitly create each dataframe in a loop.
You can initialize a dataframe for the first file, and then start each subsequent file with an empty dataframe based on the first.
df2 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
This takes the structure of dataframe df1 (its column and row labels) but none of its data, and creates df2, whose cells are all NaN. If you want to force data types on columns, you can do it to df1 when it is created, before its structure is copied.
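One way to sketch the idea of fixing column types up front is shown below. It is an illustration only: the dtypes mapping and the in-memory CSV text standing in for files are assumptions, not part of the answer above. It also uses the dtype argument of read_csv as a shortcut, rather than building each dataframe by hand from the structure of the first.

```python
import io
import pandas as pd

# Hypothetical schema: force both columns to float64 in every file.
dtypes = {'A': 'float64', 'B': 'float64'}

# Stand-ins for files on disk (space-delimited, as in the question).
fake_files = ["A B\n1 2\n3 4\n", "A B\n5.5 6\n7 8\n"]

dfs = []
for text in fake_files:
    # dtype= makes every file parse to the same column types,
    # even if one file holds only integers and another holds decimals.
    df = pd.read_csv(io.StringIO(text), delimiter=' ', dtype=dtypes)
    dfs.append(df)

alldf = pd.concat(dfs, ignore_index=True)
print(alldf.dtypes)
```

With the types pinned this way, the concatenated result does not depend on what read_csv would have inferred for each individual file.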