Initializing an empty DataFrame and appending rows

Question:

Unlike the case of creating an empty DataFrame and populating its rows later, I have a large number of DataFrames that need to be concatenated.

If there were only two DataFrames, I could do this:

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

df1.append(df2, ignore_index=True)

Imagine I have millions of DataFrames that need to be appended/concatenated, one each time I read a new file into a DataFrame object.

But when I tried to initialize an empty DataFrame and then add the new DataFrames in a loop:

import os
import pandas as pd
alldf = pd.DataFrame(columns=list('AB'))
for filename in os.listdir(indir):
    df = pd.read_csv(indir+filename, delimiter=' ')
    alldf.append(df, ignore_index=True)

This would return an empty alldf with only the header row, e.g.

alldf = pd.DataFrame(columns=list('AB'))
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
for df in [df1, df2]:
    alldf.append(df, ignore_index=True)
Asked By: alvas

||

Answers:

Based on @DSM's comment, this works:

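import os
import pandas as pd

# Collect each file's DataFrame in a list, then concatenate once at the end.
dfs = []
for filename in os.listdir(indir):
    df = pd.read_csv(os.path.join(indir, filename), delimiter=' ')
    dfs.append(df)

alldf = pd.concat(dfs)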
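
For the two-frame example from the question, the same pattern might look like the sketch below. The loop in the question left alldf empty because DataFrame.append returned a new frame instead of modifying alldf in place, and that return value was discarded (the method was later removed in pandas 2.0). The ignore_index=True argument is an addition here, not part of the snippet above; it is assumed you want a fresh 0..n-1 index, as in the question's append call.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# Collect the frames in a plain list first; list.append mutates in place.
dfs = []
for df in [df1, df2]:
    dfs.append(df)

# Concatenate once at the end; ignore_index=True renumbers the rows from 0.
alldf = pd.concat(dfs, ignore_index=True)
print(alldf)
```

Concatenating once at the end also avoids copying the accumulated data on every iteration, which matters when there are many files.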
Answered By: alvas

pd.concat() over a list of DataFrames is probably the way to go, especially for clean CSVs. But if you suspect your CSVs are dirty, or that read_csv() could infer different column types from one file to the next, you may want to explicitly create each DataFrame in a loop.

You can initialize a DataFrame from the first file, and then start each subsequent file with an empty DataFrame based on the first.

df2 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)

This creates df2 with the column and index labels of df1 but none of its data (every cell is NaN). If you want to force a data type on the columns, you can do so on df1 when it is created, before its structure is copied.
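
One way to force column types up front, as a rough sketch: read_csv accepts a dtype argument, so every file can be parsed with the same types before concatenation. The in-memory StringIO "files", the column names, and the choice of float64 below are assumptions made purely for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two space-delimited files on disk.
file_a = io.StringIO("A B\n1 2\n3 4\n")
file_b = io.StringIO("A B\n5 6.5\n7 8\n")

# Assumed schema: parse both columns as float64 in every file.
schema = {'A': 'float64', 'B': 'float64'}

dfs = [pd.read_csv(f, delimiter=' ', dtype=schema) for f in (file_a, file_b)]
alldf = pd.concat(dfs, ignore_index=True)

# A zero-row frame that keeps the columns and dtypes of an existing one.
empty_like = dfs[0].iloc[0:0]

print(alldf.dtypes)
print(empty_like.dtypes)
```

The iloc[0:0] slice is used for the empty frame because, as far as I can tell, constructing a DataFrame from columns= and index= alone copies the labels but not the column dtypes.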

Answered By: philshem