Why the column order is changing while appending pandas dataframes?
Question:
I want to append (merge) all the csv files in a folder using Python pandas.
For example: Say folder has two csv files test1.csv
and test2.csv
as follows:
A_Id P_Id CN1 CN2 CN3
AAA 111 702 709 740
BBB 222 1727 1734 1778
and
A_Id P_Id CN1 CN2 CN3
CCC 333 710 750 750
DDD 444 180 734 778
So the python script I wrote was as follows:
#!/usr/bin/python
import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("testfolder/*.csv"):
df = pd.read_csv(f)
all_data = all_data.append(df)
all_data.to_csv('testfolder/combined.csv')
Though the combined.csv
seems to have all the appended rows, it looks as follows:
CN1 CN2 CN3 A_Id P_Id
0 710 750 750 CCC 333
1 180 734 778 DDD 444
0 702 709 740 AAA 111
1 1727 1734 1778 BBB 222
Where as it should look like this:
A_ID P_Id CN1 CN2 CN2
AAA 111 702 709 740
BBB 222 1727 1734 1778
CCC 333 110 356 123
DDD 444 220 256 223
- Why are the first two columns moved to the end?
- Why is it appending in the first line rather than at the last line?
What am I missing? And how can I get get of 0s and 1s in the first column?
P.S: Since these are large csv files, I thought of using pandas.
Answers:
I tweaked the code as below. Comments in-line.
#!/usr/bin/python
import pandas as pd
import glob
# Grab all the csv files in the folder to a list.
fileList = glob.glob('input_folder/*.csv')
#Initialize an empty dataframe to grab the csv content.
all_data = pd.DataFrame()
#Initialize an empty list to grab the dataframes.
dfList= []
for files in fileList:
df = pd.read_csv(files, index_col = None, header= False)
dfList.append(df)
#The frames will be in reverse order i.e last read file's content in the begining. So reverse it again
Reversed_dfList = dfList[::-1]
CombinedFrame = pd.concat(Reversed_dfList)
# The "Combined.csv" file will have combination of all the files.
CombinedFrame.to_csv('output_folder/Combined.csv', index=False)
Try this …..
all_data = all_data.append(df)[df.columns.tolist()]
I had the same issue and it was painfull. I managed to solve it by reorganising columns based on source dataframe after it was appended to final dataframe. It would look like this:
#!/usr/bin/python
import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("testfolder/*.csv"):
df = pd.read_csv(f)
all_data = all_data.append(df)
all_data = all_data[df.columns]
all_data.to_csv('testfolder/combined.csv')
Since your issue was from almost two years ago, I’m posting solution which worked for me for enyone else who will also face similar issue.
You can use reindex to change to the original order:
all_data = all_data.append(df)
all_data = all_data.reindex(df.columns, axis=1)
I saw this here (more details in the link): https://github.com/pandas-dev/pandas/issues/4588#issuecomment-44421883
Starting from version 0.23.0, you can prevent the append()
method to sort the final appended DataFrame. In your case:
all_data = all_data.append(df, sort=False)
all_data = all_data.append(df, ignore_index=True)
Put ignore_index=True so the order will not change.
By default is False
so you need to change that.
Please try:
df = df.append(pd.DataFrame(Added_rows).T)
if you create the dataframe and you set the columns the way you want then append will comply to this
df = pd.DataFrame(columns = ['d','b','a','c'])
df = df.append({'d': name,
'b': h,
'a': w,
'c': normal_size
}
, ignore_index=True,sort=False)
I want to append (merge) all the csv files in a folder using Python pandas.
For example: Say folder has two csv files test1.csv
and test2.csv
as follows:
A_Id P_Id CN1 CN2 CN3
AAA 111 702 709 740
BBB 222 1727 1734 1778
and
A_Id P_Id CN1 CN2 CN3
CCC 333 710 750 750
DDD 444 180 734 778
So the python script I wrote was as follows:
#!/usr/bin/python
import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("testfolder/*.csv"):
df = pd.read_csv(f)
all_data = all_data.append(df)
all_data.to_csv('testfolder/combined.csv')
Though the combined.csv
seems to have all the appended rows, it looks as follows:
CN1 CN2 CN3 A_Id P_Id
0 710 750 750 CCC 333
1 180 734 778 DDD 444
0 702 709 740 AAA 111
1 1727 1734 1778 BBB 222
Where as it should look like this:
A_ID P_Id CN1 CN2 CN2
AAA 111 702 709 740
BBB 222 1727 1734 1778
CCC 333 110 356 123
DDD 444 220 256 223
- Why are the first two columns moved to the end?
- Why is it appending in the first line rather than at the last line?
What am I missing? And how can I get get of 0s and 1s in the first column?
P.S: Since these are large csv files, I thought of using pandas.
I tweaked the code as below. Comments in-line.
#!/usr/bin/python
import pandas as pd
import glob
# Grab all the csv files in the folder to a list.
fileList = glob.glob('input_folder/*.csv')
#Initialize an empty dataframe to grab the csv content.
all_data = pd.DataFrame()
#Initialize an empty list to grab the dataframes.
dfList= []
for files in fileList:
df = pd.read_csv(files, index_col = None, header= False)
dfList.append(df)
#The frames will be in reverse order i.e last read file's content in the begining. So reverse it again
Reversed_dfList = dfList[::-1]
CombinedFrame = pd.concat(Reversed_dfList)
# The "Combined.csv" file will have combination of all the files.
CombinedFrame.to_csv('output_folder/Combined.csv', index=False)
Try this …..
all_data = all_data.append(df)[df.columns.tolist()]
I had the same issue and it was painfull. I managed to solve it by reorganising columns based on source dataframe after it was appended to final dataframe. It would look like this:
#!/usr/bin/python
import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("testfolder/*.csv"):
df = pd.read_csv(f)
all_data = all_data.append(df)
all_data = all_data[df.columns]
all_data.to_csv('testfolder/combined.csv')
Since your issue was from almost two years ago, I’m posting solution which worked for me for enyone else who will also face similar issue.
You can use reindex to change to the original order:
all_data = all_data.append(df)
all_data = all_data.reindex(df.columns, axis=1)
I saw this here (more details in the link): https://github.com/pandas-dev/pandas/issues/4588#issuecomment-44421883
Starting from version 0.23.0, you can prevent the append()
method to sort the final appended DataFrame. In your case:
all_data = all_data.append(df, sort=False)
all_data = all_data.append(df, ignore_index=True)
Put ignore_index=True so the order will not change.
By default is False
so you need to change that.
Please try:
df = df.append(pd.DataFrame(Added_rows).T)
if you create the dataframe and you set the columns the way you want then append will comply to this
df = pd.DataFrame(columns = ['d','b','a','c'])
df = df.append({'d': name,
'b': h,
'a': w,
'c': normal_size
}
, ignore_index=True,sort=False)