How to compare two columns in Excel using Python?
Question:
I have two Excel files with the following fields:
file1
col1,col2,col3,col4,col5,col6,col7,col8,col9
server1,java_yes,….
server2,java_no,….
server4,java_no,….
server8,java_no,….
file2
col1,col2,col3,col4,col5,col6,col7,col8,col9
server1,java_yes,….
server3,java_no,….
server4,java_yes,….
server8,java_no,….
I want to
a. Iterate over file1
b. Compare each entry in col1 in file1 against col1 in file2
c. If it exists, I want to see if the value in file1->col2 matches the entry in file2->col2
d. If file1->col2 does not match file2->col2 then I want to update file1->col2 to equal file2->col2
Update
I'm running into a strange issue; here are the details.
It works fine for most entries, but for some it displays NaN even though the dataframe has java_yes in both places.
To figure this out, I added a filter and printed the result at various stages.
When I print it for df1, df2 and merged, it works fine.
When I print the same at the very end, it displays NaN for certain entries.
Very strange.
my_filter = (df1['col1'] == 'server1')
print(df1.loc[my_filter, 'col2'])
All prints except the last return
Yes
The very last print (for df1) returns
NaN
Answers:
Assuming you have a file called workbook.xlsx containing two sheets (i.e. sheet1 and sheet2), you can first load it like this:
import pandas as pd
df1 = pd.read_excel("..workbook.xlsx", sheet_name= "sheet1")
df2 = pd.read_excel("..workbook.xlsx", sheet_name= "sheet2")
Now df1 represents the first sheet and df2 the second.
You can iterate through df1 on the column "col1", check the condition, and update the data frame with this code:
for i in range(len(df1["col1"])):
    if (df1["col1"][i] == df2["col1"][i]) and (df1["col2"][i] != df2["col2"][i]):
        df1.at[i, "col2"] = df2["col2"][i]
However, this only compares values that sit on the same row number in both sheets.
If you need to check whether a Sheet1->col1 value exists anywhere in Sheet2->col1, you can use this loop instead:
import numpy as np

for i in range(len(df1["col1"])):
    if df1["col1"][i] in df2["col1"].values:
        # Position of the first row in df2 whose col1 matches
        j = np.where(df2["col1"] == df1["col1"][i])[0][0]
        df1.at[i, "col2"] = df2["col2"].iloc[j]
Finally, to store the result in a new Excel workbook, you can use:
with pd.ExcelWriter('New_result.xlsx') as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")
This updates Sheet1->col2 to match Sheet2->col2 wherever Sheet1->col1 == Sheet2->col1.
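For reference, here is a self-contained sketch of the lookup loop above, run on in-memory data instead of a workbook. The sample frames are made up from the question's example (only col1 and col2 are shown), and the dict lookup is a substitution for the np.where search; it assumes col1 values are unique in df2.

```python
import pandas as pd

# Sample data mirroring the question (only col1 and col2 shown).
df1 = pd.DataFrame({
    "col1": ["server1", "server2", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_no", "java_no"],
})
df2 = pd.DataFrame({
    "col1": ["server1", "server3", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_yes", "java_no"],
})

# Assumption: col1 values are unique in df2, so a plain dict is enough.
lookup = dict(zip(df2["col1"], df2["col2"]))

for idx, server in df1["col1"].items():
    # Only touch rows whose server also appears in df2 with a different col2.
    if server in lookup and df1.at[idx, "col2"] != lookup[server]:
        df1.at[idx, "col2"] = lookup[server]

print(df1)
```

Iterating with `.items()` uses df1's own index labels, so the loop does not depend on df1 having a default 0..n-1 index.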
You can achieve that using pandas:
First, read the files using pd.read_excel (or pd.read_csv):
import pandas as pd
df1 = pd.read_excel("path/to/file1.xlsx")
df2 = pd.read_excel("path/to/file2.xlsx")
From the example you provided, you should have something like this:
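df1

| | col1 | col2 |
|---|---|---|
| 0 | server1 | java_yes |
| 1 | server2 | java_no |
| 2 | server4 | java_no |
| 3 | server8 | java_no |

df2

| | col1 | col2 |
|---|---|---|
| 0 | server1 | java_yes |
| 1 | server3 | java_no |
| 2 | server4 | java_yes |
| 3 | server8 | java_no |
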
Now merge df2 into df1 on col1 in left mode, and overwrite df1["col2"] accordingly:
merged = df1.merge(df2, on="col1", how="left")
df1['col2'] = merged['col2_y'].fillna(merged['col2_x'])
The resulting df1 is:

| | col1 | col2 |
|---|---|---|
| 0 | server1 | java_yes |
| 1 | server2 | java_no |
| 2 | server4 | java_yes |
| 3 | server8 | java_no |

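The two lines above can be tried without any files. The sketch below builds the frames in memory and adds a made-up col3 to stand in for the remaining columns. Selecting only col1 and col2 from df2 before merging is a small addition of mine, so that the other shared columns are not duplicated with _x/_y suffixes.

```python
import pandas as pd

df1 = pd.DataFrame({
    "col1": ["server1", "server2", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_no", "java_no"],
    "col3": ["a", "b", "c", "d"],  # placeholder for the remaining columns
})
df2 = pd.DataFrame({
    "col1": ["server1", "server3", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_yes", "java_no"],
    "col3": ["w", "x", "y", "z"],  # placeholder, deliberately different
})

# Bring in only the key and the column to compare from df2.
merged = df1.merge(df2[["col1", "col2"]], on="col1", how="left")

# Prefer df2's value; fall back to df1's where the server is missing in df2.
df1["col2"] = merged["col2_y"].fillna(merged["col2_x"])

print(df1)
```

Only col2 is touched; col3 in df1 keeps its original values.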
EDIT: explaining the merge part
merged = df1.merge(df2, on="col1", how="left")
This line merges df2 on df1 based on the values in the "col1" column. how="left" specifies that we want to keep all col1 values from df1, even the ones that don't exist in df2. I'll let you check the DataFrame.merge doc for more details.
Columns with the same name in df1 and df2 are renamed with the default suffixes: _x and _y.
For the rows where the col1 value does not exist in df2, the values in the columns coming from df2 will be NaN.
Here is what merged looks like:

| | col1 | col2_x | col2_y |
|---|---|---|---|
| 0 | server1 | java_yes | java_yes |
| 1 | server2 | java_no | NaN |
| 2 | server4 | java_no | java_yes |
| 3 | server8 | java_no | java_no |

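If the _x/_y names are hard to read, merge also accepts the suffixes and indicator parameters. The following is a small sketch on made-up two-row frames; the suffix names _file1 and _file2 are my own choice, not something from the question.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["server1", "server2"],
                    "col2": ["java_yes", "java_no"]})
df2 = pd.DataFrame({"col1": ["server1", "server3"],
                    "col2": ["java_no", "java_no"]})

merged = df1.merge(
    df2,
    on="col1",
    how="left",
    suffixes=("_file1", "_file2"),  # replaces the default _x / _y
    indicator=True,                 # adds a _merge column: both / left_only
)

print(merged)
```

Rows marked left_only in the _merge column are exactly the ones where the df2 side is NaN.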
From here, we want the final col2 in df1 to be:
- col2_y (i.e. col2 from df2) when it's not NaN (i.e. when the col1 value was in df2),
- otherwise col2_x (i.e. col2 from df1).
In other words, we want col2_y after replacing all NaN values with the corresponding col2_x value. This is what the fillna statement does.
df1['col2'] = merged['col2_y'].fillna(merged['col2_x'])
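Note that this assignment aligns merged with df1 by index label, not by row position. One possible explanation for the unexpected NaN values described in the question's update is that df1 does not have the default 0..n-1 index, for example after rows were filtered out. This is an assumption; the question does not show enough to confirm it. The sketch below reproduces that situation on made-up data and shows a Series.map variant that keeps df1's own index. It assumes col1 is unique in df2.

```python
import pandas as pd

# df1 with a non-default index, e.g. left over from an earlier filter.
df1 = pd.DataFrame(
    {"col1": ["server1", "server4"], "col2": ["java_yes", "java_no"]},
    index=[10, 11],
)
df2 = pd.DataFrame({"col1": ["server1", "server4"],
                    "col2": ["java_yes", "java_yes"]})

# A merge on a column returns a fresh 0..n-1 index.
merged = df1.merge(df2, on="col1", how="left")
filled = merged["col2_y"].fillna(merged["col2_x"])

# Assigning `filled` to df1["col2"] aligns labels 0, 1 against 10, 11;
# reindex shows what that assignment would store: all NaN.
broken = filled.reindex(df1.index)

# Alternative: map col1 through a col1 -> col2 lookup built from df2.
# The result carries df1's index, so no realignment happens.
lookup = df2.set_index("col1")["col2"]
df1["col2"] = df1["col1"].map(lookup).fillna(df1["col2"])

print(broken)
print(df1)
```

Duplicate col1 values in df2 are another thing worth checking, since they make merged longer than df1 and shift the rows relative to df1's index.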