How to compare two columns in Excel using Python?

Question:

I have two Excel files with the following fields:
file1
col1,col2,col3,col4,col5,col6,col7,col8,col9

server1,java_yes,….
server2,java_no,….
server4,java_no,….
server8,java_no,….

file2
col1,col2,col3,col4,col5,col6,col7,col8,col9

server1,java_yes,….
server3,java_no,….
server4,java_yes,….
server8,java_no,….

I want to
a. Iterate over file1
b. Compare each entry in col1 in file1 against col1 in file2
c. If it exists, I want to see if the value in file1->col2 matches the entry in file2->col2
d. If file1->col2 does not match file2->col2 then I want to update file1->col2 to equal file2->col2

Update

I am running into a strange issue; the details are below.
It works for most entries, but for some it displays NaN even though the dataframe has java_yes in both places.
To investigate, I added a filter and printed the result at various stages.
Printing it for df1, df2 and merged gives the expected value.
Printing the same thing at the very end displays NaN for certain entries.
Very strange.

my_filter = ( df1['col1'] == 'server1' )
print(df1.loc[my_filter, 'col2'])

All except the last print returns

Yes

The very last print (for df1) returns

NaN
Asked By: user1074593


Answers:

Assuming you have a file called workbook.xlsx containing two sheets (sheet1 and sheet2), you can first load them like this:

import pandas as pd
df1 = pd.read_excel("..workbook.xlsx", sheet_name= "sheet1")
df2 = pd.read_excel("..workbook.xlsx", sheet_name= "sheet2")

Now df1 represents the first sheet and df2 represents the second sheet.

You can iterate through df1 on the column "col1" to check the condition and update df1 in place with this code:

for i in range(len(df1["col1"])):
    if (df1["col1"][i] == df2["col1"][i]) and (df1["col2"][i] != df2["col2"][i]):
        df1.at[i,"col2"] = df2["col2"][i]

However, this only compares values that sit on the same row number.
If you need to check whether a Sheet1->col1 value exists anywhere in Sheet2->col1, you can use this loop instead:

import numpy as np

for i in range(len(df1["col1"])):
    if df1["col1"][i] in df2["col1"].values:
        # position of the first row in df2 with the same col1 value
        j = np.where(df2["col1"] == df1["col1"][i])[0][0]
        df1.at[i, "col2"] = df2["col2"].iloc[j]

Finally, to store your result in a new Excel workbook, you can use:

with pd.ExcelWriter('New_result.xlsx') as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")

This sets Sheet1->col2 to the Sheet2->col2 value for every row where the Sheet1->col1 value also appears in Sheet2->col1.
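
The same lookup can also be sketched without an explicit loop. This is only an illustration, not part of the original answer: it rebuilds the question's sample data in memory instead of reading workbook.xlsx, and it assumes that when a col1 value appears more than once in the second sheet, the first occurrence should win. The name `lookup` is hypothetical.

```python
import pandas as pd

# Sample rows from the question, built in memory for illustration.
df1 = pd.DataFrame({
    "col1": ["server1", "server2", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_no", "java_no"],
})
df2 = pd.DataFrame({
    "col1": ["server1", "server3", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_yes", "java_no"],
})

# Build a col1 -> col2 lookup from df2 (assumption: keep the first duplicate).
lookup = df2.drop_duplicates("col1").set_index("col1")["col2"]

# Map each df1 key through the lookup; keys missing from df2 become NaN,
# so fall back to the value df1 already had.
df1["col2"] = df1["col1"].map(lookup).fillna(df1["col2"])

print(df1)
```

Because `map` returns a Series that carries df1's own index, the assignment lines up row for row whatever that index is.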

Answered By: Khaled Sayed

You can achieve that using pandas:

First, read the files using pd.read_excel (or pd.read_csv)

import pandas as pd

df1 = pd.read_excel("path/to/file1.xlsx")
df2 = pd.read_excel("path/to/file2.xlsx")

From the example you provided, you should have something like this:

df1

col1 col2
0 server1 java_yes
1 server2 java_no
2 server4 java_no
3 server8 java_no

df2

col1 col2
0 server1 java_yes
1 server3 java_no
2 server4 java_yes
3 server8 java_no

Now merge df2 into df1 on col1 in left mode, and overwrite df1["col2"] accordingly

merged =  df1.merge(df2, on="col1", how="left")
df1['col2'] = merged['col2_y'].fillna(merged['col2_x'])

Resulting df1 is:

col1 col2
0 server1 java_yes
1 server2 java_no
2 server4 java_yes
3 server8 java_no

EDIT: explaining the merge part

merged =  df1.merge(df2, on="col1", how="left")

This line merges df2 into df1 based on the values in the "col1" column.

how="left" is used to specify that we want to keep all col1 values from df1, even the ones that don’t exist in df2. I’ll let you check the DataFrame.merge doc for more details.

Columns that have the same name in df1 and df2 (other than the merge key) are renamed with the default suffixes _x and _y.

For the rows where the col1 value does not exist in df2, the values in the other columns will be NaN.

Here is what merged looks like:

col1 col2_x col2_y
0 server1 java_yes java_yes
1 server2 java_no NaN
2 server4 java_no java_yes
3 server8 java_no java_no
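
To see which rows actually found a match in df2, rather than inferring it from NaN, a sketch using the `indicator=True` option of `DataFrame.merge` may help. The data is the question's sample rebuilt in memory, and the names `matched` and `changed` are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({
    "col1": ["server1", "server2", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_no", "java_no"],
})
df2 = pd.DataFrame({
    "col1": ["server1", "server3", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_yes", "java_no"],
})

# indicator=True adds a "_merge" column: "both" when the key was found
# in df2, "left_only" when it exists only in df1.
merged = df1.merge(df2, on="col1", how="left", indicator=True)

matched = merged["_merge"] == "both"

# Rows whose key exists in both frames but whose col2 values differ.
changed = merged[matched & (merged["col2_x"] != merged["col2_y"])]

print(merged)
print(changed["col1"].tolist())
```

With the sample data, only server4 ends up in `changed`, which is the one row the update is expected to modify.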

From here, we want the final col2 in df1 to be:

  • col2_y (i.e. col2 from df2) when it’s not NaN (i.e. when the col1 value was in df2),
  • otherwise col2_x (i.e. col2 from df1).

In other words, we want col2_y after replacing all NaN values with the corresponding col2_x value. This is what the fillna statement does.

df1['col2'] = merged['col2_y'].fillna(merged['col2_x'])
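
One possible cause of the NaN values described in the question's update is index alignment. This is an assumption, since the asker's real data is not shown. `merge` returns a frame with a fresh 0..n-1 index, and assigning a Series to a column matches rows by index label, not by position. If df1 does not have the default index (for example after filtering or sorting), the labels do not line up. The sketch below reproduces this with the sample data and a deliberately non-default index. The names `filled`, `misaligned` and `fixed` are hypothetical, and the positional fix assumes col1 is unique in df2 so that merged has the same number of rows as df1.

```python
import pandas as pd

# Sample data from the question. The index 10..13 on df1 is an assumption
# made to reproduce the symptom.
df1 = pd.DataFrame(
    {
        "col1": ["server1", "server2", "server4", "server8"],
        "col2": ["java_yes", "java_no", "java_no", "java_no"],
    },
    index=[10, 11, 12, 13],
)
df2 = pd.DataFrame({
    "col1": ["server1", "server3", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_yes", "java_no"],
})

merged = df1.merge(df2, on="col1", how="left")  # index is reset to 0..3
filled = merged["col2_y"].fillna(merged["col2_x"])

# Series assignment aligns on labels: 10..13 has nothing in common with 0..3.
misaligned = df1.copy()
misaligned["col2"] = filled
print(misaligned["col2"].isna().all())

# Assigning the underlying array is positional, so the labels do not matter.
fixed = df1.copy()
fixed["col2"] = filled.to_numpy()
print(fixed)
```

Calling `df1 = df1.reset_index(drop=True)` before the merge is another way to make the labels match, under the same uniqueness assumption.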
Answered By: thmslmr