Python Pandas – drop duplicates from a DataFrame and merge the column values

Question:

I am trying to remove duplicate rows from my DataFrame while keeping their data, by filling it into the columns that are NA/empty.

Example:
I have the following DataFrame and I would like to remove all the duplicates in column A but merge the values from the rest of the columns:

A  B  C  D  E
1  X
2  X
2     X
2        X
3     X
3        X
2           X

The expected output:

A  B  C  D  E
1  X
2  X  X  X  X
3     X  X

How can I perform the above dynamically?

Thanks in advance for the answers

Asked By: Darmon


Answers:

I managed to achieve this by grouping the DataFrame by column A and then aggregating the other columns with a custom function that concatenates the non-null values.

Here is the code:

import pandas as pd

# Define a custom function to concatenate non-null values
def concat_non_null(x):
    return ' '.join(filter(lambda v: pd.notnull(v), x))

# Define the DataFrame
df = pd.DataFrame({
    'A': [1, 2, 2, 2, 3, 3, 2],
    'B': ['X', 'X', None, None, None, None, None],
    'C': [None, None, 'X', None, 'X', None, None],
    'D': [None, None, None, 'X', None, 'X', None],
    'E': [None, None, None, None, None, None, 'X']
})

# Group the DataFrame by column A and aggregate the other columns using the custom function
df_agg = df.groupby('A').agg({
    'B': concat_non_null,
    'C': concat_non_null,
    'D': concat_non_null,
    'E': concat_non_null
}).reset_index()

# Print the result
print(df_agg)
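
The code above names columns B to E explicitly. Since the question asks for a dynamic solution, here is a minimal sketch that builds the aggregation mapping from whatever columns the DataFrame has. The names key and value_cols are only illustrative, and it assumes every non-key column should be merged the same way:

```python
import pandas as pd

def concat_non_null(values):
    # Join the non-null entries of one group's column with spaces;
    # a group with no values at all yields an empty string.
    return ' '.join(values.dropna().astype(str))

df = pd.DataFrame({
    'A': [1, 2, 2, 2, 3, 3, 2],
    'B': ['X', 'X', None, None, None, None, None],
    'C': [None, None, 'X', None, 'X', None, None],
    'D': [None, None, None, 'X', None, 'X', None],
    'E': [None, None, None, None, None, None, 'X']
})

key = 'A'  # illustrative name for the column that holds the duplicates
# Every column other than the key is aggregated, so nothing is hard-coded
value_cols = [col for col in df.columns if col != key]

df_agg = df.groupby(key).agg(
    {col: concat_non_null for col in value_cols}
).reset_index()

print(df_agg)
```

If a group holds several non-null values in the same column, they end up joined by a space in one cell.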
Answered By: Darmon

You can use groupby(...).first(), because it computes the first non-null entry of each column within each group:

>>> df.groupby('A', as_index=False).first()
   A     B     C     D     E
0  1     X  None  None  None
1  2     X     X     X     X
2  3  None     X     X  None
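
For reference, a self-contained sketch of the same call, rebuilding the sample data from the question. The final fillna('') step is an assumption on my part: it only matters if the merged table should show blanks instead of None.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 2, 3, 3, 2],
    'B': ['X', 'X', None, None, None, None, None],
    'C': [None, None, 'X', None, 'X', None, None],
    'D': [None, None, None, 'X', None, 'X', None],
    'E': [None, None, None, None, None, None, 'X']
})

# first() keeps, per group and per column, the first non-null value
result = df.groupby('A', as_index=False).first()

# Assumption: empty cells are preferred over None in the final table
result = result.fillna('')

print(result)
```

Note that first() keeps a single value per column and group, so if a group has several different non-null values in one column, only the first survives.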
Answered By: Corralien