Aggregating df columns but not duplicates

Question:

Is there a neat way to aggregate columns into a new column without duplicating information?

For example, if I have a df:

     Description  Information
  0       text1     text1
  1       text2     text3
  2       text4     text5

And I want to create a new column called ‘Combined’, which aggregates ‘Description’ and ‘Information’ to get:

     Description  Information  Combined
  0       text1     text1        text1
  1       text2     text3      text2 text3
  2       text4     text5      text4 text5

So far I have been using np.where and [mask] to check for duplicates before aggregating with df['Combined'] = df[['Description', 'Information']].agg(' '.join, axis=1)

Although this works, it is not practical on a larger scale, grateful if anyone knows of a simpler way!

Asked By: Jon

||

Answers:

You can first run unique:

df['Combined'] = (df[['Description', 'Information']]
                  .agg(lambda x: ' '.join(x.unique()), axis=1)
                 )

Output:

  Description Information     Combined
0       text1       text1        text1
1       text2       text3  text2 text3
2       text4       text5  text4 text5
Answered By: mozway

Add all of the first column and add the second column only where they differ:

In [17]: df
Out[17]:
    a   b
0  t1  t1
1  t2  t3
2  t4  t5

In [18]: df['c'] = df['a']

In [19]: df
Out[19]:
    a   b   c
0  t1  t1  t1
1  t2  t3  t2
2  t4  t5  t4

In [20]: df.loc[df.a!=df.b, 'c'] = df['c'] + ' ' + df['b']

In [21]: df
Out[21]:
    a   b      c
0  t1  t1     t1
1  t2  t3  t2 t3
2  t4  t5  t4 t5
Answered By: inspectorG4dget

You can do something like this:

import pandas as pd


def combine(row):
    if row['Description'] == row['Information']:
        return row['Description']
    else:
        return row['Description'] + ' ' + row['Information']

df = pd.read_csv('test2.csv')
df['new'] = df.apply(combine, axis=1)
print(df)


  Description Information          new
0       text1       text1        text1
1       text2       text3  text2 text3
2       text4       text5  text4 text5
Answered By: NYC Coder

You can use numpy.where:

import numpy as np

df['Combined'] = np.where(df['Description'] != df['Information'], 
                          df['Information'] + ' ' + df['Description'], 
                          df['Information'])
print(df)

# Output
  Description Information     Combined
0       text1       text1        text1
1       text2       text3  text3 text2
2       text4       text5  text5 text4
Answered By: Corralien

Another possible solution, based on list comprehension:

df['Combined'] = [x[0] + ' ' + x[1] if x[0] != x[1] else x[0]
                  for x in zip(df.Description, df.Information)]

Output:

  Description Information     Combined
0       text1       text1        text1
1       text2       text3  text2 text3
2       text4       text5  text4 text5
Answered By: PaulS
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.