Add missing columns to the dataframes from other dataframes

Question:

I have a list of dataframes

dfA:

item   a     A              
A      1     2 
B      1     3         
C      0     4         

dfB:

item   a     B
E      1     2
F      0     6

dfC:

item   a     C
G      1     3
H      0     4

I want to add the missing columns to each dataframe

This is what I want :
dfA:

item   a     A    B    C           
A      1     2    0    0
B      1     3    0    0 
C      0     4    0    0 

dfB:

item   a     A   B    C
E      1     0   2    0
F      0     0   6    0

dfC:

item   a     A   B   C
G      1     0   0   3
H      0     0   0   4
Asked By: niukasu

||

Answers:

You can create a combined column list like this

col_list = (df1.append([df2,df3])).columns.tolist()

Now add the columns to each dataframe

df1 = df1.loc[:, col_list].fillna(0)
print(df1)

    A   B   C   a   item
0   2   0.0 0.0 1   A
1   3   0.0 0.0 1   B
2   4   0.0 0.0 0   C


df2 = df2.loc[:, col_list].fillna(0)
print(df2)

    A   B   C   a   item
0   0.0 2   0.0 1   E
1   0.0 6   0.0 0   F

df3 = df3.loc[:, col_list].fillna(0)
print(df3)

    A   B   C   a   item
0   0.0 0.0 3   1   G
1   0.0 0.0 4   0   H
Answered By: Vaishali

One way using merge by defining the order of dfA, dfB, dfC in reduce operation.

In [1932]: reduce(lambda l,r: pd.merge(l,r,on=['item', 'a'], how='left'),
                              [dfA, dfB, dfC]).fillna(0)
Out[1932]:
  item  a  A    B    C
0    A  1  2  0.0  0.0
1    B  1  3  0.0  0.0
2    C  0  4  0.0  0.0

In [1933]: reduce(lambda l,r: pd.merge(l,r,on=['item', 'a'], how='left'), 
                  [dfB, dfA, dfC]).fillna(0)
Out[1933]:
  item  a  B    A    C
0    E  1  2  0.0  0.0
1    F  0  6  0.0  0.0

In [1934]: reduce(lambda l,r: pd.merge(l,r,on=['item', 'a'], how='left'),
                  [dfC, dfA, dfB]).fillna(0)
Out[1934]:
  item  a  C    A    B
0    G  1  3  0.0  0.0
1    H  0  4  0.0  0.0
Answered By: Zero

Option 1
Align both axes
With functools.partial

from functool import partial

(_, dfA), (dfC, dfB) = list(map(
    partial(dfC.align, fill_value=0),
    dfA.align(dfB, fill_value=0)
))

Option 1B
Align columns only

from functools import partial

(_, dfA), (dfC, dfB) = list(map(
    partial(dfC.align, fill_value=0, axis=1),
    dfA.align(dfB, fill_value=0, axis=1)
))

Option 2
Align both axes
With pd.DataFrame.reindex

from functools import reduce    

lod = [dfA, dfB, dfC]
idx = reduce(pd.Index.union, (d.index for d in lod))
col = reduce(pd.Index.union, (d.columns for d in lod))
dfA, dfB, dfC = (d.reindex(idx, col, fill_value=0) for d in lod)

Option 2B
Align columns only

lod = [dfA, dfB, dfC]
col = reduce(pd.Index.union, (d.columns for d in lod))
dfA, dfB, dfC = (d.reindex(columns=col, fill_value=0) for d in lod)

Setup

dfA = pd.DataFrame(**{
    'columns': ['item', 'a', 'A'],
    'data': [['A', 1, 2], ['B', 1, 3], ['C', 0, 4]],
    'index': [0, 1, 2]})

dfB = pd.DataFrame(**{
    'columns': ['item', 'a', 'B'],
    'data': [['E', 1, 2], ['F', 0, 6]],
    'index': [0, 1]})

dfC = pd.DataFrame(**{
    'columns': ['item', 'a', 'C'],
    'data': [['G', 1, 3], ['H', 0, 4]],
    'index': [0, 1]})
Answered By: piRSquared

1) Take the union of each dataframe’s columns.

col_list = list(set().union(dfA.columns, dfB.columns, dfC.columns))
col_list.sort()
['A', 'B', 'C', 'a']

2) Use the reindex function.

dfA2 = dfA.reindex(columns=col_list, fill_value=0)
   A  B  C  a
A  2  0  0  1
B  3  0  0  1
C  4  0  0  0

dfB2 = dfB.reindex(columns=col_list, fill_value=0)
   A  B  C  a
E  0  2  0  1
F  0  6  0  0

dfC2 = dfC.reindex(columns=col_list, fill_value=0)
   A  B  C  a
G  0  0  3  1
H  0  0  4  0

3) You can use reindex to drop, add, or duplicate columns.

dfA3 = dfA.reindex(columns=['C', 'A', 'A', 'D'], fill_value=0)
   C  A  A  D
A  0  2  2  0
B  0  3  3  0
C  0  4  4  0
Answered By: Kevin Lin

You can use columns union and difference:

dfA.loc[:, dfB.columns.union(dfC.columns).difference(dfA.columns)] = 0
print(dfA)

Output:

      a  A  B  C
item            
A     1  2  0  0
B     1  3  0  0
C     0  4  0  0

You can also use a for loop:

dfs = [dfA, dfB, dfC]
all_columns = set().union(*[df.columns for df in dfs])
for df in dfs:
    df.loc[:, all_columns.difference(df.columns)] = 0
    print(df, end='nn')

Output:

      a  A  B  C
item            
A     1  2  0  0
B     1  3  0  0
C     0  4  0  0

      a  B  A  C
item            
E     1  2  0  0
F     0  6  0  0

      a  C  A  B
item            
G     1  3  0  0
H     0  4  0  0
Answered By: Mykola Zotko
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.