Pandas sum across columns and divide each cell from that value

Question:

I have read a csv file and pivoted it to get to following structure:

pivoted = df.pivot('user_id', 'group', 'value')
lookup = df.drop_duplicates('user_id')[['user_id', 'group']]
lookup.set_index(['user_id'], inplace=True)
result = pivoted.join(lookup)
result = result.fillna(0) 

Section of the result:

             0     1     2    3     4    5   6  7    8   9  10  11  12  13  group
user_id                                                                      
2        33653  2325   916  720   867  187  31  0    6   3  42  56  92  15    l-1
4        18895   414  1116  570  1190   55  92  0  122  23  78   6   4   2    l-2 
16        1383    70    27   17    17    1   0  0    0   0   1   0   0   0    l-2
50         396    72    34    5    18    0   0  0    0   0   0   0   0   0    l-3
51        3915  1170   402  832  2791  316  12  5  118  51  32   9  62  27    l-4

I want to sum across column 0 to column 13 by each row and divide each cell by the sum of that row. I am still getting used to pandas; if I understand correctly, we should try to avoid for loops when doing things like this? In other words, how can I do this in a ‘pandas’ way?

Asked By: add-semi-colons

||

Answers:

The following seemed to work fine for me:

In [39]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
result[cols]  = result[cols].apply(lambda row: row / row.sum(axis=1), axis=1)
result

Out[39]:
                0         1         2         3         4         5         6  
user_id                                                                         
2        0.864827  0.059749  0.023540  0.018503  0.022280  0.004806  0.000797   
4        0.837285  0.018345  0.049453  0.025258  0.052732  0.002437  0.004077   
16       0.912269  0.046174  0.017810  0.011214  0.011214  0.000660  0.000000   
50       0.754286  0.137143  0.064762  0.009524  0.034286  0.000000  0.000000   
51       0.401868  0.120099  0.041265  0.085403  0.286491  0.032437  0.001232   

                7         8         9        10        11        12        13  
user_id                                                                         
2        0.000000  0.000154  0.000077  0.001079  0.001439  0.002364  0.000385   
4        0.000000  0.005406  0.001019  0.003456  0.000266  0.000177  0.000089   
16       0.000000  0.000000  0.000000  0.000660  0.000000  0.000000  0.000000   
50       0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
51       0.000513  0.012113  0.005235  0.003285  0.000924  0.006364  0.002772   

        group  
user_id        
2         l-1  
4         l-2  
16        l-2  
50        l-3  
51        l-4  

OK scratch the above, the following will be much faster:

result[cols]  = result[cols].div(result[cols].sum(axis=1), axis=0)

And just to prove the result is the same:

In [47]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
np.alltrue(result[cols].div(result[cols].sum(axis=1), axis=0) == result[cols].apply(lambda row: row / row.sum(axis=1), axis=1))
Out[47]:
True

And that it’s faster:

In [48]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
%timeit result[cols].div(result[cols].sum(axis=1), axis=0) 
%timeit result[cols].apply(lambda row: row / row.sum(axis=1), axis=1)
100 loops, best of 3: 2.38 ms per loop
100 loops, best of 3: 4.47 ms per loop
Answered By: EdChum

Try the following:

In [1]: import pandas as pd

In [2]: df = pd.read_csv("test.csv")

In [3]: df
Out[3]: 
  id  value1  value2  value3
0  A       1       2       3
1  B       4       5       6
2  C       7       8       9

In [4]: df["sum"] = df.sum(axis=1)

In [5]: df
Out[5]: 
  id  value1  value2  value3  sum
0  A       1       2       3    6
1  B       4       5       6   15
2  C       7       8       9   24

In [6]: df_new = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)

In [7]: df_new
Out[7]: 
     value1    value2  value3
0  0.166667  0.333333   0.500
1  0.266667  0.333333   0.400
2  0.291667  0.333333   0.375

Or you can do the following:

In [8]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)

In [9]: df
Out[9]: 
  id    value1    value2  value3  sum
0  A  0.166667  0.333333   0.500    6
1  B  0.266667  0.333333   0.400   15
2  C  0.291667  0.333333   0.375   24

Or just straight up from the beginning:

In [10]: df = pd.read_csv("test.csv")

In [11]: df
Out[11]: 
  id  value1  value2  value3
0  A       1       2       3
1  B       4       5       6
2  C       7       8       9

In [12]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df.sum(axis=1), axis=0)

In [13]: df
Out[13]: 
  id    value1    value2  value3
0  A  0.166667  0.333333   0.500
1  B  0.266667  0.333333   0.400
2  C  0.291667  0.333333   0.375

Changing the column value1 and the like to your headers should work similarly.

Answered By: WitchGod

More simply:

result.div(result.sum(axis=1), axis=0)
Answered By: Souf Ee

easier to work per column:

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
(df.T / df.T.sum()).T

result:

         0         1      2
0  0.166667  0.333333  0.500
1  0.266667  0.333333  0.400
2  0.291667  0.333333  0.375
Answered By: ihadanny

result.iloc[:,:-1].div(result.iloc[:,:-1].sum(axis=1), axis=0)

result.iloc[:,:-1] gets all rows and columns except last column

result.iloc[:,:-1].sum(axis=1) sums across a row due to axis=1, default is axis=0 i.e. column

result.div(result, axis=0) axis=0 because default for div is column i.e. axis=1

Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.