Pandas: New Column that is division of groups

Question:

I have a pandas dataframe like the following:

Attr1, Attr2, ... , AttrN, Val, Flag
   a1,  b1.1, ... ,    N1, 100,    A
   a2,  b2.1, ... ,    N2, 200,    A 
   a1,  b1.2, ... ,    N1,  20,    B
   a2,  b2.2, ... ,    N2,  50,    B

Basically, the table can be divided into two regions: Flag==A and Flag==B. There is always an A row that corresponds to a B row. "Corresponds" means that a certain subset of the AttrX columns match exactly (here Attr1). However, some attributes (here Attr2) contain floating-point values that are not guaranteed to match. There is also the column Val, which contains the actual quantity of interest.

What I now would like to have is a reordering like this:

Attr1, Attr2A, Attr2B, ... , AttrN, Val_A/B
   a1,   b1.1,   b1.2, ... ,    N1,       5
   a2,   b2.1,   b2.2, ... ,    N2,       4

Common Attributes should be merged, differing attributes should get a column for both values of Flag, and the entries of the column Val shall be divided (A/B).

Asked By: Seriously


Answers:

One possible way to do this:

import pandas as pd

# pivot your table (df is the DataFrame from the question)
res = pd.pivot_table(
    data=df, 
    index=['Attr1'],
    columns=['Flag'], 
    values=['Attr2','AttrN','Val'],
    aggfunc='first')
# print(res.columns)
# columns are a Multiindex now, looking like this: [('Attr2', 'A'),('Attr2', 'B'),...]
# join it to single level
res.columns= res.columns.map(''.join)

# calculation
res['ValA'] = res['ValA'].div(res['ValB'])

# drop unnecessary column and rename 'ValA'
res = res.drop('ValB',axis=1).rename(columns={'ValA' : 'Val'}).reset_index()

Output res:

  Attr1 Attr2A Attr2B AttrNA AttrNB   Val
0    a1   b1.1   b1.2     N1     N1   5.0
1    a2   b2.1   b2.2     N2     N2   4.0

I assume there is a small mistake in your desired output, and the Nth Attr should also end up with an A and a B column?
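To try the approach above end to end, here is a minimal self-contained sketch. It assumes the sample data from the question with the elided "..." columns left out, and it writes the ratio into a new Val column instead of overwriting ValA; apart from that it follows the same steps.

```python
import pandas as pd

# Sample data from the question; the elided "..." columns are omitted.
df = pd.DataFrame({
    "Attr1": ["a1", "a2", "a1", "a2"],
    "Attr2": ["b1.1", "b2.1", "b1.2", "b2.2"],
    "AttrN": ["N1", "N2", "N1", "N2"],
    "Val":   [100, 200, 20, 50],
    "Flag":  ["A", "A", "B", "B"],
})

# One row per Attr1, one column per (value column, Flag) pair.
res = pd.pivot_table(
    data=df,
    index=["Attr1"],
    columns=["Flag"],
    values=["Attr2", "AttrN", "Val"],
    aggfunc="first",
)

# Flatten the MultiIndex columns, e.g. ('Val', 'A') becomes 'ValA'.
res.columns = res.columns.map("".join)

# Divide the A values by the B values, then drop the two inputs.
res["Val"] = res["ValA"].div(res["ValB"])
res = res.drop(columns=["ValA", "ValB"]).reset_index()

print(res)
```

If AttrN really is identical for A and B, one of the two AttrN columns could be dropped afterwards; whether that is safe depends on your data.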

EDIT
Explanation on aggfunc: first

When pivoting, several rows can fall into the same cell of the result. aggfunc defines how to reduce them to one value. If you don’t pass an aggfunc, the mean is the default, but the mean only works for numerical data, so for your data every column except Val would be missing (depending on the pandas version, the non-numeric columns may be dropped or may cause an error instead). Since you don’t have duplicates, first simply returns the first (and only) value of each group.

Here is your data with another row added (row 1) for demonstration:

   Attr1  Attr2 AttrN   Val Flag
0     a1   b1.1    N1   100    A
1     a1  b11.1   N11  1001    A
2     a2   b2.1    N2   200    A
3     a1   b1.2    N1    20    B
4     a2   b2.2    N2    50    B

For index=['Attr1'] and columns=['Flag'] there is now more than one value for the group (a1, A). But row 1 never shows up in your pivot, because it is in the same group as row 0 and we only take the first value.
You can try out what happens if you skip aggfunc, or use last instead of first. Maybe it gets clearer then.
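As a small sketch of that experiment, the following rebuilds the five-row sample shown above and pivots it once with first and once with last. The helper name pivot_with is only an illustration and not part of pandas.

```python
import pandas as pd

# Five-row sample from above: rows 0 and 1 both belong to group (a1, A).
df = pd.DataFrame({
    "Attr1": ["a1", "a1", "a2", "a1", "a2"],
    "Attr2": ["b1.1", "b11.1", "b2.1", "b1.2", "b2.2"],
    "AttrN": ["N1", "N11", "N2", "N1", "N2"],
    "Val":   [100, 1001, 200, 20, 50],
    "Flag":  ["A", "A", "A", "B", "B"],
})

def pivot_with(aggfunc):
    """Hypothetical helper: same pivot as in the answer, with a chosen aggfunc."""
    return pd.pivot_table(
        data=df,
        index=["Attr1"],
        columns=["Flag"],
        values=["Attr2", "AttrN", "Val"],
        aggfunc=aggfunc,
    )

first = pivot_with("first")
last = pivot_with("last")

# Only the duplicated group (a1, A) differs between the two results.
print(first.loc["a1", ("Val", "A")])  # 100, taken from row 0
print(last.loc["a1", ("Val", "A")])   # 1001, taken from row 1
```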

I’m not sure whether what I’m saying is entirely accurate, but aggfunc accepts any function (including custom functions of your own) that works on a group of values, provided that it reduces the group to a single output value (see this question for more details).
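To illustrate the point about custom functions, here is a hedged sketch that passes a user-defined reducer as aggfunc. The function spread is a made-up example (largest minus smallest value in a group), and only the numeric Val column is pivoted so that the subtraction is well defined.

```python
import pandas as pd

# Same five-row sample as above, reduced to the columns needed here.
df = pd.DataFrame({
    "Attr1": ["a1", "a1", "a2", "a1", "a2"],
    "Val":   [100, 1001, 200, 20, 50],
    "Flag":  ["A", "A", "A", "B", "B"],
})

def spread(values):
    """Hypothetical custom reducer: collapse a group to max minus min."""
    return values.max() - values.min()

out = pd.pivot_table(
    data=df,
    index="Attr1",
    columns="Flag",
    values="Val",
    aggfunc=spread,
)

print(out)
```

Group (a1, A) contains 100 and 1001, so its spread is 901; every other group holds a single value and therefore yields 0.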

Answered By: Rabinzel