Group and create three new columns by condition [Low, Hit, High]

Question

I have a large dataset (~5 million rows) with results from a machine learning training run. I want to check whether each result hits the "target range" or not. Let's say this range contains all values between -0.25 and +0.25. A value inside this range is a Hit, a value below it is Low, and a value above it is High.

My plan is to create three columns (Hit, Low, High), work out for each row which condition applies, and put a 1 in that column and a 0 in the other two. After that I would group the values and sum them up. But I suspect there must be a better and faster way, such as calculating the counts directly while grouping.

Data

import pandas as pd

df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})

+----+--------+---------+
|    | Type   |   Value |
|----+--------+---------|
|  0 | RF     |    -1.5 | <- Low
|  1 | RF     |    -0.1 | <- Hit
|  2 | RF     |     1.7 | <- High
|  3 | MLP    |     0.2 | <- Hit
|  4 | MLP    |    -0.7 | <- Low
|  5 | MLP    |    -0.6 | <- Low
+----+--------+---------+

Expected Output

pd.DataFrame({"Type":["RF", "MLP"], "Low":[1,2], "Hit":[1,1], "High":[1,0]})

+----+--------+-------+-------+--------+
|    | Type   |   Low |   Hit |   High |
|----+--------+-------+-------+--------|
|  0 | RF     |     1 |     1 |      1 |
|  1 | MLP    |     2 |     1 |      0 |
+----+--------+-------+-------+--------+

Asked By: Marco_CH

Answer 1

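You could use pd.cut to define the groups and pivot_table to reshape. By default the bins are closed on the right, so a value of exactly -0.25 counts as Low and a value of exactly 0.25 counts as Hit: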

(df.assign(group=pd.cut(df['Value'],
                        [float('-inf'), -0.25, 0.25, float('inf')],
                        labels=['Low', 'Hit', 'High']))
   .pivot_table(index='Type', columns='group', values='Value', aggfunc='count')
   .reset_index()
   .rename_axis(None, axis=1)
)

Or crosstab:

(pd.crosstab(df['Type'],
             pd.cut(df['Value'],
                    [float('-inf'), -0.25, 0.25, float('inf')],
                    labels=['Low', 'Hit', 'High'])
             )
   .reset_index().rename_axis(None, axis=1)
 )

Output:

  Type  Low  Hit  High
0  MLP    2    1     0
1   RF    1    1     1
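
Since the question does not say which side the edges -0.25 and 0.25 belong to, here is a small sketch of how pd.cut treats them. The `edges` series is made up purely for illustration; `right` is a regular pd.cut parameter.

```python
import pandas as pd

# Hypothetical boundary values, chosen only to show which bin each edge lands in.
edges = pd.Series([-0.25, 0.25])
bins = [float("-inf"), -0.25, 0.25, float("inf")]
labels = ["Low", "Hit", "High"]

# Default right=True: bins are (-inf, -0.25], (-0.25, 0.25], (0.25, inf)
right_closed = pd.cut(edges, bins, labels=labels)

# right=False: bins are [-inf, -0.25), [-0.25, 0.25), [0.25, inf)
left_closed = pd.cut(edges, bins, labels=labels, right=False)

print(right_closed.tolist())  # ['Low', 'Hit']
print(left_closed.tolist())   # ['Hit', 'High']
```

Neither setting puts both edges into Hit. If both should count as hits, explicit comparisons with >= and <= are probably the simpler route.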

Answered By: mozway

Answer 2

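You can assign the labels with np.select and then use crosstab:

import numpy as np

# note: with <= and >=, the edges -0.25 and 0.25 count as Low and High
c1 = df.Value<=-0.25
c2 = df.Value>=0.25
out = pd.crosstab(df['Type'], np.select([c1,c2], ['Low','High'], default='Hit'))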
out
Out[32]: 
col_0  High  Hit  Low
Type                 
MLP       0    1    2
RF        1    1    1
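
crosstab sorts the labels alphabetically, which is why the columns come out as High, Hit, Low. A possible way to get the layout from the question is to reindex the columns afterwards. This is only a sketch; the strict < and > comparisons are an assumption that the edges themselves count as Hit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Type": ["RF", "RF", "RF", "MLP", "MLP", "MLP"],
                   "Value": [-1.5, -0.1, 1.7, 0.2, -0.7, -0.6]})

# Assumption: values exactly on -0.25 or 0.25 are treated as Hit.
labels = np.select([df["Value"] < -0.25, df["Value"] > 0.25],
                   ["Low", "High"], default="Hit")

out = (pd.crosstab(df["Type"], labels)
         # fix the column order; fill_value=0 covers a label that never occurs
         .reindex(columns=["Low", "Hit", "High"], fill_value=0)
         .reset_index()
         .rename_axis(None, axis=1))
print(out)
```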

Answered By: BENY

Answer 3

You can try this:

# Your code
import pandas as pd

df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})

# Set your range 
RANGE_MIN = -0.25
RANGE_MAX = 0.25

# --- define functions to be applied to df ---
# evaluate if value is a low
def eval_low(value):
    if value < RANGE_MIN:
        return 1
    else:
        return 0

# evaluate if value is a high
def eval_high(value):
    if value > RANGE_MAX:
        return 1
    else:
        return 0

# evaluate if value is a hit
def eval_hit(value):
    if value >= RANGE_MIN and value <= RANGE_MAX:
        return 1
    else:
        return 0

# Evaluate the functions in new columns
df['Low'] = df.Value.apply(eval_low)
df['Hit'] = df.Value.apply(eval_hit)
df['High'] = df.Value.apply(eval_high)

# get the summary (only the indicator columns, so Value is not summed as well)
df.groupby('Type')[['Low', 'Hit', 'High']].sum()
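
With ~5 million rows, calling a Python function per row through apply is likely to be slow. The same idea can be sketched with vectorized comparisons. The `flags` and `summary` names are just illustrative, and `between` is inclusive on both ends by default, matching the >= / <= check above.

```python
import pandas as pd

df = pd.DataFrame({"Type": ["RF", "RF", "RF", "MLP", "MLP", "MLP"],
                   "Value": [-1.5, -0.1, 1.7, 0.2, -0.7, -0.6]})

RANGE_MIN = -0.25
RANGE_MAX = 0.25

# One boolean column per condition, computed for the whole column at once.
flags = pd.DataFrame({
    "Low": df["Value"] < RANGE_MIN,
    "Hit": df["Value"].between(RANGE_MIN, RANGE_MAX),  # inclusive on both ends
    "High": df["Value"] > RANGE_MAX,
})

# Summing booleans per group counts the True values.
summary = flags.groupby(df["Type"]).sum()
print(summary)
```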

Answered By: Davide Laghi

Answer 4

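import numpy as np

# wrapped in parentheses so the chained calls can span several lines;
# the counts column is named explicitly because its default name depends on the pandas version
(df.assign(Value=pd.cut(df.Value,[-np.inf,-0.25,0.25,np.inf],labels=['Low','Hit','High']))
    .groupby('Type').value_counts().reset_index(name='count')
    .pivot_table(index='Type',columns='Value',values='count')
)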

Value  Low  Hit  High
Type                 
MLP      2    1     0
RF       1    1     1

Answered By: G.G

Answer 5

import pandas as pd

df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})

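# map applies the lambda element-wise; <= keeps a value of exactly -0.25 from falling through to 'high'
df['target'] = df['Value'].map(lambda x: 'hit' if abs(x) <= 0.25 else 'low' if x < -0.25 else 'high')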
df['Value'] = 1 

r = df.groupby(['Type', 'target']).count().unstack().fillna(0)
r.columns = list(r.columns.droplevel(0))

print(r)

Result

      high  hit  low
Type                
MLP    0.0  1.0  2.0
RF     1.0  1.0  1.0
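
The counts show up as floats because unstack first inserts NaN for the missing MLP/high combination and fillna(0) only replaces it afterwards. A possible variant that keeps integers is to pass fill_value to unstack directly. This is a sketch; the `target` series name is illustrative and the edges are assumed to count as hits.

```python
import pandas as pd

df = pd.DataFrame({"Type": ["RF", "RF", "RF", "MLP", "MLP", "MLP"],
                   "Value": [-1.5, -0.1, 1.7, 0.2, -0.7, -0.6]})

# Assumption: values exactly on -0.25 or 0.25 are treated as hits.
target = df["Value"].map(
    lambda x: "hit" if abs(x) <= 0.25 else "low" if x < -0.25 else "high"
).rename("target")

# size() counts rows per (Type, target) pair; fill_value=0 avoids the NaN detour.
r = df.groupby(["Type", target]).size().unstack(fill_value=0)
print(r)
```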

Answered By: Laurent B.

Answer 6

Here is a way using pd.cut() and .str.get_dummies() (with numpy imported as np):

(pd.cut(df.set_index('Type')['Value'],bins = [-np.inf,-.25,.25,np.inf],labels = ['low','hit','high'])
.str.get_dummies()
.groupby(level=0).sum()
.reset_index())

Output:

  Type  high  hit  low
0  MLP     0    1    2
1   RF     1    1    1

Answered By: rhug123
