Pandas labeling in a for loop
Question:
Here’s the thing: I want to add a new column as a label for a selection of rows.
When failure is 1, select the 2 rows before and the 1 row after, then add a label column. Here is my attempt:
    n = 0
    df_new = pd.DataFrame()
    for i in range(0, len(df)):
        if df.iloc[i]['failure'] == 1:
            n += 1
            df_new = df_new.append(df.iloc[i-2:i+2])
            df_new = df_new.append({'label': n}, ignore_index=True)
The result of that:
    var_1 | var_2 | failure | label
------------------------------------
0    75.0 |  55.0 |     0.0 |   NaN
1    45.0 |  19.0 |     0.0 |   NaN
2    76.0 |  46.0 |     1.0 |   NaN
3    18.0 |  63.0 |     0.0 |   NaN
4     NaN |   NaN |     NaN |   1.0
But I want:
    var_1 | var_2 | failure | label
------------------------------------
0    75.0 |  55.0 |     0.0 |     1
1    45.0 |  19.0 |     0.0 |     1
2    76.0 |  46.0 |     1.0 |     1
3    18.0 |  63.0 |     0.0 |     1
Answers:
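Instead of a for loop, a more pandas-style approach would be to first compute a rolling sum of failure as a Series, then add a label to your frame based on a condition on that sum.
For example:
    signal = df['failure'].rolling(window=4).sum().shift(-2)
(Double-check the shift offset to make sure it’s what you intend. A shift of -2 makes each row’s 4-row window run from 1 row before it to 2 rows after it, which matches "2 rows before and 1 after" a failure. Rows whose window runs past either end of the frame come out as NaN and end up labeled 0.)
Then you can create the label:
    df['label'] = np.where(signal >= 1, 1, 0)
Using >= 1 rather than == 1 keeps rows that fall inside two overlapping failure windows. Note that this gives a 0/1 flag, not a running count per failure.
Does that fit what you need?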
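One way to check the window alignment is to try it on a tiny frame where the expected rows are obvious. The sketch below is only an illustration under stated assumptions, not the answerer's code: the toy failure values and the names toy, f and near_failure are made up, and it sums shifted copies of the column (Series.shift with fill_value=0) instead of using rolling, so rows at the edges of the frame are not lost to NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: failures at positions 2 and 7
toy = pd.DataFrame({'failure': [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]})
f = toy['failure']

# A row lies within "2 before / 1 after" of a failure if a failure occurs
# one row earlier, on the row itself, or one or two rows later.
near_failure = (
    f.shift(1, fill_value=0)       # failure one row earlier
    + f                            # failure on this row
    + f.shift(-1, fill_value=0)    # failure one row later
    + f.shift(-2, fill_value=0)    # failure two rows later
)

# >= 1 rather than == 1, so rows covered by two overlapping windows still count
toy['label'] = np.where(near_failure >= 1, 1, 0)
print(toy['label'].tolist())  # [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
```

Rows 0 to 3 and 5 to 8 are flagged, which is 2 rows before and 1 row after each failure. This still produces a 0/1 flag only; it does not number each failure window the way n does in the loop.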
For the dataset:
A dataset with 10,000 rows and 6 columns of random integers from 0 to 99, plus a last column, failure, that is a random 0/1 flag (1 with probability 0.1):
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 6)), columns=['var_' + str(i) for i in range(1, 7)])
    df['failure'] = np.random.binomial(1, 0.1, size=10000)
When failure is 1, select the 2 rows before and the 1 row after, then add a label column:
    n = 0
    pieces = []
    for i in range(len(df)):
        if df.iloc[i]['failure'] == 1:
            n += 1
            # rows i-2 .. i+1; max() keeps a failure in the first two rows from producing an empty slice
            window = df.iloc[max(i - 2, 0):i + 2].copy()
            window['label'] = n
            pieces.append(window)
    # DataFrame.append was removed in pandas 2.0, so collect the windows and concatenate once
    df_new = pd.concat(pieces, ignore_index=True)
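If the loop is too slow on a large frame, the same numbered labels can probably be built without iterating over rows. This is a rough sketch under assumptions rather than a tested replacement: df_toy, fail_pos, offsets and df_labeled are hypothetical names, and the toy values are invented so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; failures at positions 1 and 5
df_toy = pd.DataFrame({
    'var_1': [10, 20, 30, 40, 50, 60, 70],
    'failure': [0, 1, 0, 0, 0, 1, 0],
})

fail_pos = np.flatnonzero(df_toy['failure'].to_numpy() == 1)  # positions of the failures
offsets = np.arange(-2, 2)                                    # 2 before, the row itself, 1 after
rows = fail_pos[:, None] + offsets                            # one row of positions per failure

# Label k for every position in the k-th failure's window (1-based, like n in the loop)
labels = np.broadcast_to(np.arange(1, len(fail_pos) + 1)[:, None], rows.shape)

# Drop positions that fall outside the frame
valid = (rows >= 0) & (rows < len(df_toy))
df_labeled = df_toy.iloc[rows[valid]].copy()
df_labeled['label'] = labels[valid]
df_labeled = df_labeled.reset_index(drop=True)
print(df_labeled)
```

On the toy frame the first window is clipped to rows 0 to 2 (label 1) and the second covers rows 3 to 6 (label 2). As with the loop, a row that sits in two overlapping windows appears once per window.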