Changing column values in python
Question:
I have a dataframe of shape:
Col1 | Col2 |
---|---|
0.3 | 1 |
0.22 | 0 |
0.89 | 0 |
0.12 | 1 |
0.54 | 0 |
0.11 | 1 |
Assume that this dataset is sorted based on time and df.iloc[1] is before df.iloc[2]. Also assume that Col2 is binary.
What I would like to do is change the value of each Col2 sample as follows:
df.iloc[i]['Col2'] is 1 if any of the next 2 samples is 1 in the dataframe, else it is 0. Leave the last 2 elements of the dataframe unchanged.
For example the result here would be:
Col1 | Col2 |
---|---|
0.3 | 0 |
0.22 | 1 |
0.89 | 1 |
0.12 | 1 |
0.54 | 1 |
0.11 | 1 |
What I have done so far:
for i, j in df.iterrows():
    if i < df.shape[0] - 2:
        df.iloc[i]['Col2'] = max([df.iloc[j]['Col2'] for j in range(i, i + 2)])
I think the code works correctly but since my dataset is very large it takes too much time to run. Is there a more elegant and computationally friendly solution?
Answers:
Yes, there is a more efficient way to achieve the same result using the rolling and max functions in pandas. Here’s an example:
import pandas as pd
# Create the sample dataframe
data = {'Col1': [0.3, 0.22, 0.89, 0.12, 0.54, 0.11], 'Col2': [1, 0, 0, 1, 0, 1]}
df = pd.DataFrame(data)
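# For each row i, take the max over the NEXT two rows (i+1 and i+2).
# rolling(2) ends at the current row, so shift(-2) moves the window covering
# rows i+1..i+2 onto row i. The last two rows become NaN after the shift and
# are restored from the original column by fillna.
df['Col2'] = df['Col2'].rolling(2).max().shift(-2).fillna(df['Col2']).astype(int)
print(df)
This code creates a rolling window of size 2 on Col2, takes the maximum within each window, and shifts the resulting series up by 2 rows, so that row i holds the maximum of rows i+1 and i+2. A window of size 3 would also include row i itself, which would turn the first row into 1 instead of the expected 0. The last two rows have no complete window after the shift, so fillna restores them from the original column, and astype(int) converts the float result of rolling back to integers. Because the last two rows are left unchanged, the second-to-last row keeps its original value of 0.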
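The same rule can also be written with two explicit shifts, which some readers may find easier to check against the wording of the question. This is only a sketch on the sample data from the question; the variable names next1, next2 and updated are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Col1": [0.3, 0.22, 0.89, 0.12, 0.54, 0.11],
    "Col2": [1, 0, 0, 1, 0, 1],
})

original = df["Col2"]

# Value one row ahead and two rows ahead; rows past the end become NaN.
next1 = original.shift(-1)
next2 = original.shift(-2)

# Element-wise maximum of the two look-ahead series. np.maximum propagates
# NaN, so the last two rows (which lack a full look-ahead) stay NaN here.
updated = np.maximum(next1, next2)

# Restore the last two rows from the original column and go back to integers.
df["Col2"] = updated.fillna(original).astype(int)
print(df)
```

Using np.maximum rather than a row-wise max that skips NaN is a deliberate choice here: it keeps the second-to-last row untouched, as the question asks, instead of computing it from the single remaining sample.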
This approach should be much faster than a for loop on large datasets, because the rolling maximum is computed in vectorized code instead of row by row in Python.
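Before switching a large dataset over to a vectorized version, it can help to compare it against a plain loop that follows the rule literally. The sketch below does that on random binary data; the function names next_two_loop and next_two_rolling are hypothetical, and the vectorized variant assumes a window of 2 shifted up by 2 rows so that each row looks only at the two rows after it.

```python
import numpy as np
import pandas as pd


def next_two_loop(col: pd.Series) -> pd.Series:
    """Reference implementation: literal loop over the rule (slow, for checking only)."""
    original = col.to_numpy()
    values = original.copy()
    # Every row except the last two takes the max of the two rows after it.
    for i in range(len(original) - 2):
        values[i] = max(original[i + 1], original[i + 2])
    return pd.Series(values, index=col.index)


def next_two_rolling(col: pd.Series) -> pd.Series:
    """Vectorized variant: rolling max over 2 rows, shifted so row i sees rows i+1 and i+2."""
    return col.rolling(2).max().shift(-2).fillna(col).astype(col.dtype)


# Random binary column standing in for a larger dataset
rng = np.random.default_rng(0)
col = pd.Series(rng.integers(0, 2, size=1_000))

# Both implementations should agree on every row, including the last two.
assert next_two_loop(col).equals(next_two_rolling(col))
```

If the assertion passes, the loop can be dropped and only the vectorized function kept.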