Fill non-consecutive missing values with consecutive numbers
Question:
For a given data frame…
data = pd.DataFrame([[1., 6.5], [1., np.nan],[5, 3], [6.5, 3.], [2, np.nan]])
that looks like this…
     0    1
0  1.0  6.5
1  1.0  NaN
2  5.0  3.0
3  6.5  3.0
4  2.0  NaN
…I want to create a third column in which every missing value of the second column is replaced with a consecutive number. So the result should look like this:
     0    1    2
0  1.0  6.5  NaN
1  1.0  NaN    1
2  5.0  3.0  NaN
3  6.5  3.0  NaN
4  2.0  NaN    2
(My data frame has many more rows, so imagine 70 missing values in the second column, so that the last number in the third column would be 70.)
How can I create the 3rd column?
Answers:
You can do it with a merge. I took the liberty of renaming the columns to make it clearer which one I am selecting; you can do the same with your dataframe using:
data = data.rename(columns={0:'a',1:'b'})
In [41]:
data.merge(pd.DataFrame({'c':range(1,len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index),how='left', left_index=True, right_index=True)
Out[41]:
     a    b    c
0  1.0  6.5  NaN
1  1.0  NaN    1
2  5.0  3.0  NaN
3  6.5  3.0  NaN
4  2.0  NaN    2
[5 rows x 3 columns]
Here is an explanation of the one-liner:
# we want just the rows where column 'b' is null:
data[data.b.isnull()]
# now build a sequence starting from 1 with one number per null row:
range(1, len(data[data.b.isnull()]) + 1) # add 1 because range() excludes its end point
# construct a new dataframe from this and, crucially, use the index of the null rows:
new_df = pd.DataFrame({'c': range(1, len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index)
# now perform a left merge on both sides' indices, so every row of data is kept and rows without a match get NaN in 'c'
data.merge(new_df, how='left', left_index=True, right_index=True)
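The steps above can be put together as a small runnable script. This is only a sketch of the same merge idea; the names `mask` and `merged` are introduced here for readability and are not part of the original answer.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5], [1., np.nan], [5, 3], [6.5, 3.], [2, np.nan]])
data = data.rename(columns={0: 'a', 1: 'b'})

# Boolean mask marking the rows where 'b' is missing (hypothetical helper name).
mask = data['b'].isnull()

# One counter value per missing row, indexed by the labels of those rows.
new_df = pd.DataFrame({'c': range(1, mask.sum() + 1)}, index=data.index[mask])

# Left merge on the index: every row of data is kept, unmatched rows get NaN in 'c'.
merged = data.merge(new_df, how='left', left_index=True, right_index=True)
print(merged)
```

Computing the mask once avoids repeating `data[data.b.isnull()]` three times, which is the main readability cost of the one-liner.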
Edit
You can also do it another way using @Karl.D’s suggestion:
In [56]:
data['c'] = data['b'].isnull().cumsum().where(data['b'].isnull())
data
Out[56]:
     a    b    c
0  1.0  6.5  NaN
1  1.0  NaN    1
2  5.0  3.0  NaN
3  6.5  3.0  NaN
4  2.0  NaN    2
[5 rows x 3 columns]
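To see why the `cumsum`/`where` version works, it may help to look at the intermediate values. The sketch below splits the one-liner into named steps (`is_missing` and `running` are names made up for this illustration). The last line is an optional extra, not part of the suggestion above: it assumes you would prefer whole numbers and uses pandas' nullable `Int64` dtype for that.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1., 1., 5., 6.5, 2.],
                     'b': [6.5, np.nan, 3., 3., np.nan]})

# True where 'b' is missing: False, True, False, False, True
is_missing = data['b'].isnull()

# Running count of missing values seen so far: 0, 1, 1, 1, 2
running = is_missing.cumsum()

# Keep the count only on the missing rows; all other rows become NaN.
data['c'] = running.where(is_missing)

# Optional (assumption: integers are wanted): nullable integer dtype keeps
# the gaps as <NA> instead of forcing the column to float.
data['c_int'] = data['c'].astype('Int64')
print(data)
```

Because `where` introduces NaN, column `c` is stored as floats; `c_int` holds the same numbers as integers.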
Timings on this five-row frame suggest that Karl’s method is faster (roughly 2.6x here). I would expect the same on larger datasets, but I would profile that on your own data:
In [57]:
%timeit data.merge(pd.DataFrame({'c':range(1,len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index),how='left', left_index=True, right_index=True)
%timeit data['c'] = data['b'].isnull().cumsum().where(data['b'].isnull())
1000 loops, best of 3: 1.31 ms per loop
1000 loops, best of 3: 501 µs per loop
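If you want to repeat the comparison outside IPython on a larger frame, something like the following could serve as a starting point. Everything here is an assumption made for illustration: the frame size, the share of missing values, and the function names `with_merge` and `with_cumsum`. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 100,000 rows with roughly 30% missing values in 'b'.
rng = np.random.default_rng(0)
n = 100_000
b = rng.random(n)
b[rng.random(n) < 0.3] = np.nan
big = pd.DataFrame({'a': rng.random(n), 'b': b})


def with_merge(df):
    """Number the missing rows of 'b' via an index-aligned left merge."""
    mask = df['b'].isnull()
    new_df = pd.DataFrame({'c': range(1, mask.sum() + 1)}, index=df.index[mask])
    return df.merge(new_df, how='left', left_index=True, right_index=True)


def with_cumsum(df):
    """Number the missing rows of 'b' via cumsum() and where()."""
    mask = df['b'].isnull()
    return df.assign(c=mask.cumsum().where(mask))


# Both approaches should produce the same column before timing them.
same = np.array_equal(with_merge(big)['c'].to_numpy(),
                      with_cumsum(big)['c'].to_numpy(),
                      equal_nan=True)
print("identical results:", same)

for fn in (with_merge, with_cumsum):
    seconds = timeit.timeit(lambda: fn(big), number=10) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

`with_cumsum` returns a new frame through `assign` rather than writing into `big`, so each timed call starts from the same input.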
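Another option is to group on the missing-value mask and number the rows inside the group of missing values:
# df1 is assumed to be the question's frame with string column labels '0' and '1'
def function1(dd: pd.DataFrame):
    # dd.name is the group key: True for the group of rows where column '1' is missing
    return dd.assign(col2=dd['1'].isna().cumsum()) if dd.name else dd
df1.groupby(df1['1'].isna()).apply(function1)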
out:
     0    1  col2
0  1.0  6.5   NaN
1  1.0  NaN   1.0
2  5.0  3.0   NaN
3  6.5  3.0   NaN
4  2.0  NaN   2.0
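A variation on the same grouping idea that avoids `apply` is sketched below. This is not the code from the answer above: it swaps in `GroupBy.cumcount`, and it assumes the question's original integer column labels `0` and `1`. The name `mask` is made up for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1., 6.5], [1., np.nan], [5, 3], [6.5, 3.], [2, np.nan]])

# True where the second column (label 1) is missing.
mask = df1[1].isna()

# cumcount() numbers the rows within each group starting at 0, so add 1,
# then keep the numbers only for the group of missing rows.
df1['col2'] = df1.groupby(mask).cumcount().add(1).where(mask)
print(df1)
```

Since `cumcount` returns a Series aligned to the original index, the row order of `df1` is left unchanged.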