Fill non-consecutive missings with consecutive numbers

Question:

For a given data frame…

data = pd.DataFrame([[1., 6.5], [1., np.nan],[5, 3], [6.5, 3.], [2, np.nan]])

that looks like this…

    0       1
0   1.0     6.5
1   1.0     NaN
2   5.0     3.0
3   6.5     3.0
4   2.0     NaN

…I want to create a third column where all missings of the second column are replaced with consecutive numbers. So the result should look like this:

    0       1     2
0   1.0     6.5   NaN
1   1.0     NaN   1
2   5.0     3.0   NaN
3   6.5     3.0   NaN
4   2.0     NaN   2

(my data frame has many more rows, so imagine 70 missings in the second column, in which case the last number in the 3rd column would be 70)

How can I create the 3rd column?

Asked By: tobip


Answers:

You can do it this way. I took the liberty of renaming the columns to avoid confusion about what I am selecting; you can do the same with your dataframe using:

data = data.rename(columns={0:'a',1:'b'})

In [41]:

data.merge(pd.DataFrame({'c':range(1,len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index),how='left', left_index=True, right_index=True)
Out[41]:
     a    b   c
0  1.0  6.5 NaN
1  1.0  NaN   1
2  5.0  3.0 NaN
3  6.5  3.0 NaN
4  2.0  NaN   2

[5 rows x 3 columns]

Some explanation of the one-liner:

# we want just the rows where column 'b' is null:
data[data.b.isnull()]

# now construct a dataset of the length of this dataframe starting from 1:
range(1,len(data[data.b.isnull()]) + 1) # note we have to add a 1 at the end

# construct a new dataframe from this and crucially use the index of the null values:
pd.DataFrame({'c':range(1,len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index)

# now perform a left merge using both sides' indices; I've replaced the verbose
# dataframe construction with new_df here, but you get the point
data.merge(new_df,how='left', left_index=True, right_index=True)
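Putting those pieces together, a self-contained version of the merge approach (using the renamed frame from above) might look like:

```python
import numpy as np
import pandas as pd

# Reproduce the question's frame with the renamed columns
data = pd.DataFrame([[1., 6.5], [1., np.nan], [5, 3], [6.5, 3.], [2, np.nan]])
data = data.rename(columns={0: 'a', 1: 'b'})

# Rows where column 'b' is null
nulls = data[data.b.isnull()]

# Counter frame 1..n, indexed by the positions of the null rows
new_df = pd.DataFrame({'c': range(1, len(nulls) + 1)}, index=nulls.index)

# Left merge on both indices, so non-null rows get NaN in 'c'
result = data.merge(new_df, how='left', left_index=True, right_index=True)
```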

Edit

You can also do it another way using @Karl.D’s suggestion:

In [56]:

data['c'] = data['b'].isnull().cumsum().where(data['b'].isnull())
data
Out[56]:
     a    b   c
0  1.0  6.5 NaN
1  1.0  NaN   1
2  5.0  3.0 NaN
3  6.5  3.0 NaN
4  2.0  NaN   2

[5 rows x 3 columns]
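For reference, the `cumsum`/`where` one-liner can be unpacked into steps; a sketch using the question's original integer column labels:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5], [1., np.nan], [5, 3], [6.5, 3.], [2, np.nan]])

mask = data[1].isnull()          # True where column 1 is missing: F, T, F, F, T
counts = mask.cumsum()           # running count of missings: 0, 1, 1, 1, 2
data['c'] = counts.where(mask)   # keep the count only on missing rows, NaN elsewhere
```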

Timings also suggest that Karl's method would be faster for larger datasets, but I would profile this:

In [57]:

%timeit data.merge(pd.DataFrame({'c':range(1,len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index),how='left', left_index=True, right_index=True)
%timeit data['c'] = data['b'].isnull().cumsum().where(data['b'].isnull())
1000 loops, best of 3: 1.31 ms per loop
1000 loops, best of 3: 501 µs per loop
Answered By: EdChum
def function1(dd: pd.DataFrame):
    # dd.name is the group key (True for the group of null rows), so only
    # those rows get the running counter
    return dd.assign(col2=dd[1].isna().cumsum()) if dd.name else dd

data.groupby(data[1].isna(), group_keys=False).apply(function1)

out:

     0    1  col2
0  1.0  6.5   NaN
1  1.0  NaN   1.0
2  5.0  3.0   NaN
3  6.5  3.0   NaN
4  2.0  NaN   2.0
Answered By: G.G