How to quickly fillna with a sequence

Question:

I have a question about how to quickly fillna with a sequence in Python(pandas).I have a dataset like following(the true dataset is longer),

Time Number
t0 NA
t1 NA
t2 NA
t3 0
t4 NA
t5 NA
t6 NA
t7 NA
t8 0
t9 NA

My requirement is to add numbers to N lines before and after non-blank lines, and the sequence range is range(-N,N+1).The interval between any two non-empty rows in the dataset is greater than C(constant), our N will be less than C, so there is no need to consider the coverage problem for the time being. Assuming N=2, the result I need is as follows :

Time Number
t0 NA
t1 -2
t2 -1
t3 0
t4 1
t5 2
t6 -2
t7 -1
t8 0
t9 1

At present, the only way I can think of is to use a loop, but the efficiency is low. Does pandas have such a method to do it quickly?

Asked By: Lucio

||

Answers:

Could we have a bit more information, please?

I notice in your end result t0 is still NA, why is that? Do you still want to keep some values as NaN?

In the example, t3 was 0 originally, and in the result it was still 0, which made sense within the sequence. But what if it was already populated, and not with 0 but another value like 5? Would you want to keep the 5 or assign the 0 from the sequence in that position? And if you kept the 5, would then t4 be 0?

Here’s a few solutions just in case:

To fill in the NaN values with a sequence (skipping populated values)

#!/usr/bin/env python 
import pandas as pd

df = pd.DataFrame([
    {'Time': 't0', 'Number': None},
    {'Time': 't1', 'Number': None},
    {'Time': 't2', 'Number': None},
    {'Time': 't3', 'Number': 0},
    {'Time': 't4', 'Number': None},
    {'Time': 't5', 'Number': None},
])

len_null = df['Number'].isna().sum()
starting_number = -2

df.loc[df['Number'].isna(), 'Number'] = list(range(starting_number, len_null-abs(starting_number)))

print(df)

Will give you this:

  Time  Number
0   t0    -2.0
1   t1    -1.0
2   t2     0.0
3   t3     0.0
4   t4     1.0
5   t5     2.0

To fill in the entire column with a sequence (overwriting populated values)

#!/usr/bin/env python

import pandas as pd

df = pd.DataFrame([
    {'Time': 't0', 'Number': None},
    {'Time': 't1', 'Number': None},
    {'Time': 't2', 'Number': None},
    {'Time': 't3', 'Number': 0},
    {'Time': 't4', 'Number': None},
    {'Time': 't5', 'Number': None},
])

starting_number = -2
df['Number'] = list(range(starting_number, len(df)-abs(starting_number)))
print(df)

Will give you:

  Time  Number
0   t0      -2
1   t1      -1
2   t2       0
3   t3       1
4   t4       2
5   t5       3

To fill in the NAN values with a sequence (skipping the sequence where values are populated)

#!/usr/bin/env python

import pandas as pd

df = pd.DataFrame([
    {'Time': 't0', 'Number': None},
    {'Time': 't1', 'Number': None},
    {'Time': 't2', 'Number': None},
    {'Time': 't3', 'Number': 0},
    {'Time': 't4', 'Number': None},
    {'Time': 't5', 'Number': None},
])

starting_number = -2
new_series = pd.Series(list(range(starting_number, len(df)-abs(starting_number))))
df['Number'] = df['Number'].fillna(new_series)
print(df)

Will give you:

  Time  Number
0   t0    -2.0
1   t1    -1.0
2   t2     0.0
3   t3     0.0
4   t4     2.0
5   t5     3.0

(Notice that 1 would have gone into the t3 slot, but since there was a 0 there already, it just skipped it)

There are still some unknowns in your question, like what happens if the intervals overlap. Here I will consider that a further interval overwrites the previous one (you can do the other way around with a change of code, see second part).

Using rolling, groupby.cumcount, and a mask:

s = df['Number'].notna().shift(-N, fill_value=False)
m = s.rolling(2*N+1, min_periods=1).max().astype(bool)

df['Number2'] = df.groupby(s.cumsum()).cumcount().sub(N).where(m)

NB. I used a slightly different example to show the overlap.

output:

   Time  Number  Number2
 0   t0     NaN      NaN
 1   t1     NaN     -2.0
 2   t2     NaN     -1.0
 3   t3     0.0      0.0
 4   t4     NaN      1.0
 5   t5     NaN     -2.0  # here we have an overlap, use latter value
 6   t6     NaN     -1.0
 7   t7     0.0      0.0
 8   t8     NaN      1.0
 9   t9     NaN      2.0
10  t10     NaN      NaN

priority on first group

s = df['Number'].notna().shift(N, fill_value=False)[::-1]
m = s.rolling(2*N+1, min_periods=1).max().astype(bool)

df['Number3'] = df.groupby(s.cumsum()).cumcount(ascending=False).rsub(N).where(m)

output:

   Time  Number  Number2  Number3
0    t0     NaN      NaN      NaN
1    t1     NaN     -2.0     -2.0
2    t2     NaN     -1.0     -1.0
3    t3     0.0      0.0      0.0
4    t4     NaN      1.0      1.0
5    t5     NaN     -2.0      2.0  # difference in behavior
6    t6     NaN     -1.0     -1.0
7    t7     0.0      0.0      0.0
8    t8     NaN      1.0      1.0
9    t9     NaN      2.0      2.0
10  t10     NaN      NaN      NaN
Answered By: mozway
def function1(dd:pd.DataFrame):
    if dd.iloc[2,1]==0:
        df1.loc[dd.index,'Number']=range(-2,len(dd)-2)

n=2
pd.Series(df1.rolling(2*(n+1),center=True)).apply(function1).pipe(lambda dd:df1)
        
        output
        
             Time  Number
            0   t0     NaN
            1   t1    -2.0
            2   t2    -1.0
            3   t3     0.0
            4   t4     1.0
            5   t5     2.0
            6   t6    -2.0
            7   t7    -1.0
            8   t8     0.0
            9   t9     1.0
Answered By: G.G
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.