Make multiple means from one column with pandas

Question

With Python 3.10

Sample data:

import pandas as pd

data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9],
        [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18],
        [87,10026, 54]]  
df = pd.DataFrame(data, columns=['column', 'data', 'something'])  
print(df)  
df = df.mask(df == 0).fillna(df.mean())  
print(df)    # <---this works but you will see what I mean about looking off..

Updated Solution:
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
df['ma'] = round(df['data'].rolling(4, 1).apply(lambda x: np.nanmean(x)), 2)
df['final2'] = np.where(df['data'] > 0, df['data'], df['ma'])
print(df)
# it replaces the zeros and NULLS with a value, (sometimes it fits well, sometimes, not so much).

The idea is I have one or more column(s) with bad or missing data.

If I use .fillna(df.mean()) for this it sticks out like a sore thumb.
My Goal is to have a percentage of the total number of elements in the dataframe column to make the new mean from…
I would like to take a len(df)*0.30 (30%) and use divide it in half.
I would collect half the numbers above the index point where the (null/0/bad data) data exists.
I would collect half the numbers below the index where the
These collected elements would be the be used to calculate the missing or bad index point.

This would be more helpful if there were a data set that irregular or had missing bad data

Asked By: Kevin

||

Source

Answer 1

You can take a rolling mean with min periods = 1 to smooth out the data.
or you can do a variant of this method to customise what you want.

Inside the lambda i used this np.nanmean(x).

import pandas as pd
import numpy as np

data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9],
        [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18],
        [87,10026, 54]]  
df = pd.DataFrame(data, columns=['column', 'data', 'something'])  
df['ma'] = df['data'].rolling(3,1).apply(lambda x : np.nanmean(x))
df['final'] = np.where(df['data'] >= 0, df['data'], df['ma'])

print(df)

result:

    column     data  something            ma    final
0        1  14890.0          3  14890.000000  14890.0
1        4      5.0          6   7447.500000      5.0
2        7      8.0          9   4967.666667      8.0
3       11     13.0         14      8.666667     13.0
4       12      0.0         18      7.000000      0.0
5       87      NaN         54      6.500000      6.5
6        1      0.0          3      0.000000      0.0
7        4      5.0          6      2.500000      5.0
8        7      8.0          9      4.333333      8.0
9       11     13.0         14      8.666667     13.0
10      12      0.0         18      7.000000      0.0
11      87      NaN         54      6.500000      6.5
12       1      0.0          3      0.000000      0.0
13       4      5.0          6      2.500000      5.0
14       7      8.0          9      4.333333      8.0
15      11     13.0         14      8.666667     13.0
16      12      0.0         18      7.000000      0.0
17      87  10026.0         54   3346.333333  10026.0

Answered By: D.L

Make multiple means from one column with pandas

Question:

With Python 3.10

Sample data:

The idea is I have one or more column(s) with bad or missing data.

If I use .fillna(df.mean()) for this it sticks out like a sore thumb.

Answers: