Pandas groupby, resample, return NaN not 0

Question:

I have the following dataframe:

data = {"timestamp": ["2022-12-15 22:00:00", "2022-12-15 22:00:30", "2022-12-15 22:00:47", 
                        "2022-12-15 22:00:03", "2022-12-15 22:00:30", "2022-12-15 22:00:43", 
                        "2022-12-15 22:00:10", "2022-12-15 22:00:34", "2022-12-15 22:00:59"],
        "ID": ["A","A","A",
                "B", "B", "B",
                "C", "C", "C"],
        "value": [11, 0, 0,
                    7, 5, 7,
                    0, 3.4, 3.4]
    }

df_test = pd.DataFrame(data, columns=["timestamp", "ID", "value"])
df_test["timestamp"] = pd.to_datetime(df_test["timestamp"])

I want to create a new dataframe which, for every ID, has one row per second from "2022-12-15 22:00:00" up to (but not including) "2022-12-15 22:01:00". The resulting dataframe will therefore have 180 rows (60 for each ID, one row per second in the time interval). For rows whose timestamp matches one in df_test I want the value; otherwise I want NaN.

I have tried using the following code:

df_resampled = df_test.groupby("ID").resample("S", on="timestamp").sum().reset_index()

But this has the problem that rows without a matching timestamp get 0 instead of NaN.

Asked By: andKaae


Answers:

The "value" issue itself could be fixed as follows:

res = (df_test.set_index('timestamp')
       .groupby('ID')
       .resample('S')
       .asfreq()['value']
       .reset_index())

res.shape
# (139, 3) N.B. Wrong start and end!

However, this does not solve a second problem: a plain resample starts and ends at the first and last timestamp of each ID, and in your example these are not always 22:00:00 and 22:00:59.
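To see where the 139 rows come from, it can help to look at the span each ID actually covers. The following is a minimal sketch that rebuilds the question's data; the names `span` and `n_rows` are only illustrative and do not appear in the original code.

```python
import pandas as pd

df_test = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-12-15 22:00:00", "2022-12-15 22:00:30", "2022-12-15 22:00:47",
        "2022-12-15 22:00:03", "2022-12-15 22:00:30", "2022-12-15 22:00:43",
        "2022-12-15 22:00:10", "2022-12-15 22:00:34", "2022-12-15 22:00:59",
    ]),
    "ID": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "value": [11, 0, 0, 7, 5, 7, 0, 3.4, 3.4],
})

# First and last observed timestamp per ID: a per-group resample only
# covers this range, not the full minute.
span = df_test.groupby("ID")["timestamp"].agg(["min", "max"])

# Number of one-second rows per ID, counting both endpoints.
span["n_rows"] = (span["max"] - span["min"]) // pd.Timedelta(seconds=1) + 1

print(span["n_rows"].to_dict())  # {'A': 48, 'B': 41, 'C': 50}
print(span["n_rows"].sum())      # 139
```

Each ID only gets 60 rows when all IDs are aligned to the same fixed range.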

Here’s an alternative approach:

rng = pd.date_range(start='2022-12-15 22:00:00', end='2022-12-15 22:00:59', 
                    freq="S")

multi_index = pd.MultiIndex.from_product([df_test.ID.unique(), rng],
                                         names=['ID', 'timestamp'])

res = df_test.set_index(['ID','timestamp']).reindex(multi_index).reset_index()

# check result
res[(res.value.notna()) | 
    (res.timestamp.isin(['2022-12-15 22:00:00', '2022-12-15 22:00:59']))]

    ID           timestamp  value
0    A 2022-12-15 22:00:00   11.0
30   A 2022-12-15 22:00:30    0.0
47   A 2022-12-15 22:00:47    0.0
59   A 2022-12-15 22:00:59    NaN
60   B 2022-12-15 22:00:00    NaN
63   B 2022-12-15 22:00:03    7.0
90   B 2022-12-15 22:00:30    5.0
103  B 2022-12-15 22:00:43    7.0
119  B 2022-12-15 22:00:59    NaN
120  C 2022-12-15 22:00:00    NaN
130  C 2022-12-15 22:00:10    0.0
154  C 2022-12-15 22:00:34    3.4
179  C 2022-12-15 22:00:59    3.4

# Note that all the zeros are still there, 
# and that each `ID` starts/ends with the correct timestamp

res.shape
# (180, 3)
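
As a side note on the symptom in the question: `sum()` treats an empty bin as a sum over nothing, which is 0, so resampling with `sum()` cannot tell a missing second from a real zero. If aggregation is genuinely needed, passing `min_count=1` is one way to get NaN for empty bins instead. Below is a small sketch on a hypothetical two-point series (not the question's data); it uses the lowercase alias "s" because recent pandas versions prefer it over "S".

```python
import pandas as pd

# Hypothetical series: readings at :00 and :03, nothing in between.
s = pd.Series(
    [11.0, 0.0],
    index=pd.to_datetime(["2022-12-15 22:00:00", "2022-12-15 22:00:03"]),
)

zeros = s.resample("s").sum()            # empty seconds become 0.0
nans = s.resample("s").sum(min_count=1)  # empty seconds become NaN

print(zeros.tolist())  # [11.0, 0.0, 0.0, 0.0]
print(nans.tolist())   # [11.0, nan, nan, 0.0]
```

The reindex approach never aggregates, which is why it keeps real zeros and missing seconds apart without any extra argument.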
Answered By: ouroboros1
  1. Get unique ‘ID’

  2. Make a temporary (tmp) DataFrame with the first and last time ('2022-12-15 22:00:00', '2022-12-15 22:00:59')

  3. Merge df_test and tmp

  4. Resample

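     ids = list(df_test['ID'].unique())

     # Boundary rows: one at the start and one at the end of the window per ID.
     # Explicit bounds avoid relying on the row order of df_test.
     start = pd.Timestamp('2022-12-15 22:00:00')
     end = pd.Timestamp('2022-12-15 22:00:59')
     tmp = pd.DataFrame({'ID': ids * 2,
                         'timestamp': [start] * len(ids) + [end] * len(ids)})

     # The outer merge adds the boundary rows (with NaN values) where missing.
     tmp = pd.merge_ordered(df_test, tmp, on=['ID', 'timestamp'])

     result = (tmp.set_index('timestamp')
               .groupby('ID')
               .resample('S')
               .asfreq()['value']
               .reset_index())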
    

result:

    ID           timestamp  value
0    A 2022-12-15 22:00:00   11.0
1    A 2022-12-15 22:00:01    NaN
2    A 2022-12-15 22:00:02    NaN
3    A 2022-12-15 22:00:03    NaN
4    A 2022-12-15 22:00:04    NaN
5    A 2022-12-15 22:00:05    NaN
6    A 2022-12-15 22:00:06    NaN
7    A 2022-12-15 22:00:07    NaN
8    A 2022-12-15 22:00:08    NaN
9    A 2022-12-15 22:00:09    NaN
10   A 2022-12-15 22:00:10    NaN
11   A 2022-12-15 22:00:11    NaN
12   A 2022-12-15 22:00:12    NaN
13   A 2022-12-15 22:00:13    NaN
14   A 2022-12-15 22:00:14    NaN
15   A 2022-12-15 22:00:15    NaN
16   A 2022-12-15 22:00:16    NaN
17   A 2022-12-15 22:00:17    NaN
18   A 2022-12-15 22:00:18    NaN
19   A 2022-12-15 22:00:19    NaN
20   A 2022-12-15 22:00:20    NaN
21   A 2022-12-15 22:00:21    NaN
22   A 2022-12-15 22:00:22    NaN
23   A 2022-12-15 22:00:23    NaN
24   A 2022-12-15 22:00:24    NaN
25   A 2022-12-15 22:00:25    NaN
26   A 2022-12-15 22:00:26    NaN
27   A 2022-12-15 22:00:27    NaN
28   A 2022-12-15 22:00:28    NaN
29   A 2022-12-15 22:00:29    NaN
30   A 2022-12-15 22:00:30    0.0
31   A 2022-12-15 22:00:31    NaN
32   A 2022-12-15 22:00:32    NaN
33   A 2022-12-15 22:00:33    NaN
34   A 2022-12-15 22:00:34    NaN
35   A 2022-12-15 22:00:35    NaN
36   A 2022-12-15 22:00:36    NaN
37   A 2022-12-15 22:00:37    NaN
38   A 2022-12-15 22:00:38    NaN
39   A 2022-12-15 22:00:39    NaN
40   A 2022-12-15 22:00:40    NaN
41   A 2022-12-15 22:00:41    NaN
42   A 2022-12-15 22:00:42    NaN
43   A 2022-12-15 22:00:43    NaN
44   A 2022-12-15 22:00:44    NaN
45   A 2022-12-15 22:00:45    NaN
46   A 2022-12-15 22:00:46    NaN
47   A 2022-12-15 22:00:47    0.0
48   A 2022-12-15 22:00:48    NaN
49   A 2022-12-15 22:00:49    NaN
50   A 2022-12-15 22:00:50    NaN
51   A 2022-12-15 22:00:51    NaN
52   A 2022-12-15 22:00:52    NaN
53   A 2022-12-15 22:00:53    NaN
54   A 2022-12-15 22:00:54    NaN
55   A 2022-12-15 22:00:55    NaN
56   A 2022-12-15 22:00:56    NaN
57   A 2022-12-15 22:00:57    NaN
58   A 2022-12-15 22:00:58    NaN
59   A 2022-12-15 22:00:59    NaN
60   B 2022-12-15 22:00:00    NaN
61   B 2022-12-15 22:00:01    NaN
62   B 2022-12-15 22:00:02    NaN
63   B 2022-12-15 22:00:03    7.0
64   B 2022-12-15 22:00:04    NaN
65   B 2022-12-15 22:00:05    NaN
66   B 2022-12-15 22:00:06    NaN
67   B 2022-12-15 22:00:07    NaN
68   B 2022-12-15 22:00:08    NaN
69   B 2022-12-15 22:00:09    NaN
70   B 2022-12-15 22:00:10    NaN
71   B 2022-12-15 22:00:11    NaN
72   B 2022-12-15 22:00:12    NaN
73   B 2022-12-15 22:00:13    NaN
74   B 2022-12-15 22:00:14    NaN
75   B 2022-12-15 22:00:15    NaN
76   B 2022-12-15 22:00:16    NaN
77   B 2022-12-15 22:00:17    NaN
78   B 2022-12-15 22:00:18    NaN
79   B 2022-12-15 22:00:19    NaN
80   B 2022-12-15 22:00:20    NaN
81   B 2022-12-15 22:00:21    NaN
82   B 2022-12-15 22:00:22    NaN
83   B 2022-12-15 22:00:23    NaN
84   B 2022-12-15 22:00:24    NaN
85   B 2022-12-15 22:00:25    NaN
86   B 2022-12-15 22:00:26    NaN
87   B 2022-12-15 22:00:27    NaN
88   B 2022-12-15 22:00:28    NaN
89   B 2022-12-15 22:00:29    NaN
90   B 2022-12-15 22:00:30    5.0
91   B 2022-12-15 22:00:31    NaN
92   B 2022-12-15 22:00:32    NaN
93   B 2022-12-15 22:00:33    NaN
94   B 2022-12-15 22:00:34    NaN
95   B 2022-12-15 22:00:35    NaN
96   B 2022-12-15 22:00:36    NaN
97   B 2022-12-15 22:00:37    NaN
98   B 2022-12-15 22:00:38    NaN
99   B 2022-12-15 22:00:39    NaN
100  B 2022-12-15 22:00:40    NaN
101  B 2022-12-15 22:00:41    NaN
102  B 2022-12-15 22:00:42    NaN
103  B 2022-12-15 22:00:43    7.0
104  B 2022-12-15 22:00:44    NaN
105  B 2022-12-15 22:00:45    NaN
106  B 2022-12-15 22:00:46    NaN
107  B 2022-12-15 22:00:47    NaN
108  B 2022-12-15 22:00:48    NaN
109  B 2022-12-15 22:00:49    NaN
110  B 2022-12-15 22:00:50    NaN
111  B 2022-12-15 22:00:51    NaN
112  B 2022-12-15 22:00:52    NaN
113  B 2022-12-15 22:00:53    NaN
114  B 2022-12-15 22:00:54    NaN
115  B 2022-12-15 22:00:55    NaN
116  B 2022-12-15 22:00:56    NaN
117  B 2022-12-15 22:00:57    NaN
118  B 2022-12-15 22:00:58    NaN
119  B 2022-12-15 22:00:59    NaN
120  C 2022-12-15 22:00:00    NaN
121  C 2022-12-15 22:00:01    NaN
122  C 2022-12-15 22:00:02    NaN
123  C 2022-12-15 22:00:03    NaN
124  C 2022-12-15 22:00:04    NaN
125  C 2022-12-15 22:00:05    NaN
126  C 2022-12-15 22:00:06    NaN
127  C 2022-12-15 22:00:07    NaN
128  C 2022-12-15 22:00:08    NaN
129  C 2022-12-15 22:00:09    NaN
130  C 2022-12-15 22:00:10    0.0
131  C 2022-12-15 22:00:11    NaN
132  C 2022-12-15 22:00:12    NaN
133  C 2022-12-15 22:00:13    NaN
134  C 2022-12-15 22:00:14    NaN
135  C 2022-12-15 22:00:15    NaN
136  C 2022-12-15 22:00:16    NaN
137  C 2022-12-15 22:00:17    NaN
138  C 2022-12-15 22:00:18    NaN
139  C 2022-12-15 22:00:19    NaN
140  C 2022-12-15 22:00:20    NaN
141  C 2022-12-15 22:00:21    NaN
142  C 2022-12-15 22:00:22    NaN
143  C 2022-12-15 22:00:23    NaN
144  C 2022-12-15 22:00:24    NaN
145  C 2022-12-15 22:00:25    NaN
146  C 2022-12-15 22:00:26    NaN
147  C 2022-12-15 22:00:27    NaN
148  C 2022-12-15 22:00:28    NaN
149  C 2022-12-15 22:00:29    NaN
150  C 2022-12-15 22:00:30    NaN
151  C 2022-12-15 22:00:31    NaN
152  C 2022-12-15 22:00:32    NaN
153  C 2022-12-15 22:00:33    NaN
154  C 2022-12-15 22:00:34    3.4
155  C 2022-12-15 22:00:35    NaN
156  C 2022-12-15 22:00:36    NaN
157  C 2022-12-15 22:00:37    NaN
158  C 2022-12-15 22:00:38    NaN
159  C 2022-12-15 22:00:39    NaN
160  C 2022-12-15 22:00:40    NaN
161  C 2022-12-15 22:00:41    NaN
162  C 2022-12-15 22:00:42    NaN
163  C 2022-12-15 22:00:43    NaN
164  C 2022-12-15 22:00:44    NaN
165  C 2022-12-15 22:00:45    NaN
166  C 2022-12-15 22:00:46    NaN
167  C 2022-12-15 22:00:47    NaN
168  C 2022-12-15 22:00:48    NaN
169  C 2022-12-15 22:00:49    NaN
170  C 2022-12-15 22:00:50    NaN
171  C 2022-12-15 22:00:51    NaN
172  C 2022-12-15 22:00:52    NaN
173  C 2022-12-15 22:00:53    NaN
174  C 2022-12-15 22:00:54    NaN
175  C 2022-12-15 22:00:55    NaN
176  C 2022-12-15 22:00:56    NaN
177  C 2022-12-15 22:00:57    NaN
178  C 2022-12-15 22:00:58    NaN
179  C 2022-12-15 22:00:59    3.4
Answered By: Evgeny Romensky

This seems to work.

import pandas as pd

data = {"timestamp": ["2022-12-15 22:00:00", "2022-12-15 22:00:30", "2022-12-15 22:00:47", 
                    "2022-12-15 22:00:03", "2022-12-15 22:00:30", "2022-12-15 22:00:43", 
                    "2022-12-15 22:00:10", "2022-12-15 22:00:34", "2022-12-15 22:00:59"],
    "ID": ["A","A","A",
            "B", "B", "B",
            "C", "C", "C"],
    "value": [11, 0, 0,
                7, 5, 7,
                0, 3.4, 3.4]
}

df = pd.DataFrame(data, columns=["timestamp", "ID", "value"])
df["timestamp"] = pd.to_datetime(df["timestamp"])

time_range = pd.date_range('2022-12-15 22:00:00', '2022-12-15 22:00:59', freq="S")

# Reindex each ID's values against the full one-second range;
# seconds without a reading become NaN.
df = (df.set_index('timestamp')
    .groupby('ID')['value']
    .apply(lambda x: x.reindex(time_range))
    .rename_axis(['ID', 'timestamp'])  # name the otherwise unnamed datetime level
    .reset_index())

yields:

    ID           timestamp  value
0    A 2022-12-15 22:00:00   11.0
1    A 2022-12-15 22:00:01    NaN
2    A 2022-12-15 22:00:02    NaN
3    A 2022-12-15 22:00:03    NaN
4    A 2022-12-15 22:00:04    NaN
..  ..                 ...    ...
175  C 2022-12-15 22:00:55    NaN
176  C 2022-12-15 22:00:56    NaN
177  C 2022-12-15 22:00:57    NaN
178  C 2022-12-15 22:00:58    NaN
179  C 2022-12-15 22:00:59    3.4
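
The per-ID reindex above and the MultiIndex reindex from the earlier answer should give the same values. A quick way to convince yourself is the check below. It is only a sketch: the names `grid`, `by_product`, `by_group` and `same` are made up for the comparison, and the lowercase alias "s" is used because recent pandas versions prefer it over "S".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-12-15 22:00:00", "2022-12-15 22:00:30", "2022-12-15 22:00:47",
        "2022-12-15 22:00:03", "2022-12-15 22:00:30", "2022-12-15 22:00:43",
        "2022-12-15 22:00:10", "2022-12-15 22:00:34", "2022-12-15 22:00:59",
    ]),
    "ID": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "value": [11, 0, 0, 7, 5, 7, 0, 3.4, 3.4],
})
rng = pd.date_range("2022-12-15 22:00:00", "2022-12-15 22:00:59", freq="s")

# Approach 1: reindex against the full (ID, timestamp) product.
grid = pd.MultiIndex.from_product([df["ID"].unique(), rng],
                                  names=["ID", "timestamp"])
by_product = df.set_index(["ID", "timestamp"])["value"].reindex(grid)

# Approach 2: reindex each ID's series separately.
by_group = (df.set_index("timestamp")
              .groupby("ID")["value"]
              .apply(lambda s: s.reindex(rng)))

# Compare the raw values, treating NaN as equal to NaN.
same = np.array_equal(by_product.to_numpy(), by_group.to_numpy(),
                      equal_nan=True)
print(len(by_product), len(by_group), same)
```

Both variants need the (ID, timestamp) pairs to be unique, since reindexing does not work on duplicate labels.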
Answered By: gittert