Merge two rows in the same DataFrame if their index is the same?
Question:
I have created a large DataFrame by pulling data from an Azure database. Constructing the DataFrame wasn’t simple, as I had to do it in parts, using the concat function to add new columns to the data set as they were pulled from the database.
This worked fine; however, I am indexing by entry date, and when concatenating I sometimes get two data rows with the same index. Is it possible for me to merge lines with the same index? I have searched online for solutions, but I always come across examples trying to merge two separate dataframes rather than merging rows within the same dataframe.
In summary:
This
Col1 Col2
2015-10-27 22:22:31 1400
2015-10-27 22:22:31 50.5
To this
Col1 Col2
2015-10-27 22:22:31 1400 50.5
I have tried using the groupby function on the index, but that just messed things up: most of the data columns disappeared and a few very large numbers were spat out.
Note:
The data is in this sort of format, except with many more columns, and it is generally quite sparse!
Col1 Col2 ... Col_n-1 Col_n
2015-10-27 21:15:60+0 1220
2015-10-27 21:25:4+0 1420
2015-10-27 21:28:8+0 1410
2015-10-27 21:37:10+0 51.5
2015-10-27 21:37:11+0 1500
2015-10-27 21:46:14+0 51
2015-10-27 21:46:15+0 1390
2015-10-27 21:55:19+0 1370
2015-10-27 22:04:24+0 1450
2015-10-27 22:13:28+0 1350
2015-10-27 22:22:31+0 1400
2015-10-27 22:22:31+0 50.5
2015-10-27 22:25:33+0 1300
2015-10-27 22:29:42+0 ... 1900
2015-10-27 22:29:42+0 63
2015-10-27 22:34:36+0 1280
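For what it’s worth, the duplicate-index situation can be reproduced with a plain concat of two partial pulls; the column names and values below are made up to illustrate, not the real database fields:
import pandas as pd

# Two partial pulls that happen to share a timestamp in their indexes.
a = pd.DataFrame({'Col1': [1400]},
                 index=pd.to_datetime(['2015-10-27 22:22:31']))
b = pd.DataFrame({'Col2': [50.5]},
                 index=pd.to_datetime(['2015-10-27 22:22:31']))

# concat stacks the rows rather than aligning them on the index,
# so the shared timestamp appears twice:
df = pd.concat([a, b])
#                      Col1  Col2
# 2015-10-27 22:22:31  1400   NaN
# 2015-10-27 22:22:31   NaN  50.5
The goal is to collapse those two rows into one.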
Answers:
For anyone interested – I ended up writing my own function to:
- step through the dataframe row by row
- record the indexes of rows that need merging
- aggregate or average the values across each set of rows
- delete all but one row of each set that needed merging, replacing that row’s values with the aggregates or averages (depending on what I needed)
code:
import numpy as np
import pandas as pd

def groupDataOnTimeBlock(data, timeBlock_type, timeBlock_factor):
    '''
    Filter DataFrame to merge lines which are within the same time block,
    i.e. part of the same x number of seconds, minutes, weeks...

    data:
        DataFrame to filter.
    timeBlock_type:
        Time period with which to group data rows. This can be data per:
        SECONDS, MINUTES, MILLISECONDS, WEEKS
    timeBlock_factor:
        Number of timeBlock types to group on.
    '''
    pd.options.mode.chained_assignment = None  # default='warn'
    tBt = timeBlock_type.upper()
    tBf = timeBlock_factor
    if tBt in ('SEC', 'SECOND', 'SECONDS'):
        roundType = 'SECONDS'
    elif tBt in ('MIN', 'MINS', 'MINUTES'):
        roundType = 'MINUTES'
    elif tBt in ('MILLI', 'MILLISECONDS'):
        roundType = 'MILLISECONDS'
    elif tBt in ('WEEK', 'WEEKS'):
        roundType = 'WEEKS'
    else:
        raise ValueError('Invalid time block type entered')
    numElements = len(data.columns)
    # timeStampReformat is a separate helper (not shown here) that rounds a
    # timestamp down to the start of its time block.
    anchorValue = timeStampReformat(data.iloc[1, len(data.columns) - 7], roundType, tBf)
    delIndex = []
    mergeCount = 0
    av_agg_arr = np.zeros([1, numElements], dtype=float)
    # Cycle through the dataframe to build averages and note which rows to delete
    for i, row in data.iterrows():  # i is the index value, from 0
        backDate = timeStampReformat(row['Timestamp'], roundType, tBf)
        data.loc[i, 'Timestamp'] = backDate  # can be done better; not all rows need updating
        if backDate > anchorValue:  # a new time block starts at this row
            if delIndex:
                delIndex.pop()  # keep the last row of the previous block; it will hold the aggregates
            delIndex.append(i)  # flag the current row so that it isn't missed
            print('collate')  # debug
            if mergeCount != 0:
                av_agg_arr = av_agg_arr / mergeCount
            for idx in range(1, numElements - 1):
                if isinstance(row.values[idx], float):
                    # write the averages into the previous row (i - 1), the last of the prior time block
                    data.iloc[i - 1, idx] = av_agg_arr[0, idx]
            anchorValue = backDate
            mergeCount = 0
            # Re-initialise the aggregates and pass in the current row values.
            av_agg_arr = av_agg_arr - av_agg_arr
            for idx in range(1, numElements - 1):
                if isinstance(row.values[idx], float):
                    if not pd.isnull(row.values[idx]):
                        av_agg_arr[0, idx] += row.values[idx]
        else:  # the row is still part of the same time block
            for idx in range(1, numElements - 1):
                if isinstance(row.values[idx], float):
                    if not pd.isnull(row.values[idx]):
                        av_agg_arr[0, idx] += row.values[idx]
            mergeCount += 1
            delIndex.append(i)  # flag this row for deletion
    data.drop(data.index[delIndex], inplace=True)  # delete all flagged rows
    data = data.reset_index(drop=True)  # reset_index returns a copy, so reassign
    pd.options.mode.chained_assignment = 'warn'  # restore the default
    return data
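As an aside: if the timestamps live in a Timestamp column as above, pandas’ own time grouping gets similar behaviour more idiomatically. A minimal sketch, assuming a pandas frequency string such as '10s' stands in for timeBlock_type/timeBlock_factor and that averaging is the aggregation wanted:
import pandas as pd

def group_on_time_block(data, freq):
    '''Average numeric values of rows whose Timestamp falls in the same freq window.'''
    out = (data.groupby(pd.Grouper(key='Timestamp', freq=freq))
               .mean(numeric_only=True)
               .dropna(how='all'))  # drop windows with no data at all
    return out.reset_index()

# e.g. group_on_time_block(df, '10s') merges rows within each 10-second block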
Building on @EdChum’s answer, it is also possible to use the min_count parameter of GroupBy.sum to manage NaN values in different ways. Let’s say we add an additional row to the example:
Col1 Col2
2015-10-27 22:22:31 1400 NaN
2015-10-27 22:22:31 NaN 50.5
2022-08-02 16:00:00 1600 NaN
then,
In [184]:
df.groupby('index').sum(min_count=1)
Out[184]:
Col1 Col2
index
2015-10-27 22:22:31 1400 50.5
2022-08-02 16:00:00 1600 NaN
Using min_count=0 will output 0 instead of NaN values.
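For completeness, here is that example as a self-contained snippet, grouping on the index level directly rather than on an 'index' column:
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2015-10-27 22:22:31',
                      '2015-10-27 22:22:31',
                      '2022-08-02 16:00:00'])
df = pd.DataFrame({'Col1': [1400, np.nan, 1600],
                   'Col2': [np.nan, 50.5, np.nan]}, index=idx)

# min_count=1: a group needs at least one non-NA value, otherwise NaN
print(df.groupby(level=0).sum(min_count=1))
#                        Col1  Col2
# 2015-10-27 22:22:31  1400.0  50.5
# 2022-08-02 16:00:00  1600.0   NaN

# min_count=0 (the default): all-NA groups sum to 0 instead
print(df.groupby(level=0).sum(min_count=0))
#                        Col1  Col2
# 2015-10-27 22:22:31  1400.0  50.5
# 2022-08-02 16:00:00  1600.0   0.0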