python pandas complex merging of two dataframes
Question:
I have an interval (let’s say from 0 to 45) that is split into sub-intervals wherever a value changes. The problem is that I have two such values (Value1 and Value2), each with its own split points. I want to combine them: split the interval at every point where either value changes, and give each resulting sub-interval its Value1 and Value2 (see the examples).
I have two pandas dataframes as follows:
From1 | To1 | Value1 |
---|---|---|
0 | 3 | 1. |
3 | 15 | 2. |
15 | 30 | 1. |
30 | 45 | 3. |

From2 | To2 | Value2 |
---|---|---|
0 | 5 | b) |
5 | 11 | a) |
11 | 30 | c) |
30 | 45 | a) |
I would like to join them to get something like this:
From | To | Value1 | Value2 |
---|---|---|---|
0 | 3 | 1. | b) |
3 | 5 | 2. | b) |
5 | 11 | 2. | a) |
11 | 15 | 2. | c) |
15 | 30 | 1. | c) |
30 | 45 | 3. | a) |
I tried collecting all the values from the From1 and From2 columns to build a single From column, but I don’t know how to continue.
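For anyone who wants to reproduce this, the two frames can be built as below. This is a minimal sketch: the question does not name the frames, so `df1` and `df2` are assumed names (they match the ones used in the first answer), and Value1 is taken as floats and Value2 as strings to match the values shown in the tables.

```python
import pandas as pd

# Frame names are assumed; the question does not name them.
df1 = pd.DataFrame({'From1': [0, 3, 15, 30],
                    'To1': [3, 15, 30, 45],
                    'Value1': [1., 2., 1., 3.]})
df2 = pd.DataFrame({'From2': [0, 5, 11, 30],
                    'To2': [5, 11, 30, 45],
                    'Value2': ['b)', 'a)', 'c)', 'a)']})

# Both frames should cover the same overall interval (0 to 45 here).
print(df1['From1'].min(), df1['To1'].max())
print(df2['From2'].min(), df2['To2'].max())
```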
Answers:
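You can expand each interval into individual rows of a fixed step (here, a step of 1), then use two groupby operations. The first aligns the two frames on each unit interval; the second merges consecutive rows that share the same pair of values: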
    import pandas as pd

    def reindex_int(df):
        # Repeat each row once per unit step in [From, To)
        tmp = df.loc[df.index.repeat(df['To'].sub(df['From']))]
        # Position of each repeated row within its original row
        s = tmp.groupby(level=0).cumcount()
        tmp['From'] += s
        tmp['To'] = tmp['From'] + 1
        return tmp

    out = (pd.concat([reindex_int(df1.rename(columns={'From1': 'From', 'To1': 'To'})),
                      reindex_int(df2.rename(columns={'From2': 'From', 'To2': 'To'}))])
             # One row per unit interval, carrying both Value1 and Value2
             .groupby(['From', 'To'], as_index=False).first()
             # Start a new group whenever Value1 or Value2 changes
             .pipe(lambda d: d.groupby(d[['Value1', 'Value2']]
                                       .ne(d[['Value1', 'Value2']].shift())
                                       .any(axis=1).cumsum())
                              .agg({'From': 'min', 'To': 'max',
                                    'Value1': 'first', 'Value2': 'first'})
                  )
          )
Output:
       From  To  Value1 Value2
    1     0   3     1.0     b)
    2     3   5     2.0     b)
    3     5  11     2.0     a)
    4    11  15     2.0     c)
    5    15  30     1.0     c)
    6    30  45     3.0     a)
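The second groupby relies on a common idiom: compare each row with the previous one, flag the rows where any tracked column changed, and take the cumulative sum of those flags to get a run identifier. A small sketch on a made-up frame (the frame `d` and its values are illustrative, not taken from the question):

```python
import pandas as pd

# Illustrative data: unit intervals that already carry both values.
d = pd.DataFrame({'From': [0, 1, 2, 3, 4],
                  'To': [1, 2, 3, 4, 5],
                  'Value1': [1., 1., 2., 2., 2.],
                  'Value2': ['b)', 'b)', 'b)', 'a)', 'a)']})

cols = ['Value1', 'Value2']
# True where either value differs from the previous row. The first row is
# always True because shift() leaves a missing value there.
changed = d[cols].ne(d[cols].shift()).any(axis=1)
# The cumulative sum turns the flags into run ids: 1, 1, 2, 3, 3
run_id = changed.cumsum()

# Collapse each run to a single interval.
merged = d.groupby(run_id).agg({'From': 'min', 'To': 'max',
                                'Value1': 'first', 'Value2': 'first'})
print(merged)
```

Because the run ids start at 1, the collapsed frame is indexed from 1, which is why the output above starts at index 1 rather than 0.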
Intermediate:
    reindex_int(df1.rename(columns={'From1': 'From', 'To1': 'To'}))

       From  To  Value1
    0     0   1       1
    0     1   2       1
    0     2   3       1
    1     3   4       2
    1     4   5       2
    1     5   6       2
    1     6   7       2
    1     7   8       2
    1     8   9       2
    1     9  10       2
    1    10  11       2
    1    11  12       2
    1    12  13       2
    1    13  14       2
    1    14  15       2
    2    15  16       1
    2    16  17       1
    2    17  18       1
    2    18  19       1
    2    19  20       1
    2    20  21       1
    2    21  22       1
    2    22  23       1
    2    23  24       1
    2    24  25       1
    2    25  26       1
    2    26  27       1
    2    27  28       1
    2    28  29       1
    2    29  30       1
    3    30  31       3
    3    31  32       3
    3    32  33       3
    3    33  34       3
    3    34  35       3
    3    35  36       3
    3    36  37       3
    3    37  38       3
    3    38  39       3
    3    39  40       3
    3    40  41       3
    3    41  42       3
    3    42  43       3
    3    43  44       3
    3    44  45       3
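Here is an alternative way (with df1 and df2 named as in the previous answer): expand each interval into its integer points with explode, merge the two frames on those points, then take the first and last point of each (Value1, Value2) pair. Note that the groupby assumes each pair occurs in a single contiguous stretch.

    ndf = pd.merge(
        df1.assign(t=[range(s, e + 1) for s, e in zip(df1['From1'], df1['To1'])]).explode('t'),
        df2.assign(t=[range(s, e + 1) for s, e in zip(df2['From2'], df2['To2'])]).explode('t'),
    )
    ndf = (ndf.groupby(['Value1', 'Value2'], sort=False)
              .agg(From=('t', 'first'), To=('t', 'last'))
              # Where both frames share a boundary (t = 30 here), spurious pairs
              # appear with identical From/To; keep=False drops all of them
              .drop_duplicates(keep=False)
              .reset_index())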
Output:
       Value1 Value2 From  To
    0     1.0     b)    0   3
    1     2.0     b)    3   5
    2     2.0     a)    5  11
    3     2.0     c)   11  15
    4     1.0     c)   15  30
    5     3.0     a)   30  45
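Both answers above expand the intervals into unit steps, which assumes integer boundaries. The asker's own idea of collecting the From values can also be carried through directly. The following is a sketch, not taken from either answer: it assumes both frames are sorted by their start column, have no gaps, and cover the same overall range. The helper name `merge_intervals` is hypothetical, and `pd.merge_asof` is a technique swapped in here for the lookup step.

```python
import numpy as np
import pandas as pd

# Frames rebuilt from the tables in the question (names df1/df2 are assumed).
df1 = pd.DataFrame({'From1': [0, 3, 15, 30],
                    'To1': [3, 15, 30, 45],
                    'Value1': [1., 2., 1., 3.]})
df2 = pd.DataFrame({'From2': [0, 5, 11, 30],
                    'To2': [5, 11, 30, 45],
                    'Value2': ['b)', 'a)', 'c)', 'a)']})

def merge_intervals(df1, df2):
    """Hypothetical helper: split at every boundary found in either frame."""
    # All distinct boundaries from both frames, sorted ascending.
    edges = np.unique(np.concatenate([df1['From1'].to_numpy(),
                                      df1['To1'].to_numpy(),
                                      df2['From2'].to_numpy(),
                                      df2['To2'].to_numpy()]))
    # Consecutive boundaries form the refined sub-intervals.
    out = pd.DataFrame({'From': edges[:-1], 'To': edges[1:]})
    # For each sub-interval, pick the row whose start is the latest one
    # that is <= From (merge_asof's default backward search).
    out = pd.merge_asof(out, df1[['From1', 'Value1']],
                        left_on='From', right_on='From1')
    out = pd.merge_asof(out, df2[['From2', 'Value2']],
                        left_on='From', right_on='From2')
    return out.drop(columns=['From1', 'From2'])

result = merge_intervals(df1, df2)
print(result)
```

On the question's data this yields the six desired rows. Since it never materialises one row per unit step, it should also cope with non-integer boundaries, provided the boundary columns share a dtype and the assumptions above hold.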