Comparing two sequence columns, and base on a condition add element to Dataframe
Question:
I’m looking for a script in Python that can help me with the following problem:
Having the following two columns, I need to create gaps between the sequences in order to make them match:
Input
Output
Index
column A
column B
column A
column B
0
1
1
1
1
1
2
2
2
2
2
2
2
2
2
3
3
3
3
3
4
4
3
3
5
5
4
4
4
6
5
5
5
5
7
6
5
5
5
8
8
6
6
6
9
8
8
8
8
10
9
8
8
8
11
10
9
9
9
12
11
9
9
13
11
10
10
10
14
15
13
11
15
16
13
11
16
16
14
13
17
17
14
13
18
17
15
14
19
18
15
14
20
19
16
15
15
21
21
16
15
22
22
17
16
16
23
27
17
16
16
24
17
17
25
17
17
26
18
27
19
28
21
29
22
30
27
I tried but my logic doesn’t work
I have tried different things using Pandas and Python, first I tried converting the columns into a list and iterate them one by one but didn’t work, my closest approach is this one, but unfortunately still not working:
for i in df.index:
if(df['column A'][i] != df['column B'][i]):
df['column A'] = df['column A'][:i] + np.NaN + df['column A'][i:]
#df['column A'][i] = df['column A'].append(pd.Series([np.NaN]))
#df2['column A'] = df['column A'].loc[i] = np.NaN
Answers:
It’s often a bad idea to change an object while iterating over it. Instead, just initialize two new lists as empty and fill them up with the values from the original columns or NaN, as appropriate. The trick is to iterate over the indices of the columns A and B separately, so that you can increment only one of them when you fill in an NaN value in the other list:
a = df['column A'].values
b = df['column B'].values
a_out = []
b_out = []
i = 0
j = 0
while i < len(df) and j < len(df):
if a[i] == b[j]:
a_out.append(a[i])
i += 1
b_out.append(b[j])
j += 1
elif a[i] < b[j]:
a_out.append(a[i])
i += 1
b_out.append(np.nan)
else:
a_out.append(np.nan)
b_out.append(b[j])
j += 1
if i < j:
a_out.extend(a[i:])
b_out.extend([np.nan] * len(a[i:]))
elif i > j:
b_out.extend(b[j:])
a_out.extend([np.nan] * len(b[j:]))
df_out = pd.DataFrame({'column A': a_out,
'column B': b_out})
I’m looking for a script in Python that can help me with the following problem:
Having the following two columns, I need to create gaps between the sequences in order to make them match:
Input | Output | |||
---|---|---|---|---|
Index | column A | column B | column A | column B |
0 | 1 | 1 | 1 | 1 |
1 | 2 | 2 | 2 | 2 |
2 | 2 | 2 | 2 | 2 |
3 | 3 | 3 | 3 | 3 |
4 | 4 | 3 | 3 | |
5 | 5 | 4 | 4 | 4 |
6 | 5 | 5 | 5 | 5 |
7 | 6 | 5 | 5 | 5 |
8 | 8 | 6 | 6 | 6 |
9 | 8 | 8 | 8 | 8 |
10 | 9 | 8 | 8 | 8 |
11 | 10 | 9 | 9 | 9 |
12 | 11 | 9 | 9 | |
13 | 11 | 10 | 10 | 10 |
14 | 15 | 13 | 11 | |
15 | 16 | 13 | 11 | |
16 | 16 | 14 | 13 | |
17 | 17 | 14 | 13 | |
18 | 17 | 15 | 14 | |
19 | 18 | 15 | 14 | |
20 | 19 | 16 | 15 | 15 |
21 | 21 | 16 | 15 | |
22 | 22 | 17 | 16 | 16 |
23 | 27 | 17 | 16 | 16 |
24 | 17 | 17 | ||
25 | 17 | 17 | ||
26 | 18 | |||
27 | 19 | |||
28 | 21 | |||
29 | 22 | |||
30 | 27 |
I tried but my logic doesn’t work
I have tried different things using Pandas and Python, first I tried converting the columns into a list and iterate them one by one but didn’t work, my closest approach is this one, but unfortunately still not working:
for i in df.index:
if(df['column A'][i] != df['column B'][i]):
df['column A'] = df['column A'][:i] + np.NaN + df['column A'][i:]
#df['column A'][i] = df['column A'].append(pd.Series([np.NaN]))
#df2['column A'] = df['column A'].loc[i] = np.NaN
It’s often a bad idea to change an object while iterating over it. Instead, just initialize two new lists as empty and fill them up with the values from the original columns or NaN, as appropriate. The trick is to iterate over the indices of the columns A and B separately, so that you can increment only one of them when you fill in an NaN value in the other list:
a = df['column A'].values
b = df['column B'].values
a_out = []
b_out = []
i = 0
j = 0
while i < len(df) and j < len(df):
if a[i] == b[j]:
a_out.append(a[i])
i += 1
b_out.append(b[j])
j += 1
elif a[i] < b[j]:
a_out.append(a[i])
i += 1
b_out.append(np.nan)
else:
a_out.append(np.nan)
b_out.append(b[j])
j += 1
if i < j:
a_out.extend(a[i:])
b_out.extend([np.nan] * len(a[i:]))
elif i > j:
b_out.extend(b[j:])
a_out.extend([np.nan] * len(b[j:]))
df_out = pd.DataFrame({'column A': a_out,
'column B': b_out})