How to drop all the rows between two delimiters
Question:
Please find below my input/output :
Input :
Col1
Green
Purple
Start delimiter
abchd
oeitms
End delimiter
Yellow
Start delimiter
kdkfkd
dldldldl
mdmdmdm
End delimiter
Red
Brown
Rose
White
Output (desired) :
Col1
Green
Purple
Yellow
Red
Brown
Rose
White
Basically, I’m trying to drop all the rows between Start delimiter
and End delimiter
(including their rows too).
The problem is the distance between these two is not fix !
I tried the following code but without any success :
m1 = df['Col1'] == 'Start delimiter'
m2 = df['Col1'] == 'End delimiter'
df.loc[~m1&~m2]
Do you have any suggestions, please?
Answers:
You could try:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{"col1":["red","blue","Start delimiter","black", "white", "End delimiter", "green", "blue",
"Start delimiter", "red", "green", "anothercolor", "End delimiter", "again_anothercolor"]}
)
idx_start = df.index[df.col1 == "Start delimiter"]
idx_end = df.index[df.col1 == "End delimiter"]
rows2drop = []
for start, end in zip(idx_start, idx_end):
rows2drop.extend(np.arange(start, end+1))
df.drop(rows2drop, axis=0)
Initial dataframe
+----+--------------------+
| | col1 |
|----+--------------------|
| 0 | red |
| 1 | blue |
| 2 | Start delimiter |
| 3 | black |
| 4 | white |
| 5 | End delimiter |
| 6 | green |
| 7 | blue |
| 8 | Start delimiter |
| 9 | red |
| 10 | green |
| 11 | anothercolor |
| 12 | End delimiter |
| 13 | again_anothercolor |
+----+--------------------+
Dataframe after drop()
+----+--------------------+
| | col1 |
|----+--------------------|
| 0 | red |
| 1 | blue |
| 6 | green |
| 7 | blue |
| 13 | again_anothercolor |
+----+--------------------+
try:
df
Col1
0 Green
1 Purple
2 Start delimiter
3 abchd
4 oeitms
5 End delimiter
6 Yellow
7 Start delimiter
8 kdkfkd
9 dldldldl
10 mdmdmdm
11 End delimiter
12 Red
13 Brown
14 Rose
15 White
a = False
b = []
for i in df['Col1'].tolist():
print(i)
if i == 'Start delimiter':
a = True
b.append(i)
elif i == 'End delimiter':
a = False
b.append(i)
elif a:
b.append(i)
#b --> ['Start delimiter', 'abchd', 'oeitms', 'End delimiter', 'Start delimiter', 'kdkfkd', 'dldldldl', 'mdmdmdm', 'End delimiter']
df = df[~df['Col1'].isin(b)]
df
Col1
0 Green
1 Purple
6 Yellow
12 Red
13 Brown
14 Rose
15 White
For a vectorial approach you can use:
d = {'Start delimiter': False,
'End delimiter': True}
# assign False after Start
# True after End
# True otherwise
m1 = df['Col1'].map(d).ffill().fillna(True)
# ensure not an End
m2 = df['Col1'].ne('End delimiter')
# if both conditions are met, keep the rows
out = df.loc[m1&m2]
Output:
Col1
0 Green
1 Purple
6 Yellow
12 Red
13 Brown
14 Rose
15 White
Here’s a function to do just that.
def cleanlist(list_to_clean,start_clean,end_clean) :
clean_list =[]
clean_status = False
for x in list_to_clean:
if x == start_clean :
clean_status=True
elif clean_status == False:
clean_list.append(x)
elif x == end_clean :
clean_status = False
return clean_list
print(cleanlist(list,'Start delimiter','End delimiter'))
One option is to build the list of indices that are not within the starts and ends, and index df
with that. This assumes that for every Start delimiter, there is a corresponding End delimiter:
start = df.Col1.eq('Start delimiter').to_numpy().nonzero()[0]
end = df.Col1.eq('End delimiter').to_numpy().nonzero()[0] + 1
combo = [np.arange(s, e) for s, e in zip(start, end)]
combo = np.concatenate(combo)
keep = df.index.difference(combo)
df.loc[keep]
Col1
0 Green
1 Purple
6 Yellow
12 Red
13 Brown
14 Rose
15 White
You can also use mod()
s = 'Start delimiter'
e = 'End delimiter'
df.loc[~df['Col1'].isin([s,e]).cumsum().mod(2) & df['Col1'].ne(e)]
Col1
0 Green
1 Purple
6 Yellow
12 Red
13 Brown
14 Rose
15 White
Please find below my input/output :
Input :
Col1
Green
Purple
Start delimiter
abchd
oeitms
End delimiter
Yellow
Start delimiter
kdkfkd
dldldldl
mdmdmdm
End delimiter
Red
Brown
Rose
White
Output (desired) :
Col1
Green
Purple
Yellow
Red
Brown
Rose
White
Basically, I’m trying to drop all the rows between Start delimiter
and End delimiter
(including their rows too).
The problem is the distance between these two is not fix !
I tried the following code but without any success :
m1 = df['Col1'] == 'Start delimiter'
m2 = df['Col1'] == 'End delimiter'
df.loc[~m1&~m2]
Do you have any suggestions, please?
You could try:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{"col1":["red","blue","Start delimiter","black", "white", "End delimiter", "green", "blue",
"Start delimiter", "red", "green", "anothercolor", "End delimiter", "again_anothercolor"]}
)
idx_start = df.index[df.col1 == "Start delimiter"]
idx_end = df.index[df.col1 == "End delimiter"]
rows2drop = []
for start, end in zip(idx_start, idx_end):
rows2drop.extend(np.arange(start, end+1))
df.drop(rows2drop, axis=0)
Initial dataframe
+----+--------------------+
| | col1 |
|----+--------------------|
| 0 | red |
| 1 | blue |
| 2 | Start delimiter |
| 3 | black |
| 4 | white |
| 5 | End delimiter |
| 6 | green |
| 7 | blue |
| 8 | Start delimiter |
| 9 | red |
| 10 | green |
| 11 | anothercolor |
| 12 | End delimiter |
| 13 | again_anothercolor |
+----+--------------------+
Dataframe after drop()
+----+--------------------+
| | col1 |
|----+--------------------|
| 0 | red |
| 1 | blue |
| 6 | green |
| 7 | blue |
| 13 | again_anothercolor |
+----+--------------------+
try:
df
Col1
0 Green
1 Purple
2 Start delimiter
3 abchd
4 oeitms
5 End delimiter
6 Yellow
7 Start delimiter
8 kdkfkd
9 dldldldl
10 mdmdmdm
11 End delimiter
12 Red
13 Brown
14 Rose
15 White
a = False
b = []
for i in df['Col1'].tolist():
print(i)
if i == 'Start delimiter':
a = True
b.append(i)
elif i == 'End delimiter':
a = False
b.append(i)
elif a:
b.append(i)
#b --> ['Start delimiter', 'abchd', 'oeitms', 'End delimiter', 'Start delimiter', 'kdkfkd', 'dldldldl', 'mdmdmdm', 'End delimiter']
df = df[~df['Col1'].isin(b)]
df
Col1
0 Green
1 Purple
6 Yellow
12 Red
13 Brown
14 Rose
15 White
For a vectorial approach you can use:
d = {'Start delimiter': False,
'End delimiter': True}
# assign False after Start
# True after End
# True otherwise
m1 = df['Col1'].map(d).ffill().fillna(True)
# ensure not an End
m2 = df['Col1'].ne('End delimiter')
# if both conditions are met, keep the rows
out = df.loc[m1&m2]
Output:
Col1
0 Green
1 Purple
6 Yellow
12 Red
13 Brown
14 Rose
15 White
Here’s a function to do just that.
def cleanlist(list_to_clean,start_clean,end_clean) :
clean_list =[]
clean_status = False
for x in list_to_clean:
if x == start_clean :
clean_status=True
elif clean_status == False:
clean_list.append(x)
elif x == end_clean :
clean_status = False
return clean_list
print(cleanlist(list,'Start delimiter','End delimiter'))
One option is to build the list of indices that are not within the starts and ends, and index df
with that. This assumes that for every Start delimiter, there is a corresponding End delimiter:
start = df.Col1.eq('Start delimiter').to_numpy().nonzero()[0]
end = df.Col1.eq('End delimiter').to_numpy().nonzero()[0] + 1
combo = [np.arange(s, e) for s, e in zip(start, end)]
combo = np.concatenate(combo)
keep = df.index.difference(combo)
df.loc[keep]
Col1
0 Green
1 Purple
6 Yellow
12 Red
13 Brown
14 Rose
15 White
You can also use mod()
s = 'Start delimiter'
e = 'End delimiter'
df.loc[~df['Col1'].isin([s,e]).cumsum().mod(2) & df['Col1'].ne(e)]
Col1
0 Green
1 Purple
6 Yellow
12 Red
13 Brown
14 Rose
15 White