How to group overlapping ranges of substrings?
Question:
I have a list of dictionaries in the following format:
ldict = [
{'start_offset': 0, 'end_offset': 10, 'string_type': 'verb'},
{'start_offset': 5, 'end_offset': 15, 'string_type': 'noun'},
{'start_offset': 20, 'end_offset': 30, 'string_type': 'noun'},
{'start_offset': 42, 'end_offset': 51, 'string_type': 'adj'},
{'start_offset': 45, 'end_offset': 52, 'string_type': 'noun'}
]
The start_offset and end_offset indicate the start and end positions of a substring in a string.
My aim is to merge overlapping substrings into a single row.
The start_offset of the merged row would be the lowest position in the group and the end_offset would be the highest.
Example of output:
ldict = [
{'start_offset': 0, 'end_offset': 15, 'string_type': ['verb', 'noun']},
{'start_offset': 20, 'end_offset': 30, 'string_type': ['noun']},
{'start_offset': 42, 'end_offset': 52, 'string_type': ['adj', 'noun']}
]
My attempt:
import pandas as pd
final = []
for row in ldict:
    i1 = pd.Interval(row['start_offset'], row['end_offset'])
    semi_fin_list = []
    for one_row in ldict:
        i2 = pd.Interval(one_row['start_offset'], one_row['end_offset'])
        if i1.overlaps(i2):
            semi_fin_list.append(one_row)
    final.append(semi_fin_list)
With the attempt above I can find the overlaps for each row, but I am stuck on how to sort and combine the rows so that only distinct rows remain.
How can I achieve this? My attempt still produces duplicates.
Answers:
ldict = [
{'start_offset': 0, 'end_offset': 10, 'string_type': 'verb'},
{'start_offset': 5, 'end_offset': 15, 'string_type': 'noun'},
{'start_offset': 20, 'end_offset': 30, 'string_type': 'noun'},
{'start_offset': 42, 'end_offset': 51, 'string_type': 'adj'},
{'start_offset': 45, 'end_offset': 52, 'string_type': 'noun'}
]
# Assumes ldict is already sorted by start_offset.
new_ldict = []
i = 0
while i < len(ldict):
    start_offset = ldict[i]['start_offset']
    end_offset = ldict[i]['end_offset']
    string_type = [ldict[i]['string_type']]
    # Absorb every following entry that starts at or before the current end.
    while i + 1 < len(ldict) and ldict[i + 1]['start_offset'] <= end_offset:
        # max() keeps the end correct when the next range is fully nested.
        end_offset = max(end_offset, ldict[i + 1]['end_offset'])
        string_type.append(ldict[i + 1]['string_type'])
        i += 1
    new_ldict.append({'start_offset': start_offset, 'end_offset': end_offset, 'string_type': string_type})
    i += 1
print(new_ldict)
Output:
[{'start_offset': 0, 'end_offset': 15, 'string_type': ['verb', 'noun']}, {'start_offset': 20, 'end_offset': 30, 'string_type': ['noun']}, {'start_offset': 42, 'end_offset': 52, 'string_type': ['adj', 'noun']}]
You could sort based on start_offset before merging:
ldict = [
{'start_offset': 0, 'end_offset': 10, 'string_type': 'verb'},
{'start_offset': 5, 'end_offset': 15, 'string_type': 'noun'},
{'start_offset': 20, 'end_offset': 30, 'string_type': 'noun'},
{'start_offset': 42, 'end_offset': 51, 'string_type': 'adj'},
{'start_offset': 45, 'end_offset': 52, 'string_type': 'noun'},
]
sorted_ldict = sorted(ldict, key=lambda d: d['start_offset'])
# Seed the result with the first range; string_type becomes a list.
merged_ldict = [
    {
        'start_offset': sorted_ldict[0]['start_offset'],
        'end_offset': sorted_ldict[0]['end_offset'],
        'string_type': [sorted_ldict[0]['string_type']],
    }
]
for d in sorted_ldict[1:]:
    if d['start_offset'] > merged_ldict[-1]['end_offset']:
        # No overlap with the last merged range: start a new one.
        merged_ldict.append(
            {
                'start_offset': d['start_offset'],
                'end_offset': d['end_offset'],
                'string_type': [d['string_type']],
            }
        )
    else:
        # Overlap: extend the end and record each type only once.
        merged_ldict[-1]['end_offset'] = max(merged_ldict[-1]['end_offset'], d['end_offset'])
        if d['string_type'] not in merged_ldict[-1]['string_type']:
            merged_ldict[-1]['string_type'].append(d['string_type'])
print(merged_ldict)
Output:
[
{'start_offset': 0, 'end_offset': 15, 'string_type': ['verb', 'noun']},
{'start_offset': 20, 'end_offset': 30, 'string_type': ['noun']},
{'start_offset': 42, 'end_offset': 52, 'string_type': ['adj', 'noun']}
]
Note: You could consider using something like a dataclass rather than a raw dictionary.
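As a rough sketch of the dataclass idea in the note, the merged rows could be modelled as below. The names Span and merge_spans are hypothetical, chosen only for illustration; the merge logic follows the same sort-then-extend approach, and the input is assumed to use the dictionary format from the question.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical container for one merged range (illustrative name)."""
    start_offset: int
    end_offset: int
    string_type: list = field(default_factory=list)


def merge_spans(rows):
    """Merge overlapping rows (dicts in the question's format) into Span objects."""
    merged = []
    for row in sorted(rows, key=lambda r: r['start_offset']):
        if merged and row['start_offset'] <= merged[-1].end_offset:
            # Overlap: extend the last span and record each type only once.
            last = merged[-1]
            last.end_offset = max(last.end_offset, row['end_offset'])
            if row['string_type'] not in last.string_type:
                last.string_type.append(row['string_type'])
        else:
            # No overlap: start a new span.
            merged.append(Span(row['start_offset'], row['end_offset'], [row['string_type']]))
    return merged


ldict = [
    {'start_offset': 0, 'end_offset': 10, 'string_type': 'verb'},
    {'start_offset': 5, 'end_offset': 15, 'string_type': 'noun'},
    {'start_offset': 20, 'end_offset': 30, 'string_type': 'noun'},
    {'start_offset': 42, 'end_offset': 51, 'string_type': 'adj'},
    {'start_offset': 45, 'end_offset': 52, 'string_type': 'noun'},
]
spans = merge_spans(ldict)
print(spans)
```

Attribute access (span.end_offset) also catches misspelled field names early, which a raw dictionary key would not.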
All you need is to iterate over ldict and compare the 'end_offset' of the previous item to the 'start_offset' of the current one. Assuming your ldict is sorted by 'start_offset', you can use the following code:
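res = []
for d in ldict:
    if not res or d['start_offset'] > last['end_offset']:
        # Start a new group; copy d and wrap its string_type in a list.
        last = {**d, 'string_type': [d['string_type']]}
        res.append(last)
    else:
        # Overlap: extend the current group (max() handles nested ranges).
        last['end_offset'] = max(last['end_offset'], d['end_offset'])
        last['string_type'].append(d['string_type'])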
If your ldict is not sorted, sort it first:
from operator import itemgetter
...
ldict = sorted(ldict, key=itemgetter('start_offset'))
Output:
[
{'start_offset': 0, 'end_offset': 15, 'string_type': ['verb', 'noun']},
{'start_offset': 20, 'end_offset': 30, 'string_type': ['noun']},
{'start_offset': 42, 'end_offset': 52, 'string_type': ['adj', 'noun']}
]
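For input that may be unsorted or contain a range fully nested inside another, the same loop can be wrapped in a function that sorts first and takes the larger of the two ends. This is only a sketch: the name group_overlaps and the sample rows are made up for illustration and are not from the question.

```python
from operator import itemgetter


def group_overlaps(rows):
    """Hypothetical helper: merge overlapping rows into new dicts, leaving the input untouched."""
    res = []
    for d in sorted(rows, key=itemgetter('start_offset')):
        if res and d['start_offset'] <= res[-1]['end_offset']:
            last = res[-1]
            # max() prevents a nested range from shrinking the group's end.
            last['end_offset'] = max(last['end_offset'], d['end_offset'])
            last['string_type'].append(d['string_type'])
        else:
            # Copy d so the caller's dictionaries are not modified.
            res.append({**d, 'string_type': [d['string_type']]})
    return res


# Unsorted sample with a range (2-4) fully nested inside another (0-10).
rows = [
    {'start_offset': 8, 'end_offset': 12, 'string_type': 'adj'},
    {'start_offset': 0, 'end_offset': 10, 'string_type': 'verb'},
    {'start_offset': 2, 'end_offset': 4, 'string_type': 'noun'},
]
result = group_overlaps(rows)
print(result)
# [{'start_offset': 0, 'end_offset': 12, 'string_type': ['verb', 'noun', 'adj']}]
```

If the end were simply overwritten with the nested range's end (4), the range starting at 8 would wrongly open a second group.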