How to group overlapping ranges of substrings?

Question

I have a list of dictionaries in the following format:

ldict = [
{'start_offset': 0, 'end_offset': 10, 'string_type': 'verb'},
{'start_offset': 5, 'end_offset': 15, 'string_type': 'noun'},
{'start_offset': 20, 'end_offset': 30, 'string_type': 'noun'},
{'start_offset': 42, 'end_offset': 51, 'string_type': 'adj'},
{'start_offset': 45, 'end_offset': 52, 'string_type': 'noun'}
]

The start_offset and end_offset indicate the start and end positions of a substring in a string.

My aim is to merge overlapping substrings into a single row.
The merged row's start_offset should be the lowest start position and its end_offset the highest end position.

Example of output:

ldict = [
{'start_offset': 0, 'end_offset': 15, 'string_type': ['verb', 'noun']},
{'start_offset': 20, 'end_offset': 30, 'string_type': ['noun']},
{'start_offset': 42, 'end_offset': 52, 'string_type': ['adj', 'noun']}
]

My attempt:

import pandas as pd
final = []
for row in ldict:
  i1 = pd.Interval(row['start_offset'], row['end_offset'])
  semi_fin_list = []
  for one_row in ldict:
     i2 = pd.Interval(one_row['start_offset'], one_row['end_offset'])
     if i1.overlaps(i2):
         semi_fin_list.append(one_row)
  final.append(semi_fin_list)

With the attempt above I can find the overlaps for each row, but I am stuck on how to then sort and combine the rows so that only distinct rows remain.

How can I achieve this? My attempt still produces duplicates.

Asked By: nifeco

Answer 1

ldict = [
    {'start_offset': 0, 'end_offset': 10, 'string_type': 'verb'},
    {'start_offset': 5, 'end_offset': 15, 'string_type': 'noun'},
    {'start_offset': 20, 'end_offset': 30, 'string_type': 'noun'},
    {'start_offset': 42, 'end_offset': 51, 'string_type': 'adj'},
    {'start_offset': 45, 'end_offset': 52, 'string_type': 'noun'}
]

new_ldict = []
i = 0
# Assumes ldict is already sorted by start_offset.
while i < len(ldict):
    start_offset = ldict[i]['start_offset']
    end_offset = ldict[i]['end_offset']
    string_type = [ldict[i]['string_type']]
    # Absorb every following row that starts inside the current range.
    while i + 1 < len(ldict) and ldict[i + 1]['start_offset'] <= end_offset:
        # max() keeps the range from shrinking when a row is nested inside it.
        end_offset = max(end_offset, ldict[i + 1]['end_offset'])
        string_type.append(ldict[i + 1]['string_type'])
        i += 1

    new_ldict.append({'start_offset': start_offset, 'end_offset': end_offset, 'string_type': string_type})
    i += 1
print(new_ldict)

Output:

[{'start_offset': 0, 'end_offset': 15, 'string_type': ['verb', 'noun']}, {'start_offset': 20, 'end_offset': 30, 'string_type': ['noun']}, {'start_offset': 42, 'end_offset': 52, 'string_type': ['adj', 'noun']}]

Answered By: Ze'ev Ben-Tsvi

Answer 2

You could sort based on start_offset before merging:

ldict = [
    {'start_offset': 0, 'end_offset': 10, 'string_type': 'verb'},
    {'start_offset': 5, 'end_offset': 15, 'string_type': 'noun'},
    {'start_offset': 20, 'end_offset': 30, 'string_type': 'noun'},
    {'start_offset': 42, 'end_offset': 51, 'string_type': 'adj'},
    {'start_offset': 45, 'end_offset': 52, 'string_type': 'noun'},
]
sorted_ldict = sorted(ldict, key=lambda d: d['start_offset'])
merged_ldict = [
    {
        'start_offset': sorted_ldict[0]['start_offset'],
        'end_offset': sorted_ldict[0]['end_offset'],
        'string_type': [sorted_ldict[0]['string_type']],
    }
]
for d in sorted_ldict[1:]:
    if d['start_offset'] > merged_ldict[-1]['end_offset']:
        merged_ldict.append(
            {
                'start_offset': d['start_offset'],
                'end_offset': d['end_offset'],
                'string_type': [d['string_type']],
            }
        )
    else:
        merged_ldict[-1]['end_offset'] = max(
            merged_ldict[-1]['end_offset'], d['end_offset']
        )
        if d['string_type'] not in merged_ldict[-1]['string_type']:
            merged_ldict[-1]['string_type'].append(d['string_type'])
print(merged_ldict)

Output:

[
     {'start_offset': 0, 'end_offset': 15, 'string_type': ['verb', 'noun']}, 
     {'start_offset': 20, 'end_offset': 30, 'string_type': ['noun']}, 
     {'start_offset': 42, 'end_offset': 52, 'string_type': ['adj', 'noun']}
]

Note: You could consider using something like a dataclass rather than a raw dictionary.
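As a rough illustration of that note, here is a minimal sketch of the same merge using a dataclass. The names `Span` and `merge_spans` are hypothetical and chosen only for this example; the input rows are assumed to be dicts shaped like ldict.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A merged range; the class name and fields are illustrative."""
    start_offset: int
    end_offset: int
    string_type: list = field(default_factory=list)


def merge_spans(rows):
    """Merge overlapping rows (dicts shaped like ldict) into Span objects."""
    merged = []
    for row in sorted(rows, key=lambda r: r['start_offset']):
        if merged and row['start_offset'] <= merged[-1].end_offset:
            # Overlap: extend the current span and record any new type.
            last = merged[-1]
            last.end_offset = max(last.end_offset, row['end_offset'])
            if row['string_type'] not in last.string_type:
                last.string_type.append(row['string_type'])
        else:
            # No overlap: start a new span.
            merged.append(
                Span(row['start_offset'], row['end_offset'], [row['string_type']])
            )
    return merged
```

Calling merge_spans(ldict) on the list above should give three Span objects covering 0-15, 20-30 and 42-52. Attribute access (span.end_offset) replaces the string keys, so a misspelled field name fails immediately instead of silently creating a new key.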

Answered By: Sash Sinha

Answer 3

All you need to do is iterate over ldict and compare the 'end_offset' of the previous merged item with the 'start_offset' of the current one. Assuming your ldict is sorted by 'start_offset', you can use the following code:

res = []
for d in ldict:
    if not res or d['start_offset'] > last['end_offset']:
        last = {**d, 'string_type': [d['string_type']]}
        res.append(last)
    else:
        last['end_offset'] = max(last['end_offset'], d['end_offset'])
        last['string_type'].append(d['string_type'])

If your ldict is not sorted, sort it first:

from operator import itemgetter

...

ldict = sorted(ldict, key=itemgetter('start_offset'))

Output:

[
    {'start_offset': 0, 'end_offset': 15, 'string_type': ['verb', 'noun']},
    {'start_offset': 20, 'end_offset': 30, 'string_type': ['noun']},
    {'start_offset': 42, 'end_offset': 52, 'string_type': ['adj', 'noun']}
]
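For reuse, the sort and the loop can be combined into one function. This is only a sketch: `merge_ranges` and the `sample` data are hypothetical names made up for the example. The sample is deliberately unsorted and contains a range nested inside another, which are the two cases most likely to go wrong.

```python
from operator import itemgetter


def merge_ranges(rows):
    """Return merged copies of rows; the input list is not modified."""
    res = []
    for d in sorted(rows, key=itemgetter('start_offset')):
        if not res or d['start_offset'] > res[-1]['end_offset']:
            # Gap before this row: start a new merged range.
            res.append({**d, 'string_type': [d['string_type']]})
        else:
            # Overlap: max() stops a nested row from shrinking the range.
            res[-1]['end_offset'] = max(res[-1]['end_offset'], d['end_offset'])
            res[-1]['string_type'].append(d['string_type'])
    return res


sample = [
    {'start_offset': 20, 'end_offset': 30, 'string_type': 'noun'},
    {'start_offset': 0, 'end_offset': 15, 'string_type': 'verb'},
    {'start_offset': 5, 'end_offset': 10, 'string_type': 'adj'},
]
print(merge_ranges(sample))
```

Here the 5-10 range sits entirely inside 0-15, so the merged row keeps end_offset 15 and the 20-30 row stays separate.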

Answered By: Olvin Roght
