Group a list of strings by similar values

Question:

I want to group a list of strings by similarity.

In my case (simplified here, because the real list can be huge), it's a list of paths to zip files like this:

["path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip",
 "path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip",
 "path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip",
 "path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip",
 "path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip"]

I would like to group the strings in that list by a key, but I don't yet know how to define it (I guess with a lambda, but I can't figure it out). The result should be a list like this:

[["path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip",
  "path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip"],
 ["path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip"],
 ["path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip",
  "path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip"]]

For example, the first grouping key would be:

*_HELICOPTEROS-MARINOS_20230329_*_21049_00748_*.zip

second would be:

*_HELICOPTEROS-MARINOS_20230329_*_21049_00747_*.zip

and third:

*_NOLAS_20230326_*_20160_06473_*.zip
Asked By: FrozzenFinger

||

Answers:

It’s all about extracting the required key to be used to group the file names together.

Here's a simplified function, extract_features, that assumes the file name follows the standard format shown, with no extra _ inside the leading fields. You can adapt it to your file naming convention to extract the required key, and then group the paths with itertools.groupby(). Note that groupby() only groups consecutive items with equal keys, so the list has to be sorted by the same key first.

from itertools import groupby

def extract_features(f):
    filename = f.split('/')[-1]
    parts = filename.split('_')
    return (parts[2], parts[3], parts[6], parts[7])

data = ["path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip",
 "path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip",
 "path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip",
 "path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip",
 "path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip"]

# Sort in place by the grouping key, then collect each run of equal keys
data.sort(key=extract_features)
output = [list(g) for _, g in groupby(data, key=extract_features)]

print(output)

Output:

[['path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip'],
['path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip', 'path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip'],
['path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip', 'path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip']]

For example, for path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip the grouping key is ('HELICOPTEROS-MARINOS', '20230329', '21049', '00748').
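
Since the question mentions that the real list can be huge, here is a minimal sketch of an alternative that skips the sort and collects the groups in a dictionary instead. It reuses the same key as extract_features above; the function name group_paths is only an illustration, not part of the original answer. Groups come out in the order their key is first seen, which relies on dictionaries preserving insertion order (Python 3.7+).

```python
from collections import defaultdict

def extract_features(path):
    # Same key as above: name, date and the two numeric IDs
    parts = path.split('/')[-1].split('_')
    return (parts[2], parts[3], parts[6], parts[7])

def group_paths(paths):
    # Hypothetical helper: one pass over the list, no sorting required
    groups = defaultdict(list)
    for p in paths:
        groups[extract_features(p)].append(p)
    # Insertion order is preserved, so groups appear in first-seen order
    return list(groups.values())

data = ["path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip",
        "path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip",
        "path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip",
        "path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip",
        "path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip"]

output = group_paths(data)
print(output)
```

Unlike the sort-then-groupby version, this leaves data untouched and does not order the groups by key.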

Answered By: Jay

A more flexible approach would be to use a regex combined with groupby:

import re
from itertools import groupby

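# Capture groups are the key fields: prefix and name, date, the 145T3-style code, and the two numeric IDs
pat = r'^(.*)_(\d{8})_\d{6}_(\d+T\d)_(\d{5})_(\d{5})_[A-Z_]+\.zip$'

def sort_keys(x):
    # Every capture group in the pattern becomes one element of the key
    return re.match(pat, x).groups()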

# data is the list of paths from the question
sdata = sorted(data, key=sort_keys)
out = [list(g) for k, g in groupby(sdata, key=sort_keys)]

NB: To have more or fewer sort keys, add or remove capture groups (parentheses) in the pattern; sort_keys must return every group you want in the key.

Output:

[['path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip'],
 ['path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip',
  'path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip'],
 ['path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip',
  'path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip']]
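
As an illustration of the note above about adding or removing capture groups, here is a hedged sketch of a variant whose key mirrors the wildcard patterns from the question (name, date, and the two numeric IDs only). The pattern KEY_RE, the group names, and the helpers group_key and group_by_key are assumptions made for this example. It also sets aside paths that do not match instead of raising an error, which may be useful on a large, messy list.

```python
import re
from collections import defaultdict

# Hypothetical pattern: ..._<name>_<YYYYMMDD>_<HHMMSS>_<code>_<5 digits>_<5 digits>_<suffix>.zip
KEY_RE = re.compile(
    r'_(?P<name>[^_/]+)_(?P<date>\d{8})_\d{6}_[^_]+'
    r'_(?P<id1>\d{5})_(?P<id2>\d{5})_[A-Z_]+\.zip$'
)

def group_key(path):
    # Return the tuple of captured fields, or None if the path does not match
    m = KEY_RE.search(path)
    return m.groups() if m else None

def group_by_key(paths):
    groups, unmatched = defaultdict(list), []
    for p in paths:
        key = group_key(p)
        if key is None:
            unmatched.append(p)   # keep non-conforming names for inspection
        else:
            groups[key].append(p)
    return dict(groups), unmatched

data = ["path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip",
        "path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip",
        "path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip",
        "path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip",
        "path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip"]

groups, unmatched = group_by_key(data + ["path/readme.zip"])
for key, members in groups.items():
    print(key, len(members))
```

Keeping the keys in the returned dictionary makes it easy to look up a specific group, for example groups[('NOLAS', '20230326', '20160', '06473')].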
Answered By: Timeless