Store smallest number from a list based on criteria

Question:

I have a list of file names (strings). For each group of file names that share the same beginning (prefix), I want to store, in a new list, the file name with the smallest number at the end.

Example: For the file names in the list beginning with ‘2022-04-27_Cc1cPL3punY’, I’d only want to store the one with the smallest number at the end. In this case, that is the file name ending in 2825288523641594007, and likewise for every other prefix.

files = ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
         '2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523650051140.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523650168421.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523708854776.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523717189707.jpg',
         '2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg',
         '2022-04-27_Cc1dN3Rp0es_2825292832383025904.jpg',
         '2022-04-27_Cc1dN3Rp0es_2825292832383101420.jpg',
         '2022-04-27_Cc1dN3Rp0es_2825292832383164193.jpg',
         '2022-04-27_Cc1dN3Rp0es_2825292832399945744.jpg',
         '2022-04-27_Cc1dN3Rp0es_2825292832458472617.jpg']
Asked By: Precog


Answers:

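Your best bet is probably the pandas library, which is very good at dealing with tabular data. Note that taking the minimum of the number column alone gives back only the numbers, so the code below uses idxmin to look up the full file name for each group.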

import pandas as pd

file_name_list = []  # Fill in here
stems = [file_name[:-4] for file_name in file_name_list]  # Get rid of .jpg
file_name_table = pd.Series(stems).str.split("_", expand=True)  # Split the strings
file_name_table.columns = ['date', 'prefix', 'number']  # Renaming for readability
file_name_table['number'] = file_name_table['number'].astype('int64')
# idxmin returns the row label of the smallest number in each (date, prefix) group
smallest_rows = file_name_table.groupby(by=['date', 'prefix'])['number'].idxmin()
# Row labels match positions in file_name_list, so this recovers the full names
smallest_file_names_list = [file_name_list[i] for i in smallest_rows]
Answered By: David_Leber

If the names always follow the same pattern, you can split each one on its separators (‘.’ and ‘_’ in your example) with str.split, and then sort the resulting list of lists by the number. This has to be done for each "ID", as I will call the group identifiers, so we first collect the unique IDs and then iterate over them. The result has one entry per ID: the ID itself in position 0, and in position 1 a list of [file name, number] pairs sorted by number.

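# names is the list of file names from the question
prefix = list({name.split('_')[1] for name in names})  # unique IDs

names_split = []

for pre in prefix:
    # Each entry: [ID, [[file name, number as a string], ...]]
    names_split.append([pre, [[name, name.split('.')[0].split('_')[2]] for name in names if name.split('_')[1] == pre]])

for i in range(len(prefix)):
    # Sort each ID's pairs numerically by the trailing number
    names_split[i][1] = sorted(names_split[i][1], key=lambda x: int(x[1]))

print(names_split)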

The file you need should be names_split[x][1][0][0], where x is the index of each ID.

PS: If you need to find a particular ID, you can use

searched_index = [value[0] for value in names_split].index(ID)

and then names_split[searched_index][1][0][0]
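As an alternative sketch of the same grouping idea, a dictionary keyed by ID can track the smallest number seen so far, which avoids sorting and the index lookup. The function name smallest_per_id is hypothetical, and the code assumes every name has exactly the date_ID_number.ext shape:

```python
def smallest_per_id(names):
    """Return {ID: file name with the smallest trailing number}."""
    best = {}  # ID -> (number, file name)
    for name in names:
        stem = name.rsplit('.', 1)[0]  # drop the extension
        _date, file_id, number = stem.split('_')  # assumes exactly two underscores
        candidate = (int(number), name)
        if file_id not in best or candidate < best[file_id]:
            best[file_id] = candidate
    return {file_id: name for file_id, (_, name) in best.items()}


names = [
    '2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
    '2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
    '2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
    '2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
]
smallest = smallest_per_id(names)
print(smallest['Cc1cPL3punY'])
# 2022-04-27_Cc1cPL3punY_2825288523641594007.jpg
```

list(smallest.values()) then gives the list of file names to store.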

Edit: Changed the order of the split characters and added docs on the split method

Edit 2: Added prefix grouping

Answered By: Sebastian Olivos

Given that your files would already be sorted in ascending order by your OS/file manager, you can just take the first one for each common prefix

files = ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
 '2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
 '2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
 '2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
 '2022-04-27_Cc1cPL3punY_2825288523650051140.jpg',
 '2022-04-27_Cc1cPL3punY_2825288523650168421.jpg',
 '2022-04-27_Cc1cPL3punY_2825288523708854776.jpg',
 '2022-04-27_Cc1cPL3punY_2825288523717189707.jpg',
 '2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg',
 '2022-04-27_Cc1dN3Rp0es_2825292832383025904.jpg',
 '2022-04-27_Cc1dN3Rp0es_2825292832383101420.jpg',
 '2022-04-27_Cc1dN3Rp0es_2825292832383164193.jpg',
 '2022-04-27_Cc1dN3Rp0es_2825292832399945744.jpg',
 '2022-04-27_Cc1dN3Rp0es_2825292832458472617.jpg']

prefix_old = None
prefix = None
for f in files:
  parts = f.split('_', 2)
  prefix = '_'.join(parts[:2])
  if prefix != prefix_old:
    value = parts[2].split('.')[0]
    print(f'Min value with prefix {prefix} is {value}')
    prefix_old = prefix

Output

Min value with prefix 2022-04-27_Cc1a6yWpUeQ is 2825282726106954381
Min value with prefix 2022-04-27_Cc1cPL3punY is 2825288523641594007
Min value with prefix 2022-04-27_Cc1dN3Rp0es is 2825292832374568690
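
Since the question asks to store the file names rather than print them, the same loop can be sketched as a function that collects them. The name first_of_each_prefix is hypothetical. It still assumes the input is sorted by prefix and then by number, and a plain string sort only matches numeric order when all the numbers have the same number of digits:

```python
def first_of_each_prefix(sorted_files):
    """Keep the first file name of every run of equal prefixes.

    Assumes sorted_files is already ordered by prefix, then by number.
    """
    kept = []
    prefix_old = None
    for f in sorted_files:
        prefix = f.rsplit('_', 1)[0]  # everything before the last '_'
        if prefix != prefix_old:
            kept.append(f)
            prefix_old = prefix
    return kept


files = [
    '2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
    '2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
    '2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
    '2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
]
print(first_of_each_prefix(files))
# ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
#  '2022-04-27_Cc1cPL3punY_2825288523641594007.jpg']
```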
Answered By: OneCricketeer

It seems that the list of files you have is already sorted according to groups of prefixes, and then according to the numbers. If that’s indeed the case, you just need to take the first path of each prefix group. This can be done easily with itertools.groupby:

from itertools import groupby

for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", next(group))

If you can’t rely on each group being internally ordered, find the minimum of each group according to the number (str.removesuffix needs Python 3.9 or later):

for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", min(group, key=lambda file: int(file.rsplit('_', 1)[1].removesuffix(".jpg"))))

And if you can’t even rely on the list being ordered by groups, just sort it beforehand:

files.sort(key=lambda file: file.rsplit('_', 1)[0])
for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", min(group, key=lambda file: int(file.rsplit('_', 1)[1].removesuffix(".jpg"))))
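
To store the results in a list instead of printing them, the last variant could be wrapped up roughly as follows. This is a sketch: the helper names prefix_of, number_of and smallest_per_prefix are made up for illustration, and it strips the extension with rsplit rather than removesuffix so it also runs on Python versions before 3.9:

```python
from itertools import groupby


def prefix_of(file):
    # Everything before the last '_', e.g. '2022-04-27_Cc1cPL3punY'
    return file.rsplit('_', 1)[0]


def number_of(file):
    # Text between the last '_' and the extension, as an integer
    return int(file.rsplit('_', 1)[1].rsplit('.', 1)[0])


def smallest_per_prefix(files):
    # groupby only merges adjacent items, so sort by prefix first
    ordered = sorted(files, key=prefix_of)
    return [min(group, key=number_of) for _, group in groupby(ordered, key=prefix_of)]


files = [
    '2022-04-27_Cc1dN3Rp0es_2825292832383025904.jpg',
    '2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
    '2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg',
    '2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
]
print(smallest_per_prefix(files))
# ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
#  '2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg']
```

Using sorted instead of files.sort leaves the original list untouched; the result comes back ordered by prefix.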
Answered By: Tomerikoo