Store smallest number from a list based on criteria
Question:
I have a list of file names as strings where I want to store, in a list, the file name with the minimum ending number relative to file names that have the same beginning denotation.
Example: For any file names in the list beginning with ‘2022-04-27_Cc1cPL3punY’, I’d only want to store the file name with the minimum value of the number at the end. In this case, it would be the file name with 2825288523641594007, and so on for other beginning denotations.
files = ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
'2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650051140.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650168421.jpg',
'2022-04-27_Cc1cPL3punY_2825288523708854776.jpg',
'2022-04-27_Cc1cPL3punY_2825288523717189707.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383025904.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383101420.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383164193.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832399945744.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832458472617.jpg']
Answers:
Your best bet is probably the pandas library, which is very good at dealing with tabular data.
import pandas as pd
file_name_list = files # The list from the question
file_name_series = pd.Series(file_name_list).str[:-4] # Get rid of .jpg
file_name_table = file_name_series.str.split("_", expand=True) # Split the strings
file_name_table.columns = ['date', 'prefix', 'number'] # Renaming for readability
file_name_table['number'] = file_name_table['number'].astype('int64') # The numbers need 64 bits
# idxmin gives the row label of the smallest number in each group
smallest_rows = file_name_table.groupby(by=['date', 'prefix'])['number'].idxmin()
smallest_file_names_list = [file_name_list[i] for i in smallest_rows] # Look up the full names
If the names all follow the same pattern, you can split each name on its separators ('.' and '_' in your example; see the documentation for str.split) and then sort the resulting list of lists by the number. This has to be done per "ID", as I will call each group identifier, so we first need to collect the unique IDs and then iterate over them. After that, we can proceed with the splitting. The result is, for each ID, a list of pairs with the complete file name in position 0 and the number from the suffix in position 1.
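names = files # The list from the question
# Unique IDs (the middle part of each name)
prefix = list(set([pre.split('_')[1] for pre in names]))
names_split = []
for pre in prefix:
    # One entry per ID: [ID, [[file name, number as string], ...]]
    names_split.append([pre, [[name, name.split('.')[0].split('_')[2]] for name in names if name.split('_')[1] == pre]])
for i in range(len(prefix)):
    # Sort each ID's pairs numerically so the smallest comes first
    names_split[i][1] = sorted(names_split[i][1], key=lambda x: int(x[1]))
print(names_split)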
The file you need should be names_split[x][1][0][0], where x identifies each ID.
PS: If you need to find a particular ID, you can use
searched_index = [value[0] for value in names_split].index(ID)
and then names_split[searched_index][1][0][0], where ID is the identifier string you are looking for.
Edit: Changed the order of the split characters and added docs on the split method
Edit 2: Added prefix grouping
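The per-ID sort above is only used to reach the first entry of each group. As a minimal sketch of the same grouping idea without sorting, a dictionary can keep just the lowest-numbered name seen so far for each ID. The helper names number_of and smallest_per_id are hypothetical, and the sketch assumes that neither the date nor the ID contains an underscore.

```python
def number_of(name):
    """Return the trailing number of a name like 'date_ID_number.jpg' as an int."""
    return int(name.rsplit('.', 1)[0].split('_')[2])


def smallest_per_id(names):
    """Map each ID (the middle part of the name) to its lowest-numbered file name."""
    best = {}
    for name in names:
        file_id = name.split('_')[1]
        # Keep the name if the ID is new or its number beats the current best
        if file_id not in best or number_of(name) < number_of(best[file_id]):
            best[file_id] = name
    return best


# Deliberately unordered sample taken from the question's list
names = [
    '2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
    '2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
    '2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
    '2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
]
result = smallest_per_id(names)
print(list(result.values()))
```

Looking up a particular ID is then result[ID], and list(result.values()) gives the list of file names the question asks for.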
Given that your files would already be sorted in ascending order by your OS/file manager, you can just take the first one for each common prefix:
files = ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
'2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650051140.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650168421.jpg',
'2022-04-27_Cc1cPL3punY_2825288523708854776.jpg',
'2022-04-27_Cc1cPL3punY_2825288523717189707.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383025904.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383101420.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383164193.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832399945744.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832458472617.jpg']
prefix_old = None
prefix = None
for f in files:
    parts = f.split('_', 2)
    prefix = '_'.join(parts[:2])
    if prefix != prefix_old:
        value = parts[2].split('.')[0]
        print(f'Min value with prefix {prefix} is {value}')
    prefix_old = prefix
Output
Min value with prefix 2022-04-27_Cc1a6yWpUeQ is 2825282726106954381
Min value with prefix 2022-04-27_Cc1cPL3punY is 2825288523641594007
Min value with prefix 2022-04-27_Cc1dN3Rp0es is 2825292832374568690
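Since the question asks for the names to be stored in a list rather than printed, here is a minimal sketch of the same "first of each prefix" loop that collects the full file names. The function name first_of_each_prefix is hypothetical. It relies on the input already being sorted, and an alphabetical sort only matches numeric order when all trailing numbers have the same number of digits, as they do in the sample.

```python
def first_of_each_prefix(sorted_files):
    """Return the first file name of each run of equal prefixes.

    Assumes sorted_files is already ordered by prefix and then by number.
    """
    result = []
    prefix_old = None
    for f in sorted_files:
        prefix = f.rsplit('_', 1)[0]  # everything before the trailing number
        if prefix != prefix_old:
            result.append(f)
            prefix_old = prefix
    return result


# A sorted subset of the question's list
sample = [
    '2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
    '2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
    '2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
    '2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
    '2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg',
]
first_files = first_of_each_prefix(sample)
print(first_files)
```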
It seems that your list of files is already sorted by prefix group, and then by number within each group. If that's indeed the case, you just need to take the first path of each prefix group. This can be done easily with itertools.groupby:
from itertools import groupby

for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", next(group))
If you can't rely on each group being internally ordered, find the minimum of each group by its number (str.removesuffix needs Python 3.9+):
for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", min(group, key=lambda file: int(file.rsplit('_', 1)[1].removesuffix(".jpg"))))
And if you can't even rely on the list being ordered by group, just sort it beforehand:
files.sort(key=lambda file: file.rsplit('_', 1)[0])
for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", min(group, key=lambda file: int(file.rsplit('_', 1)[1].removesuffix(".jpg"))))
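To end up with a list instead of printed lines, the last variant can be wrapped in a list comprehension. This is a minimal sketch under the same naming pattern; the helpers group_key, trailing_number and smallest_files are hypothetical names. It uses sorted() so the caller's list is left unchanged, and rsplit('.', 1) instead of removesuffix so it does not depend on Python 3.9 or on the extension being .jpg.

```python
from itertools import groupby


def group_key(file):
    """Everything before the last underscore, e.g. '2022-04-27_Cc1cPL3punY'."""
    return file.rsplit('_', 1)[0]


def trailing_number(file):
    """The number between the last underscore and the extension, as an int."""
    return int(file.rsplit('_', 1)[1].rsplit('.', 1)[0])


def smallest_files(files):
    """Return the lowest-numbered file name of each prefix group."""
    ordered = sorted(files, key=group_key)  # groupby only merges adjacent items
    return [min(group, key=trailing_number)
            for _, group in groupby(ordered, key=group_key)]


# Deliberately shuffled subset of the question's list
shuffled = [
    '2022-04-27_Cc1dN3Rp0es_2825292832383025904.jpg',
    '2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
    '2022-04-27_Cc1cPL3punY_2825288523650051140.jpg',
    '2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg',
    '2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
    '2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
]
smallest = smallest_files(shuffled)
print(smallest)
```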