Use unique groups to remove files from pathway

Question:

I have files with different dates but the same tags within each group, and I want to keep only the most recent file from each group. In code, I can group them with a dictionary whose keys are built from these tags. In the directory itself, however, only the most recent file of a single group remains.

test = ['Group_2020-01-03_ABC_Blue_2018-12-18.csv',
'Group_2020-01-13_ABC_Blue_2018-12-18.csv',
'Group_2020-01-24_ABC_Blue_2018-12-18.csv',
'Group_2020-01-03_DEF_Red_2019-01-30.csv',
'Group_2020-01-13_DEF_Red_2019-01-30.csv',
'Group_2020-01-24_DEF_Red_2019-01-30.csv',
'Group_2020-01-03_GHI_Green_2019-03-28.csv',
'Group_2020-01-13_GHI_Green_2019-03-28.csv',
'Group_2020-01-24_GHI_Green_2019-03-28.csv']

import glob
import os

dictionary = {}  # group tag -> list of file names
for file in glob.glob(path + '*'): # or test
    key = os.path.basename(file).split('_',2)[-1].split('.')[0]
    group = dictionary.get(key,[])
    group.append(os.path.basename(file))  
    dictionary[key] = group

This produces:

{'ABC_Blue_2018-12-18': ['Group_2020-01-03_ABC_Blue_2018-12-18.csv', 
    'Group_2020-01-13_ABC_Blue_2018-12-18.csv',
    'Group_2020-01-24_ABC_Blue_2018-12-18.csv'],
 'DEF_Red_2019-01-30': ['Group_2020-01-03_DEF_Red_2019-01-30.csv',
    'Group_2020-01-13_DEF_Red_2019-01-30.csv',
    'Group_2020-01-24_DEF_Red_2019-01-30.csv'],
 'GHI_Green_2019-03-28': ['Group_2020-01-03_GHI_Green_2019-03-28.csv',
    'Group_2020-01-13_GHI_Green_2019-03-28.csv',
    'Group_2020-01-24_GHI_Green_2019-03-28.csv']}

When I try to remove the files from 2020-01-03 and 2020-01-13, only one file from 2020-01-24 is left in the directory instead of one per group. My understanding is that the groups do not exist on disk, so os.remove ends up keeping just one file, but I cannot figure out how to make it behave the same way as the dictionary does.

for k,v in dictionary.items():
    print(k)
    for file in v:
        print(file)
        if os.path.join(path, file) != max(glob.glob(path + '*')):
            test.remove(file)
            # os.remove(os.path.join(path, file))

The printed keys and values show that the groups are assigned properly, and removing files from the list works as desired.

ABC_Blue_2018-12-18
Group_2020-01-03_ABC_Blue_2018-12-18.csv
Group_2020-01-13_ABC_Blue_2018-12-18.csv
Group_2020-01-24_ABC_Blue_2018-12-18.csv
DEF_Red_2019-01-30
Group_2020-01-03_DEF_Red_2019-01-30.csv
Group_2020-01-13_DEF_Red_2019-01-30.csv
Group_2020-01-24_DEF_Red_2019-01-30.csv
GHI_Green_2019-03-28
Group_2020-01-03_GHI_Green_2019-03-28.csv
Group_2020-01-13_GHI_Green_2019-03-28.csv
Group_2020-01-24_GHI_Green_2019-03-28.csv

Result from LIST (desired):

['Group_2020-01-24_ABC_Blue_2018-12-18.csv',
 'Group_2020-01-24_DEF_Red_2019-01-30.csv',
 'Group_2020-01-24_GHI_Green_2019-03-28.csv']

Result from PATHWAY:

'Group_2020-01-24_GHI_Green_2019-03-28.csv'
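
The gap between the two results can be reproduced without touching the disk. The sketch below assumes that the comparison against max(glob.glob(path + '*')) is what leaves a single file: that expression is one maximum over every path, while the desired result needs one maximum per group. The shortened names list and the variable names are illustrative only.

```python
# Illustrative file names: two dates for each of the three groups.
names = [
    'Group_2020-01-03_ABC_Blue_2018-12-18.csv',
    'Group_2020-01-24_ABC_Blue_2018-12-18.csv',
    'Group_2020-01-03_DEF_Red_2019-01-30.csv',
    'Group_2020-01-24_DEF_Red_2019-01-30.csv',
    'Group_2020-01-03_GHI_Green_2019-03-28.csv',
    'Group_2020-01-24_GHI_Green_2019-03-28.csv',
]

# One maximum over every name: a single file, whatever its group.
global_latest = max(names)

# One maximum per group: group by the tag that follows the leading date.
groups = {}
for name in names:
    key = name.split('_', 2)[-1].split('.')[0]
    groups.setdefault(key, []).append(name)

# ISO dates sort correctly as plain strings, so max() picks the newest name.
latest_per_group = sorted(max(members) for members in groups.values())

print(global_latest)
print(latest_per_group)
```

Comparing each file against max(v) for its own group, rather than against the maximum of the whole directory, is what yields one surviving file per group.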

Additionally, if I use glob.glob to refer to the directory, files are deleted, but an error is raised about the file that was just deleted. Running the code again deletes more files and raises the same error.

dictionary = {}
for file in glob.glob(path + '*'):
    
    key = os.path.basename(file).split('_',2)[-1].split('.')[0]
    group = dictionary.get(key,[])
    group.append(os.path.basename(file))  
    dictionary[key] = group
    
    for k,v in dictionary.items():
        if os.path.basename(file) != max(v):
            os.remove(file)

FileNotFoundError: [WinError 2] The system cannot find the file specified: 'C:\path\Group_2020-01-03_GHI_Green_2018-12-18.csv'
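
One plausible reading of this error, sketched in memory below: the inner loop compares the current file against the newest member of every group seen so far, not only its own group, so the same file can be passed to os.remove twice. The set named existing stands in for the directory and errors stands in for FileNotFoundError; both are assumptions made for illustration, as is the alphabetical listing order.

```python
# The three oldest files, in the order an alphabetical listing would return them.
names = [
    'Group_2020-01-03_ABC_Blue_2018-12-18.csv',
    'Group_2020-01-03_DEF_Red_2019-01-30.csv',
    'Group_2020-01-03_GHI_Green_2019-03-28.csv',
]

existing = set(names)  # stand-in for the files currently on disk
errors = []            # names that would trigger FileNotFoundError
groups = {}

for name in names:
    key = name.split('_', 2)[-1].split('.')[0]
    groups.setdefault(key, []).append(name)

    # Same shape as the loop above: every group is checked, not just `key`.
    for members in groups.values():
        if name != max(members):
            if name in existing:
                existing.discard(name)  # first "os.remove" succeeds
            else:
                errors.append(name)     # second "os.remove" would fail

print(errors)
print(sorted(existing))
```

In this simulation the third file is removed when compared against the first group and then "removed" again when compared against the second, which matches the file name reported in the traceback.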
Asked By: Gerlex


Answers:

I ended up with the following working code.

It removes the older files based on the date that appears first in each file name. If newer files with the same tag are later added to the directory, running it again removes the files that are no longer the most recent.

import os
from datetime import datetime

directory = "path"

# create a dictionary to store the most recent files from each group
most_recent_files = {}

for file in os.listdir(directory):

    stem, extension = os.path.splitext(file)
    if extension != ".csv":
        continue  # skip anything that is not one of the CSV files

    file_parts = stem.split("_")  # split to extract date and group name
    file_date = datetime.strptime(file_parts[1], "%Y-%m-%d")  # convert to datetime object
    group_name = "_".join(file_parts[2:5])  # e.g. 'ABC_Blue_2018-12-18'

    # update dictionary with most recent file for each group
    if group_name in most_recent_files:
        if file_date > most_recent_files[group_name][0]:
            os.remove(os.path.join(directory, most_recent_files[group_name][1]))  # remove older file
            most_recent_files[group_name] = (file_date, file)  # update most recent file
        else:
            os.remove(os.path.join(directory, file))  # remove current, older, file
    else:
        most_recent_files[group_name] = (file_date, file)  # add first file to dictionary
    
# print recent files from each group
for group, file_info in most_recent_files.items():
    print(f"{group}: {file_info[1]}")
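
A variant that may be easier to check before anything is deleted: compute the list of files to remove first, then delete in a separate step. This is only a sketch under the same naming assumption (Group_<date>_<tag>.<extension>); the function name plan_removals and the sample names are hypothetical and not part of the code above.

```python
from datetime import datetime


def plan_removals(names):
    """Return (keep, remove): the newest file per group, and all the others."""
    newest = {}  # group tag -> (date, file name)
    for name in names:
        stem = name.rsplit('.', 1)[0]            # drop the extension
        _, date_text, group = stem.split('_', 2)  # 'Group', leading date, tag
        date = datetime.strptime(date_text, '%Y-%m-%d')
        if group not in newest or date > newest[group][0]:
            newest[group] = (date, name)
    keep = sorted(name for _, name in newest.values())
    remove = sorted(set(names) - set(keep))
    return keep, remove


# Illustrative input; in practice this could be os.listdir(directory).
names = [
    'Group_2020-01-03_ABC_Blue_2018-12-18.csv',
    'Group_2020-01-24_ABC_Blue_2018-12-18.csv',
    'Group_2020-01-13_DEF_Red_2019-01-30.csv',
    'Group_2020-01-03_DEF_Red_2019-01-30.csv',
]
keep, remove = plan_removals(names)
print(keep)
print(remove)
```

Once the remove list looks right, each entry can be passed to os.remove(os.path.join(directory, name)). Because every file is deleted at most once, a second deletion of the same path cannot occur.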
Answered By: Gerlex