Classifying a list of filenames into their respective types
Question:
I have a list containing many filenames from a directory, each of which belongs to a file type. Say the list looks like this:
list = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']
and the file types are stored in a .csv file like this:
,type
0,apple
1,apple_tea
2,apple_town
I want to classify each filename in the list by its file type and put the results into a dictionary. Say the dictionary would look like this after processing:
dictionary = {
    'apple': ['apple-20220103.csv'],
    'apple_tea': ['apple_tea-20220304.csv'],
    'apple_town': ['20220203-apple_town.csv', 'apple_town20220101.csv']
}
The question is: how can I ensure that apple does not receive any file besides apple-20220103.csv, even though other filenames also contain the word apple? I've tried simple regex matching, and the result still has the apple_tea and apple_town filenames in apple.
Answers:
You could match everything that is not a digit or a dash with the pattern given below.
You can then use the complete match as a key for your dictionary.
import re

your_list = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']
pattern = '[^0-9-]+'
for element in your_list:
    # element[:-4] strips the ".csv" extension before searching
    a = re.search(pattern, element[:-4])
    print(a.group())
# Output
apple
apple_tea
apple_town
apple_town
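To get from the printed keys to the dictionary the question asks for, the matches can be collected per key. The following is a minimal sketch rather than part of the original answer: the function name classify_by_name is hypothetical, and it assumes that every filename ends in a four-character extension such as .csv and that type names never contain digits or dashes.

```python
import re
from collections import defaultdict


def classify_by_name(filenames):
    """Group filenames by their first run of characters that are not digits or dashes."""
    pattern = re.compile(r'[^0-9-]+')
    groups = defaultdict(list)
    for name in filenames:
        # name[:-4] drops the assumed four-character ".csv" suffix
        match = pattern.search(name[:-4])
        if match:  # skip names made up only of digits and dashes
            groups[match.group()].append(name)
    return dict(groups)


your_list = ['apple-20220103.csv', 'apple_tea-20220304.csv',
             '20220203-apple_town.csv', 'apple_town20220101.csv']
result = classify_by_name(your_list)
print(result)
```

Note that this approach never consults the list of types in the .csv file, so a filename with an unexpected name simply creates a new key.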
Please take a look at the word boundary \b
import re

filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']
categories = ['apple', 'apple_tea', 'apple_town']
for category in categories:
    print(category)
    # \b on both sides requires the category to appear as a whole word
    pattern = r'\b' + re.escape(category) + r'\b'
    for filename in filenames:
        if re.search(pattern, filename):
            print(filename)
    print()
gives output
apple
apple-20220103.csv
apple_tea
apple_tea-20220304.csv
apple_town
20220203-apple_town.csv
From the re docs:
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. (...)
I also use re.escape to make sure that if a character with special meaning (e.g. a dot) appears in a category name, it is treated as a literal character.
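Because digits and underscores count as word characters, \b does not match between apple_town and 20220101, which is why apple_town20220101.csv is missing from the output above. One possible variation, sketched here under the assumption that type names consist only of letters and underscores, swaps \b for lookarounds that reject only a neighbouring letter or underscore. The function name classify_by_type is hypothetical.

```python
import re


def classify_by_type(filenames, categories):
    """Map each category to the filenames that contain it as a whole name."""
    result = {}
    for category in categories:
        # The category must not be directly preceded or followed by a letter
        # or underscore; digits, dashes and dots are allowed as neighbours.
        pattern = re.compile(
            r'(?<![A-Za-z_])' + re.escape(category) + r'(?![A-Za-z_])'
        )
        result[category] = [name for name in filenames if pattern.search(name)]
    return result


filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv',
             '20220203-apple_town.csv', 'apple_town20220101.csv']
categories = ['apple', 'apple_tea', 'apple_town']
classified = classify_by_type(filenames, categories)
print(classified)
```

With these lookarounds, apple is rejected inside apple_tea and apple_town because an underscore follows it, while apple_town is still accepted when digits follow it directly.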
One approach to the problem could be to use the difflib library.
import pandas as pd
import difflib

filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv', '20220203-apple_town.csv', 'apple_town20220101.csv']
csv_file = pd.read_csv("file.csv")
thisdict = {}
for _, row in csv_file.iterrows():
    # n=len(filenames) and cutoff=0 rank every filename; close[0] is the best match
    close = difflib.get_close_matches(row["type"], filenames, n=len(filenames), cutoff=0)
    thisdict[row["type"]] = close[0]
print(thisdict)
This produces the following output.
{'apple': 'apple-20220103.csv', 'apple_tea': 'apple_tea-20220304.csv', 'apple_town': 'apple_town20220101.csv'}
Notice that only the single closest filename is stored for each type, so apple_town does not receive 20220203-apple_town.csv.
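To collect every file rather than one per type, the lookup can be turned around: for each filename, ask difflib which type is closest. The following is only a sketch, not the original answer's method. The function name classify_by_closest_type is hypothetical, the types are hard-coded here instead of being read from the .csv file, and because this is fuzzy matching it may assign a file to the wrong type when names are more similar than in this example.

```python
import difflib
import os
from collections import defaultdict


def classify_by_closest_type(filenames, types):
    """Assign each filename to the type that difflib scores as most similar."""
    groups = defaultdict(list)
    for name in filenames:
        stem = os.path.splitext(name)[0]  # drop the file extension
        # n=1 with cutoff=0 always returns the single best-scoring type
        best = difflib.get_close_matches(stem, types, n=1, cutoff=0)
        groups[best[0]].append(name)
    return dict(groups)


filenames = ['apple-20220103.csv', 'apple_tea-20220304.csv',
             '20220203-apple_town.csv', 'apple_town20220101.csv']
types = ['apple', 'apple_tea', 'apple_town']  # assumed contents of the "type" column
by_type = classify_by_closest_type(filenames, types)
print(by_type)
```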