Randomly splitting 1 file from many files based on ID
Question:
In my dataset, I have a large number of images in JPG format, named [ID]_[Cam]_[Frame].jpg. The dataset contains many IDs, and every ID has a different number of images. I want to randomly take one image from each ID into a different set of images. The problem is that the IDs in the dataset aren’t always consecutive (some numbers are skipped). In the example below, the set of files doesn’t include ID numbers 2 and 3.
Is there any python code to do this?
Before
- TrainSet
- 00000000_0001_00000000.jpg
- 00000000_0001_00000001.jpg
- 00000000_0002_00000001.jpg
- 00000001_0001_00000001.jpg
- 00000001_0002_00000001.jpg
- 00000001_0002_00000002.jpg
- 00000004_0001_00000001.jpg
- 00000004_0002_00000001.jpg
After
- TrainSet
- 00000000_0001_00000000.jpg
- 00000000_0001_00000002.jpg
- 00000001_0002_00000001.jpg
- 00000001_0001_00000001.jpg
- 00000004_0001_00000001.jpg
- ValidationSet
- 00000000_0001_00000001.jpg
- 00000001_0001_00000002.jpg
- 00000004_0001_00000002.jpg
Answers:
You can use a sort along with a dictionary data structure.
For example:
myDict = {'a': '00000000_0001_00000000.jpg', 'b': '00000000_0001_00000001.jpg'}
myKeys = list(myDict.keys())
myKeys.sort()
sorted_dict = {i: myDict[i] for i in myKeys}
print(sorted_dict)
In this case, I would use a dictionary with the ID as the key and a list of the file names with a matching ID as the value, then randomly pick one file from each list.
import os
import shutil
from pathlib import Path
from random import choice

source_folder = "SOURCE_FOLDER"
dest_folder = "DEST_FOLDER"

# Group file names by their ID prefix
ids = {}
for f in os.listdir(source_folder):
    f_id = f.split("_")[0]
    ids[f_id] = [f, *ids.get(f_id, [])]

# Move one randomly chosen file per ID
Path(dest_folder).mkdir(parents=True, exist_ok=True)
for files in ids.values():
    random_file = choice(files)
    shutil.move(
        os.path.join(source_folder, random_file),
        os.path.join(dest_folder, random_file),
    )
In your case, replace SOURCE_FOLDER with TrainSet and DEST_FOLDER with ValidationSet.
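The same grouping idea can be wrapped in a reusable function. This is a hedged sketch, not part of the original answer: `split_one_per_id` and the `seed` parameter are names introduced here for illustration. Using `collections.defaultdict` simplifies the grouping, and seeding a local `random.Random` makes the split reproducible.

```python
import os
import random
import shutil
from collections import defaultdict
from pathlib import Path

def split_one_per_id(source_folder, dest_folder, seed=None):
    """Move one randomly chosen file per ID prefix from source_folder to dest_folder."""
    rng = random.Random(seed)  # seeded RNG makes the split reproducible
    groups = defaultdict(list)
    for name in os.listdir(source_folder):
        groups[name.split("_")[0]].append(name)  # group by the [ID] prefix
    Path(dest_folder).mkdir(parents=True, exist_ok=True)
    for files in groups.values():
        picked = rng.choice(files)
        shutil.move(os.path.join(source_folder, picked),
                    os.path.join(dest_folder, picked))
```

Called as `split_one_per_id("TrainSet", "ValidationSet", seed=42)`, it moves exactly one file per ID regardless of gaps in the ID numbering.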
Here’s a Pandas DataFrame solution that avoids moving the files between folders. The str.extract method extracts the text matching a regex pattern as new columns in a DataFrame. The file names are grouped by the values in the newly created f_id column. The groupby.sample method returns a random sample from each group, and the random_state parameter allows reproducibility.
import numpy as np
import pandas as pd

# Load file names into a data frame
data = [
    {"fname": "00000000_0001_00000000.jpg"},
    {"fname": "00000000_0001_00000001.jpg"},
    {"fname": "00000000_0002_00000001.jpg"},
    {"fname": "00000001_0001_00000001.jpg"},
    {"fname": "00000001_0002_00000001.jpg"},
    {"fname": "00000001_0002_00000002.jpg"},
    {"fname": "00000004_0001_00000001.jpg"},
    {"fname": "00000004_0002_00000001.jpg"},
]
df = pd.DataFrame(data)

# Extract 'f_id' from 'fname' string
df = df.join(df["fname"].str.extract(r'^(?P<f_id>\d+)_'))

sample_size = 1  # sample size
state_seed = 43  # reproducible
group_list = ["f_id"]

# Add 'validation' column
df["validation"] = 0

# Increment 'validation' by 1 for selected samples
df["validation"] = df.groupby(group_list).sample(n=sample_size, random_state=state_seed)["validation"].add(1)

# Reset 'NaN' values to 0
df["validation"] = df["validation"].fillna(0).astype(np.int8)
The result is a DataFrame with a value of 1 in the validation column for the selected file names.
|   | fname                      | f_id     | validation |
|---|----------------------------|----------|------------|
| 0 | 00000000_0001_00000000.jpg | 00000000 | 0          |
| 1 | 00000000_0001_00000001.jpg | 00000000 | 1          |
| 2 | 00000000_0002_00000001.jpg | 00000000 | 0          |
| 3 | 00000001_0001_00000001.jpg | 00000001 | 1          |
| 4 | 00000001_0002_00000001.jpg | 00000001 | 0          |
| 5 | 00000001_0002_00000002.jpg | 00000001 | 0          |
| 6 | 00000004_0001_00000001.jpg | 00000004 | 0          |
| 7 | 00000004_0002_00000001.jpg | 00000004 | 1          |
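Once the validation flag is in place, the two splits can be read back out with boolean indexing. A minimal sketch, assuming a DataFrame with the same fname and validation columns as above:

```python
import pandas as pd

# Minimal frame mirroring the structure produced above
df = pd.DataFrame({
    "fname": ["00000000_0001_00000000.jpg", "00000000_0001_00000001.jpg"],
    "validation": [0, 1],
})

# Boolean indexing splits the file names into the two sets
train_files = df.loc[df["validation"] == 0, "fname"].tolist()
val_files = df.loc[df["validation"] == 1, "fname"].tolist()
```

These lists can then feed a data loader directly, so the files never have to leave the original folder.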