Getting a percentage of files in a folder
Question:
I wrote a script, part of which selects a random sample of 10% of the files in a directory and copies them to a new directory. My method is below, but it yields less than 10% (~9.6%) each time, and never the same number.
for x in range(int(len(files) * .1)):
    to_copy = choice(files)
    shutil.copy(os.path.join(subdir, to_copy), os.path.join(output_folder))
This gave:
# files in source    run 1    run 2
29841                2852     2845
1595                 152      156
11324                1084     1082
Answers:
By calling random.choice() repeatedly, you are effectively choosing with replacement. This means that you might choose the same file twice, in separate trips around the loop.
Try random.sample() instead:
for to_copy in random.sample(files, int(len(files) * .1)):
    shutil.copy(...)
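For completeness, here is a minimal, self-contained sketch of how the whole copy step might look with random.sample(). The function name copy_random_fraction and its parameters are hypothetical, not taken from the original script, and it assumes the files sit directly in one source directory.

```python
import os
import random
import shutil


def copy_random_fraction(source_dir, output_dir, fraction=0.1):
    """Copy a random `fraction` of the files in source_dir to output_dir.

    Hypothetical helper for illustration; names are not from the question.
    """
    # Only consider regular files, not subdirectories.
    files = [f for f in os.listdir(source_dir)
             if os.path.isfile(os.path.join(source_dir, f))]
    count = int(len(files) * fraction)
    os.makedirs(output_dir, exist_ok=True)
    # random.sample() picks `count` distinct entries (no replacement).
    chosen = random.sample(files, count)
    for name in chosen:
        shutil.copy(os.path.join(source_dir, name), output_dir)
    return chosen
```

Because every chosen name is distinct, the number of files that end up in the output directory should equal int(len(files) * fraction), provided the output directory starts out empty.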
Consider this program:
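import random
seq = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
# Five independent picks: the same element may come up more than once.
for _ in range(5):
    i = random.choice(seq)
    print(i, end=' ')
print()
# One sample of five: every element is distinct.
for i in random.sample(seq, 5):
    print(i, end=' ')
print()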
Here are two runs of the program:
$ python x.py
g f e c b
c j b a d
$ python x.py
c e a a e
j f e a i
Notice that, in the first line of the second run, random.choice() randomly selected a twice and e twice. If these were filenames, it would appear that only 3 files were copied. In fact, five copies are performed, but the redundant copies overwrite files that were already copied, so they don’t add to the file count. Of course, the number of repeated choices is itself random, which leads to the inconsistent counts you see.
On the other hand, the second line, produced by random.sample(), will never contain repeated elements, since random.sample() chooses without replacement.
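As a rough check on the numbers in the question: when you draw k times with replacement from n items, the expected number of distinct items is n * (1 - (1 - 1/n)**k). With k equal to 10% of n, that works out to roughly 9.5% of n, which is in line with the counts reported. The sketch below computes this and compares it with a simulation; the helper names are made up for illustration, and the fixed seed is only there to make the run repeatable.

```python
import random


def expected_distinct(n, k):
    """Expected number of distinct items after k draws with replacement from n."""
    return n * (1 - (1 - 1 / n) ** k)


def simulate_distinct(n, k, rng):
    """Draw k times with replacement and count the distinct items seen."""
    return len({rng.randrange(n) for _ in range(k)})


rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
for n in (29841, 1595, 11324):  # source folder sizes from the question
    k = int(n * 0.1)
    print(n, k, round(expected_distinct(n, k), 1), simulate_distinct(n, k, rng))
```

The simulated count varies from run to run when the seed is changed, just as the file counts did, while random.sample() would always give exactly k distinct items.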