Getting a percentage of files in a folder

Question:

I wrote a script, part of which selects a random sample of 10% of the files in a directory and copies them to a new directory. My approach is below, but it yields less than 10% (about 9.6%) each time, and never the same number of files.

for x in range(int(len(files) *.1)):
    to_copy = choice(files)
    shutil.copy(os.path.join(subdir, to_copy), os.path.join(output_folder))

This gave:

# files in source   run 1        run 2
29841               2852         2845
1595                152          156
11324               1084         1082

Asked By: physlexic

Answers:

By calling random.choice() repeatedly, you are effectively choosing with replacement. This means that you might be choosing the same file twice, in separate trips around the loop.

Try random.sample() instead:

for to_copy in random.sample(files, int(len(files)*.1)):
    shutil.copy(...)
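
For completeness, here is a minimal, self-contained sketch of the same idea. The helper name copy_random_fraction and its parameters (src_dir, dst_dir, fraction, seed) are hypothetical and not from the question; the sketch also assumes that only regular files directly inside src_dir should be considered.

```python
import os
import random
import shutil


def copy_random_fraction(src_dir, dst_dir, fraction=0.1, seed=None):
    """Copy a random `fraction` of the regular files in src_dir to dst_dir.

    Hypothetical helper for illustration; returns the list of copied names.
    """
    rng = random.Random(seed)  # pass a seed only if you want repeatable picks
    # Consider only regular files directly inside src_dir (an assumption).
    files = sorted(
        name for name in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, name))
    )
    k = int(len(files) * fraction)  # same rounding-down as in the question
    os.makedirs(dst_dir, exist_ok=True)
    chosen = rng.sample(files, k)  # without replacement: k distinct names
    for name in chosen:
        shutil.copy(os.path.join(src_dir, name), dst_dir)
    return chosen
```

Because random.sample() never returns the same element twice, the destination ends up with exactly int(len(files) * fraction) files, provided it was empty beforehand and no names collide with existing files.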

Consider this program:

import random

seq = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

for _ in range(5):
    i = random.choice(seq)
    print(i, end=' ')
print()
for i in random.sample(seq, 5):
    print(i, end=' ')
print()

Here are two runs of the program:

$ python x.py 
g f e c b 
c j b a d 
$ python x.py 
c e a a e 
j f e a i 

Notice that, in the first line of the second run, random.choice() selected 'a' twice and 'e' twice. If these were filenames, it would appear that only 3 files were copied. In fact, five copies are performed, but each redundant copy just overwrites a file that is already in the destination, so it doesn’t add to the file count. The number of repeated choices is itself random, which leads to the inconsistent counts you see.

On the other hand, the second line, produced by random.sample(), will never contain repeated elements, since random.sample() chooses without replacement.
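
The size of the shortfall can also be estimated. When k items are drawn with replacement from n, the expected number of distinct items is n * (1 - (1 - 1/n)**k). With k equal to 10% of n, this comes to roughly 9.5% of n, which is in line with the ~9.6% reported in the question. The sketch below compares that formula with a simulation; the function names expected_distinct and simulate_distinct are made up for illustration.

```python
import random


def expected_distinct(n, k):
    """Expected number of distinct items after k draws with replacement from n."""
    return n * (1 - (1 - 1 / n) ** k)


def simulate_distinct(n, k, seed=None):
    """Draw k indices with replacement from range(n) and count the distinct ones."""
    rng = random.Random(seed)
    return len({rng.randrange(n) for _ in range(k)})


# Folder sizes taken from the table in the question.
for n in (29841, 1595, 11324):
    k = int(n * 0.1)  # number of random.choice() calls in the original loop
    print(n, k, round(expected_distinct(n, k), 1), simulate_distinct(n, k))
```

The simulated counts vary from run to run, just like the file counts in the question, while staying close to the expected value.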

Answered By: Robᵩ