Best way to choose a random file from a directory
Question:
What is the best way to choose a random file from a directory in Python?
Edit: Here is what I am doing:
import os
import random
import dircache
dir = 'some/directory'
filename = random.choice(dircache.listdir(dir))
path = os.path.join(dir, filename)
Is this particularly bad, or is there a particularly better way?
Answers:
Independent of the language used, you can read all references to the files in a directory into a data structure such as an array (something like ‘listFiles’), get the length of the array, calculate a random number in the range ‘0’ to ‘arrayLength-1’, and access the file at that index. This should work in any language, not only in Python.
If you don’t know beforehand what files are there, you will need to get a list, then just pick a random index in the list.
Here’s one attempt:
import os
import random
def getRandomFile(path):
    """
    Returns a random filename, chosen among the files of the given path.
    """
    files = os.listdir(path)
    index = random.randrange(0, len(files))
    return files[index]
EDIT: The question now mentions a fear of a “race condition”, which I can only assume is the typical problem of files being added/removed while you are in the process of trying to pick a random file.
I don’t believe there is a way around that, other than keeping in mind that any I/O operation is inherently “unsafe”, i.e. it can fail. So, the algorithm to open a randomly chosen file in a given directory should:
- Actually open() the file selected, and handle a failure, since the file might no longer be there
- Probably limit itself to a set number of tries, so it doesn’t die if the directory is empty or if none of the files are readable
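The two points above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the function name open_random_file and the max_tries and mode parameters are made up for this example.

```python
import os
import random


def open_random_file(directory, max_tries=10, mode="rb"):
    """Try to open a randomly chosen file in `directory`.

    Returns an open file object, or None if nothing could be opened
    within `max_tries` attempts. (Hypothetical helper for illustration.)
    """
    for _ in range(max_tries):
        names = os.listdir(directory)  # re-list on each try; contents may change
        if not names:
            return None  # empty directory: nothing to pick
        candidate = os.path.join(directory, random.choice(names))
        try:
            return open(candidate, mode)
        except OSError:
            # The entry vanished, is a directory, or is unreadable; try again.
            continue
    return None
```

The caller is responsible for closing the returned file. A missing directory is deliberately not handled here, so os.listdir would still raise in that case.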
import os, random
random.choice(os.listdir("C:\\"))  # change dir name to whatever
Regarding your edited question: first, I assume you know the risks of using dircache, as well as the fact that it has been deprecated since 2.6 and was removed in 3.0.
Second of all, I don’t see where any race condition exists here. Your dircache object is basically immutable (after the directory listing is cached, it is never read again), so there is no harm in concurrent reads from it.
Other than that, I do not understand why you see any problem with this solution. It is fine.
Language agnostic solution:
1) Get the total no. of files in specified directory.
2) Pick a random number from 0 to [total no. of files – 1].
3) Get the list of filenames as a suitably indexed collection or such.
4) Pick the nth element, where n is the random number.
If you want directories included, use Yuval A’s answer. Otherwise:
import os, random
random.choice([x for x in os.listdir("C:\\") if os.path.isfile(os.path.join("C:\\", x))])
The problem with most of the solutions given is that you load all your input into memory, which can become a problem for large inputs/hierarchies. Here’s a solution adapted from The Perl Cookbook by Tom Christiansen and Nat Torkington. To get a random file anywhere beneath a directory:
#! /usr/bin/env python
import os, random
n = 0
rfile = None  # stays None if no files are found
random.seed()
for root, dirs, files in os.walk('/tmp/foo'):
    for name in files:
        n += 1
        # keep the n-th file with probability 1/n
        if random.uniform(0, n) < 1:
            rfile = os.path.join(root, name)
print(rfile)
Generalizing a bit makes a handy script:
$ cat /tmp/randy.py
#! /usr/bin/env python
import sys, random
random.seed()
n = 1
for line in sys.stdin:
    if random.uniform(0, n) < 1:
        rline = line
    n += 1
sys.stdout.write(rline)
$ /tmp/randy.py < /usr/share/dict/words
chrysochlore
$ find /tmp/foo -type f | /tmp/randy.py
/tmp/foo/bar
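The same single-pass idea can also be wrapped in a reusable function. The sketch below is one possible way to do it; the names reservoir_choice and iter_files are invented for this example and do not come from the answer or from The Perl Cookbook.

```python
import os
import random


def reservoir_choice(items, rng=random):
    """Return one item chosen uniformly from an iterable, in a single pass.

    Raises ValueError if the iterable is empty.
    """
    chosen = None
    count = 0
    for count, item in enumerate(items, start=1):
        # Replace the current pick with probability 1/count.
        if rng.randrange(count) == 0:
            chosen = item
    if count == 0:
        raise ValueError("cannot choose from an empty iterable")
    return chosen


def iter_files(top):
    """Yield the path of every file beneath `top` without building a list."""
    for root, _dirs, files in os.walk(top):
        for name in files:
            yield os.path.join(root, name)
```

Something like reservoir_choice(iter_files('/tmp/foo')) would then pick a file while holding only one candidate path in memory at a time.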
The simplest solution is to use the os.listdir and random.choice functions:
random_file=random.choice(os.listdir("Folder_Destination"))
Let’s take a look at it step by step:
1} The os.listdir method returns a list containing the names of the entries (files and subdirectories) in the specified path.
2} This list is then passed as a parameter to the random.choice method, which returns a random name from the list.
3} The name is stored in the random_file variable.
A real-world application
Here’s some sample Python code that moves random files from one directory to another:
import os, random, shutil
# Prompt the user for the source and destination directories and the number of files to select
source=input("Enter the Source Directory : ")
dest=input("Enter the Destination Directory : ")
no_of_files=int(input("Enter The Number of Files To Select : "))
print("%"*25+"{ Details Of Transfer }"+"%"*25)
print("\n\nList of Files Moved to %s :-"%(dest))
# Use a for loop to randomly choose multiple files
for i in range(no_of_files):
    # random_file stores the name of the randomly chosen entry
    random_file=random.choice(os.listdir(source))
    print("%d} %s"%(i+1,random_file))
    # os.path.join inserts the path separator if source lacks a trailing one
    source_file=os.path.join(source,random_file)
    dest_file=dest
    # shutil.move moves the file from one directory to another
    shutil.move(source_file,dest_file)
print("\n\n"+"$"*33+"[ Files Moved Successfully ]"+"$"*33)
You can check out the whole project on GitHub: Random File Picker
For additional reference on the os.listdir and random.choice methods, you can refer to the tutorialspoint Learn Python tutorial:
os.listdir :- Python listdir() method
random.choice :- Python choice() method
Python 3 has the pathlib module, which can be used to reason about files and directories in a more object-oriented fashion:
from random import choice
from pathlib import Path
path: Path = Path()
# The Path.iterdir method returns a generator, so we must convert it to a list
# before passing it to random.choice, which expects a sequence.
random_path = choice(list(path.iterdir()))
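Note that iterdir also yields subdirectories. If only regular files are wanted, the listing could be filtered first. A minimal sketch, where the helper name random_file is hypothetical:

```python
from pathlib import Path
from random import choice


def random_file(directory="."):
    """Return a random regular file (as a Path) directly inside `directory`."""
    # Keep only regular files; subdirectories are skipped.
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    if not files:
        raise FileNotFoundError(f"no files found in {directory!r}")
    return choice(files)
```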
For those who come here needing to pick a large number of files from an even larger set, and perhaps copy or move them to another directory, the approaches above, which pick one file at a time, are too slow.
Given enough memory, one could read the whole directory listing into a list and then use the random.choices function to select 17 elements, for example:
from random import choices
from glob import glob
from shutil import copy
file_list = glob([SRC DIR] + '*' + [FILE EXTENSION])
picked_files = choices(file_list, k=17)
Now picked_files is a list of 17 filenames picked at random, which can be copied or moved even in parallel, for example:
import multiprocessing as mp
from itertools import repeat
from shutil import copy
def copy_files(filename, dest):
    print(f"Working on file: {filename}")
    copy(filename, dest)
with mp.Pool(processes=(mp.cpu_count() - 1) or 1) as p:
    p.starmap(copy_files, zip(picked_files, repeat([DEST PATH])))
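One caveat about the selection step: random.choices samples with replacement, so the same filename can appear more than once in picked_files. If distinct files are needed, random.sample is an alternative. A rough sketch, assuming a hypothetical helper pick_distinct_files that takes the source directory and a glob pattern:

```python
import random
from pathlib import Path


def pick_distinct_files(src_dir, pattern, k):
    """Return up to `k` distinct paths in `src_dir` matching the glob `pattern`."""
    candidates = sorted(Path(src_dir).glob(pattern))
    # random.sample raises ValueError if k exceeds the population size,
    # so cap k at the number of candidates.
    return random.sample(candidates, min(k, len(candidates)))
```

The resulting list could be passed to the parallel copy in place of picked_files.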