Randomize strings from a list with constraints on the beginning of the string

Question:

I have a list RIR_list of filenames of the form number/filename, for example 3/foo. The numbers are integers from 1 to 30 in this case (without loss of generality).

I wish to choose a sub-list of n pairs from this list. Both entries of each pair should begin with the same number. The following code does this (if I have missed nothing):

import random
import numpy as np

# choose a random beginning (1 to 30) for each pair
room_nb = np.random.randint(30, size=n) + 1
# iterate through pairs
for i in range(n):
    # generate sublist containing only entries with the correct beginning for this iteration
    room_RIR = [rir for rir in RIR_list if rir.startswith(str(room_nb[i]) + '/')]
    # pick a random pair with the same header for this iteration
    chosen_RIR = random.choices(room_RIR, k=2)

If I only wanted n random entries, a one-liner such as random.choices(RIR_list, k=n), called twice for pairs, would do. Is there a more elegant way to do the full job? More importantly, is there one with a lower computational cost?

P.S.
Pairs that contain the same file name twice are not allowed. Each number happens to contain the same number of files, but if the counts differed, a uniform distribution with respect to that number would be preferred. That is, if a number contains two files, the probability would be 0.5 for each.

Asked By: havakok


Answers:

Instead of finding files with the same prefix each time you create a pair, you could group files by their prefix once and store them in a dictionary. This way, you can randomly select an entry from that dict and then a sample from that group.

import random
files = ["%02d/%03d" % (random.randint(0, 10), random.randint(100,999))
         for _ in range(100)]

grouped = {}
for f in files:
    grouped.setdefault(f.split("/")[0], []).append(f)
groups = list(grouped.values())

pairs = [random.sample(random.choice(groups), 2) for _ in range(3)]
# [['00/982', '00/123'],
#  ['04/644', '04/649'],
#  ['01/164', '01/316']]

This means, however, that each number will have the same probability, no matter how many files there are starting with that number. If you want the probabilities to reflect the number of files, you could randomly select a file, get the prefix, and then get the pair from the respective group.

n = random.choice(files).split("/")[0]
pair = random.sample(grouped[n], 2)
# ['00/866', '00/592']
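Both variants can be wrapped into one helper that draws n pairs at once. This is only a minimal sketch, not part of the original answer: the function name sample_pairs and its weighted flag are hypothetical, filenames are assumed to be unique, and prefixes with fewer than two files are skipped (an added assumption, since random.sample would otherwise raise ValueError for them).

```python
import random
from collections import defaultdict

def sample_pairs(files, n, weighted=True):
    """Draw n pairs of files that share a prefix (hypothetical helper).

    weighted=True  -> a prefix is chosen with probability proportional
                      to the number of files it contains.
    weighted=False -> every prefix is equally likely.
    """
    # Group the files by prefix once, instead of filtering the list per pair.
    grouped = defaultdict(list)
    for f in files:
        prefix = f.split("/")[0]
        grouped[prefix].append(f)

    # Assumption: prefixes with fewer than two files cannot form a pair
    # of distinct files, so they are left out.
    groups = [g for g in grouped.values() if len(g) >= 2]
    if not groups:
        raise ValueError("no prefix has at least two files")

    weights = [len(g) for g in groups] if weighted else None
    # random.choices picks the groups (with replacement, so a prefix may repeat);
    # random.sample then picks two distinct positions within each chosen group.
    chosen_groups = random.choices(groups, weights=weights, k=n)
    return [random.sample(g, 2) for g in chosen_groups]

# Example data in the question's number/filename form (made up for illustration).
example_files = [f"{room}/file{idx}" for room in range(1, 4) for idx in range(5)]
example_pairs = sample_pairs(example_files, 10)
```

The grouping costs one pass over the list; after that, each pair needs only one group draw and one two-element sample, rather than a scan of the whole list per pair.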

(Using random.sample here so that the two elements of a pair are distinct; if you want to allow a pair to contain the same element twice, use random.choices.)
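The difference between the two functions can be seen on a single group. The group contents below are made up for illustration; random.sample draws without replacement, while random.choices draws with replacement.

```python
import random

# A hypothetical group of files that share the prefix "03".
group = ["03/a", "03/b", "03/c"]

# Without replacement: the two positions drawn are always different,
# so with unique filenames the pair never repeats a file.
distinct_pair = random.sample(group, 2)

# With replacement: the same file may be drawn twice.
any_pair = random.choices(group, k=2)
```

Note that random.sample raises ValueError if the group has fewer than two elements, whereas random.choices works for any non-empty group.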

Answered By: tobias_k