Python weighted random choices from lists with different probability, comparing two Pandas DataFrame from CSVs

Question:

Python beginner, here. I am attempting to take a pandas DataFrame (created from a CSV) and use weighted random choices to choose from another DataFrame (created from a CSV). What I have is two pandas DataFrames that read something like this:

Weighted Percentages of codes:

SECTION CODE Final_Per
B1 800 5%
B1 801 65%
B1 802 30%
B2 900 30%
B2 901 70%
B3 600 50%
B3 601 50%

Input pandas DataFrame to run weighted percentages on:

SECTION NUMBER
B1 14
B2 25
B3 12

These are just examples of my tables rather than the entirety of the tables themselves. What I need to do is store these weighted probabilities, whether in a dictionary, lists, or pandas DataFrames (not sure what’s best), and then take my second table above, apply the ‘Final_Per’ percentages to the ‘NUMBER’ column, and output the result. So B1’s result would be 14 values, with 5% being code 800, 65% being code 801, and 30% being code 802. Currently, the tables are CSVs that I am turning into pandas DataFrames, and I have been trying to apply some lessons from this article https://pynative.com/python-weighted-random-choices-with-probability/ without success. Does anybody have suggestions on how to handle this correctly? Thank you.

Asked By: Connor Garrett

||

Answers:

If you reshape the CSV data into something like:

SECTION_COUNTS = {
    "B1": 14,
    "B2": 25,
    "B3": 12,
}

SECTION_DISTRIBUTIONS = {
    "B1": [
        {"code": 800, "from": 1, "to": 5},
        {"code": 801, "from": 6, "to": 70},
        {"code": 802, "from": 71, "to": 100}
    ],
    "B2": [
        {"code": 900, "from": 1, "to": 30},
        {"code": 901, "from": 31, "to": 100}
    ],
    "B3": [
        {"code": 600, "from": 1, "to": 50},
        {"code": 601, "from": 51, "to": 100}
    ]
}

Then I think the answer you seek might be given by:

import random

results = {}
for section_id, count in SECTION_COUNTS.items():
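    for _ in range(count):
        # Draw once per pick; calling randint inside the generator would
        # re-roll for every row and could match no row at all
        roll = random.randint(1, 100)
        code = next(
            row["code"]
            for row
            in SECTION_DISTRIBUTIONS[section_id]
            if row["from"] <= roll <= row["to"]
        )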
        results.setdefault(section_id, []).append(code)
print(results)

Resulting in something like:

{
    'B1': [801, 801, 802, 801, 801, 802, 800, 802, 802, 801, 802, 801, 800, 801],
    'B2': [900, 900, 900, 900, 900, 901, 900, 900, 901, 900, 900, 901, 901, 900, 900, 900, 901, 900, 901, 900, 900, 901, 900, 901, 900],
    'B3': [601, 601, 600, 600, 600, 601, 601, 601, 600, 601, 600, 600]
}
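
As a possible alternative to the from/to ranges, the standard library's random.choices accepts relative weights directly. The following is only a minimal sketch under the assumption that the percentages are stored per section as a {code: weight} dictionary; the names SECTION_WEIGHTS and draw_codes are made up for this illustration and do not appear in the question.

```python
import random

# Hypothetical layout: {section: {code: weight}}, using the question's percentages.
SECTION_WEIGHTS = {
    "B1": {800: 5, 801: 65, 802: 30},
    "B2": {900: 30, 901: 70},
    "B3": {600: 50, 601: 50},
}
SECTION_COUNTS = {"B1": 14, "B2": 25, "B3": 12}


def draw_codes(counts, weights, rng=random):
    """Return {section: [codes]} with counts[section] weighted draws per section."""
    results = {}
    for section_id, count in counts.items():
        codes = list(weights[section_id])
        code_weights = list(weights[section_id].values())
        # random.choices samples with replacement; the weights are relative,
        # so they do not have to sum to 1 or to 100.
        results[section_id] = rng.choices(codes, weights=code_weights, k=count)
    return results


results = draw_codes(SECTION_COUNTS, SECTION_WEIGHTS)
print(results)
```

Each run gives different lists. Because every value is drawn independently, the share of each code in a short list such as B1's 14 values will only approximate the stated percentages.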

Addendum:

It was asked how one might go about reshaping a CSV like:

SECTION,CODE,Final_Per
B1,800,5
B1,801,65
B1,802,30
B2,900,30
B2,901,70
B3,600,50
B3,601,50

into our SECTION_DISTRIBUTIONS. Let’s take a look at that. The way I would initially tackle this is to keep track of some information from the prior row of our CSV as we iterate through the rows:

import csv
import json

with open("in.csv", "r") as file_in:
    SECTION_DISTRIBUTIONS = {}
    prior_to = 0
    prior_section = ""

    for current_row in csv.DictReader(file_in):
        curr_section = current_row["SECTION"]

        if prior_section != curr_section:
            prior_section = curr_section
            prior_to = 0

        # csv.DictReader yields strings; use int(current_row["CODE"]) if integer codes are wanted
        curr_code = current_row["CODE"]
        curr_from = prior_to + 1
        curr_to = prior_to + int(current_row["Final_Per"])

        target = SECTION_DISTRIBUTIONS.setdefault(curr_section, [])
        target.append({"code": curr_code, "from": curr_from, "to": curr_to})
        prior_to = curr_to

    print(json.dumps(SECTION_DISTRIBUTIONS, indent=4))

That should result in:

{
    "B1": [
        {
            "code": "800",
            "from": 1,
            "to": 5
        },
        {
            "code": "801",
            "from": 6,
            "to": 70
        },
        {
            "code": "802",
            "from": 71,
            "to": 100
        }
    ],
    "B2": [
        {
            "code": "900",
            "from": 1,
            "to": 30
        },
        {
            "code": "901",
            "from": 31,
            "to": 100
        }
    ],
    "B3": [
        {
            "code": "600",
            "from": 1,
            "to": 50
        },
        {
            "code": "601",
            "from": 51,
            "to": 100
        }
    ]
}
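
If the draws are made with random.choices instead of range lookups, the running from/to totals are not needed and the CSV can be read straight into per-section weights. This is a rough sketch under that assumption: load_weights and CSV_TEXT are names invented here, and an in-memory string stands in for the real file so the example runs on its own.

```python
import csv
import io
import random

# Stand-in for the CSV file, with the same rows as the sample above.
CSV_TEXT = """SECTION,CODE,Final_Per
B1,800,5
B1,801,65
B1,802,30
B2,900,30
B2,901,70
B3,600,50
B3,601,50
"""


def load_weights(file_obj):
    """Build {section: {code: percent}} from an open CSV file object."""
    weights = {}
    for row in csv.DictReader(file_obj):
        # Strip an optional trailing "%" so both "5" and "5%" are accepted.
        percent = float(row["Final_Per"].strip().rstrip("%"))
        weights.setdefault(row["SECTION"], {})[int(row["CODE"])] = percent
    return weights


section_weights = load_weights(io.StringIO(CSV_TEXT))

# Example draw: 14 weighted picks for section B1.
codes, code_weights = zip(*section_weights["B1"].items())
sample = random.choices(codes, weights=code_weights, k=14)
print(sample)
```

For a real file, passing the object returned by open("in.csv", newline="") to load_weights should behave the same way.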
Answered By: JonSG

Another way of doing it is to load the CSV files into DataFrames, merge them on SECTION, group by SECTION, and use .apply:

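import pandas as pd
from numpy.random import choice

df1 = pd.read_csv("/path/to/csv1")
df2 = pd.read_csv("/path/to/csv2")

def calculate_distribution(mini_df):
    # Final_Per holds strings such as "65%": drop the sign and convert to a probability
    prob = mini_df.Final_Per.str[:-1].astype(float) / 100
    # Draw NUMBER codes for this section, weighted by prob (which must sum to 1)
    return choice(mini_df.CODE.values, mini_df.NUMBER.values[0], p=prob)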

distributions = df1.merge(df2, on='SECTION').groupby('SECTION').apply(calculate_distribution)
print(distributions)
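
To try the merge-and-group idea without files on disk, here is a self-contained sketch. It assumes the two tables look like the samples in the question (percentages written as "65%"), replaces the CSV files with in-memory strings, and uses an explicit loop over the groups with a NumPy Generator; the names WEIGHTS_CSV, COUNTS_CSV and rng are illustrative only.

```python
import io

import numpy as np
import pandas as pd

# In-memory stand-ins for the two CSV files from the question.
WEIGHTS_CSV = """SECTION,CODE,Final_Per
B1,800,5%
B1,801,65%
B1,802,30%
B2,900,30%
B2,901,70%
B3,600,50%
B3,601,50%
"""
COUNTS_CSV = """SECTION,NUMBER
B1,14
B2,25
B3,12
"""

weights_df = pd.read_csv(io.StringIO(WEIGHTS_CSV))
counts_df = pd.read_csv(io.StringIO(COUNTS_CSV))

rng = np.random.default_rng(seed=0)  # seeded only so repeated runs match

# One row per (SECTION, CODE), with that section's NUMBER repeated on each row.
merged = weights_df.merge(counts_df, on="SECTION")
# Convert strings such as "65%" to probabilities such as 0.65.
merged["prob"] = merged["Final_Per"].str.rstrip("%").astype(float) / 100

distributions = {}
for section, group in merged.groupby("SECTION"):
    prob = group["prob"].to_numpy()
    prob = prob / prob.sum()  # renormalise in case the percentages do not total exactly 100
    size = int(group["NUMBER"].iloc[0])
    distributions[section] = rng.choice(group["CODE"].to_numpy(), size=size, p=prob)

for section, codes in distributions.items():
    print(section, codes.tolist())
```

Renormalising prob is a defensive choice: NumPy's choice raises an error when the probabilities do not sum to 1, which rounded percentages can easily cause.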
Answered By: Brener Ramos