Python weighted random choices from lists with different probability, comparing two Pandas DataFrame from CSVs

Question

Python beginner, here. I am attempting to take a pandas DataFrame (created from a CSV) and use weighted random choices to choose from another DataFrame (created from a CSV). What I have is two pandas DataFrames that read something like this:

Weighted Percentages of codes:

SECTION	CODE	Final_Per
B1	800	5%
B1	801	65%
B1	802	30%
B2	900	30%
B2	901	70%
B3	600	50%
B3	601	50%

Input pandas DataFrame to run weighted percentages on:

SECTION	NUMBER
B1	14
B2	25
B3	12

These are just examples of my tables rather than the tables in their entirety. What I need to do is store these weighted probabilities, whether in a dictionary, lists, or pandas DataFrames (not sure what’s best), then apply the ‘Final_Per’ percentages to the ‘NUMBER’ column of my second table above and output the result. So B1’s result would be 14 values, 5% being code 800, 65% being code 801, and 30% being code 802. Currently, the tables are CSVs that I am turning into pandas DataFrames, and I have been trying to apply some lessons from this article https://pynative.com/python-weighted-random-choices-with-probability/ without success. Does anybody have suggestions on how to handle this correctly? Thank you.

Asked By: Connor Garrett

Answer 1

If you reshape the CSV data into something like:

SECTION_COUNTS = {
    "B1": 14,
    "B2": 25,
    "B3": 12,
}

SECTION_DISTRIBUTIONS = {
    "B1": [
        {"code": 800, "from": 1, "to": 5},
        {"code": 801, "from": 6, "to": 70},
        {"code": 802, "from": 71, "to": 100}
    ],
    "B2": [
        {"code": 900, "from": 1, "to": 30},
        {"code": 901, "from": 31, "to": 100}
    ],
    "B3": [
        {"code": 600, "from": 1, "to": 50},
        {"code": 601, "from": 51, "to": 100}
    ]
}

Then I think the answer you seek might be given by:

import random

results = {}
for section_id, count in SECTION_COUNTS.items():
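    for _ in range(count):
        # Draw once per value; drawing inside the generator would test each
        # row against a different number and could match no row at all.
        roll = random.randint(1, 100)
        code = next(
            row["code"]
            for row in SECTION_DISTRIBUTIONS[section_id]
            if row["from"] <= roll <= row["to"]
        )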
        results.setdefault(section_id, []).append(code)
print(results)

Resulting in something like:

{
    'B1': [801, 801, 802, 801, 801, 802, 800, 802, 802, 801, 802, 801, 800, 801],
    'B2': [900, 900, 900, 900, 900, 901, 900, 900, 901, 900, 900, 901, 901, 900, 900, 900, 901, 900, 901, 900, 900, 901, 900, 901, 900],
    'B3': [601, 601, 600, 600, 600, 601, 601, 601, 600, 601, 600, 600]
}
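The standard library can also do the weighted draw in a single call. Here is a minimal sketch, assuming the percentages are kept as plain weights per section rather than as from/to ranges. The name SECTION_WEIGHTS and the helper draw_codes are illustrative and not part of the answer above.

```python
import random

SECTION_COUNTS = {"B1": 14, "B2": 25, "B3": 12}

# Hypothetical reshaping: one (codes, weights) pair per section,
# taken from the Final_Per column of the question's table.
SECTION_WEIGHTS = {
    "B1": ([800, 801, 802], [5, 65, 30]),
    "B2": ([900, 901], [30, 70]),
    "B3": ([600, 601], [50, 50]),
}


def draw_codes(counts, weights, rng=random):
    """Draw counts[section] codes per section, weighted by percentage."""
    results = {}
    for section_id, count in counts.items():
        codes, percents = weights[section_id]
        # random.choices samples with replacement; weights need not sum to 1.
        results[section_id] = rng.choices(codes, weights=percents, k=count)
    return results


results = draw_codes(SECTION_COUNTS, SECTION_WEIGHTS)
print(results)
```

Because each value is drawn independently, the share of each code within a section only approximates the percentages, especially for small counts such as 14. Passing rng=random.Random(some_seed) makes a run repeatable.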

Addendum:

It was asked how one might go about reshaping a CSV like:

SECTION,CODE,Final_Per
B1,800,5
B1,801,65
B1,802,30
B2,900,30
B2,901,70
B3,600,50
B3,601,50

into our SECTION_DISTRIBUTIONS. Let’s take a look at that. The way I would initially tackle this would be to keep track of some information from the prior row of our CSV as we iterate through the rows:

import csv
import json

with open("in.csv", "r") as file_in:
    SECTION_DISTRIBUTIONS = {}
    prior_to = 0
    prior_section = ""

    for current_row in csv.DictReader(file_in):
        curr_section = current_row["SECTION"]

        if prior_section != curr_section:
            prior_section = curr_section
            prior_to = 0

        curr_code = current_row["CODE"]
        curr_from = prior_to + 1
        curr_to = prior_to + int(current_row["Final_Per"])

        target = SECTION_DISTRIBUTIONS.setdefault(curr_section, [])
        target.append({"code": curr_code, "from": curr_from, "to": curr_to})
        prior_to = curr_to

    print(json.dumps(SECTION_DISTRIBUTIONS, indent=4))

That should result in:

{
    "B1": [
        {
            "code": "800",
            "from": 1,
            "to": 5
        },
        {
            "code": "801",
            "from": 6,
            "to": 70
        },
        {
            "code": "802",
            "from": 71,
            "to": 100
        }
    ],
    "B2": [
        {
            "code": "900",
            "from": 1,
            "to": 30
        },
        {
            "code": "901",
            "from": 31,
            "to": 100
        }
    ],
    "B3": [
        {
            "code": "600",
            "from": 1,
            "to": 50
        },
        {
            "code": "601",
            "from": 51,
            "to": 100
        }
    ]
}
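The question's table writes the percentages with a trailing % sign (5%, 65%, ...), which int() would reject. Below is a sketch of the same reshaping that tolerates the sign and converts CODE to an integer so it matches the hand-written dictionary. The inline CSV text and the function name build_distributions are assumptions for illustration.

```python
import csv
import io

# Inline stand-in for the CSV file; Final_Per keeps the "%" from the question.
CSV_TEXT = """SECTION,CODE,Final_Per
B1,800,5%
B1,801,65%
B1,802,30%
B2,900,30%
B2,901,70%
B3,600,50%
B3,601,50%
"""


def build_distributions(file_in):
    """Turn per-code percentages into cumulative from/to ranges per section."""
    distributions = {}
    for row in csv.DictReader(file_in):
        ranges = distributions.setdefault(row["SECTION"], [])
        # The previous range's upper bound, or 0 at the start of a section.
        prior_to = ranges[-1]["to"] if ranges else 0
        percent = int(row["Final_Per"].strip().rstrip("%"))
        ranges.append(
            {"code": int(row["CODE"]), "from": prior_to + 1, "to": prior_to + percent}
        )
    return distributions


SECTION_DISTRIBUTIONS = build_distributions(io.StringIO(CSV_TEXT))
print(SECTION_DISTRIBUTIONS["B2"])
```

Looking up the previous bound from the section's own list, rather than from the prior row, means the rows do not have to be grouped by section in the file.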

Answered By: JonSG

Answer 2

Another way of doing it is to load the CSV files into DataFrames, merge them, and use groupby with .apply.

import pandas as pd
from numpy.random import choice

df1 = pd.read_csv("/path/to/csv1")
df2 = pd.read_csv("/path/to/csv2")

def calculate_distribution(mini_df):
    # Final_Per holds strings such as "65%": drop the sign and scale to 0-1.
    prob = mini_df.Final_Per.str[:-1].astype(float) / 100
    # Draw NUMBER codes for this section, weighted by prob.
    return choice(mini_df.CODE.values, mini_df.NUMBER.values[0], p=prob)

distributions = df1.merge(df2, on='SECTION').groupby('SECTION').apply(calculate_distribution)
print(distributions)
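For a version that runs without the CSV files on disk, here is a sketch with the question's sample rows inlined. It swaps .apply for an explicit loop over the groups and uses numpy's Generator API instead of numpy.random.choice. The seed, the variable names, and the long one-row-per-value output are illustrative choices, not part of the answer above.

```python
import io

import numpy as np
import pandas as pd

# Inline stand-ins for the two CSV files from the question.
weights_csv = io.StringIO(
    "SECTION,CODE,Final_Per\n"
    "B1,800,5%\nB1,801,65%\nB1,802,30%\n"
    "B2,900,30%\nB2,901,70%\n"
    "B3,600,50%\nB3,601,50%\n"
)
counts_csv = io.StringIO("SECTION,NUMBER\nB1,14\nB2,25\nB3,12\n")

df1 = pd.read_csv(weights_csv)
df2 = pd.read_csv(counts_csv)
merged = df1.merge(df2, on="SECTION")

rng = np.random.default_rng(seed=0)  # seeded only to make reruns repeatable

rows = []
for section, group in merged.groupby("SECTION"):
    # "65%" -> 0.65; the probabilities of one section must sum to 1.
    prob = group["Final_Per"].str.rstrip("%").astype(float) / 100
    size = int(group["NUMBER"].iloc[0])
    codes = rng.choice(group["CODE"].to_numpy(), size=size, p=prob.to_numpy())
    rows.extend({"SECTION": section, "CODE": int(code)} for code in codes)

# One row per drawn value, which is easy to write back out with to_csv.
long_df = pd.DataFrame(rows)
print(long_df.groupby(["SECTION", "CODE"]).size())
```

The final print shows how many times each code was drawn per section, which gives a quick check that the counts add up to 14, 25, and 12.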

Answered By: Brener Ramos
