Python weighted random choices from lists with different probability, comparing two Pandas DataFrame from CSVs
Question:
Python beginner, here. I am attempting to take a pandas DataFrame (created from a CSV) and use weighted random choices to choose from another DataFrame (created from a CSV). What I have is two pandas DataFrames that read something like this:
Weighted Percentages of codes:
SECTION | CODE | Final_Per |
---|---|---|
B1 | 800 | 5% |
B1 | 801 | 65% |
B1 | 802 | 30% |
B2 | 900 | 30% |
B2 | 901 | 70% |
B3 | 600 | 50% |
B3 | 601 | 50% |
Input pandas DataFrame to run weighted percentages on:
SECTION | NUMBER |
---|---|
B1 | 14 |
B2 | 25 |
B3 | 12 |
These are just examples of my tables rather than the entirety of the tables themselves. What I need to do is store these weighted probabilities, whether in a dictionary, lists, or pandas DataFrames (not sure what’s best), then take my second table above, apply the ‘Final_Per’ percentages to the ‘NUMBER’ column, and output the result. So B1’s result would be 14 values, with 5% being code 800, 65% being code 801, and 30% being code 802. Currently, the tables are CSVs that I am turning into pandas DataFrames, and I have been trying to apply some lessons from this article https://pynative.com/python-weighted-random-choices-with-probability/ without success. Does anybody have suggestions on how to handle this correctly? Thank you.
Answers:
If you reshape the CSV data into something like:
SECTION_COUNTS = {
    "B1": 14,
    "B2": 25,
    "B3": 12,
}
# Each code owns an inclusive slice of 1..100 whose width is its percentage.
SECTION_DISTRIBUTIONS = {
    "B1": [
        {"code": 800, "from": 1, "to": 5},
        {"code": 801, "from": 6, "to": 70},
        {"code": 802, "from": 71, "to": 100}
    ],
    "B2": [
        {"code": 900, "from": 1, "to": 30},
        {"code": 901, "from": 31, "to": 100}
    ],
    "B3": [
        {"code": 600, "from": 1, "to": 50},
        {"code": 601, "from": 51, "to": 100}
    ]
}
Then I think the answer you seek might be given by:
import random

results = {}
for section_id, count in SECTION_COUNTS.items():
    for _ in range(count):
        # Draw once per sample so every row is tested against the same number.
        roll = random.randint(1, 100)
        code = next(
            row["code"]
            for row in SECTION_DISTRIBUTIONS[section_id]
            if row["from"] <= roll <= row["to"]
        )
        results.setdefault(section_id, []).append(code)
print(results)
Resulting in something like the following (the draws are random, so the exact values differ on every run):
{
'B1': [801, 801, 802, 801, 801, 802, 800, 802, 802, 801, 802, 801, 800, 801],
'B2': [900, 900, 900, 900, 900, 901, 900, 900, 901, 900, 900, 901, 901, 900, 900, 900, 901, 900, 901, 900, 900, 901, 900, 901, 900],
'B3': [601, 601, 600, 600, 600, 601, 601, 601, 600, 601, 600, 600]
}
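As a possible alternative to the from/to ranges, the standard library's random.choices accepts a weights argument directly, so the percentages can be used as they are. This is only a minimal sketch: the SECTION_WEIGHTS name, its layout, and the draw_codes helper are illustrative assumptions and not part of the answer above.

```python
import random

# Hypothetical layout: section -> {code: weight in percent}.
SECTION_WEIGHTS = {
    "B1": {800: 5, 801: 65, 802: 30},
    "B2": {900: 30, 901: 70},
    "B3": {600: 50, 601: 50},
}
SECTION_COUNTS = {"B1": 14, "B2": 25, "B3": 12}


def draw_codes(weights_by_section, counts):
    """Draw counts[section] codes for each section, weighted by percentage."""
    results = {}
    for section, count in counts.items():
        weights = weights_by_section[section]
        # random.choices samples with replacement; weights need not sum to 1.
        results[section] = random.choices(
            list(weights.keys()), weights=list(weights.values()), k=count
        )
    return results


results = draw_codes(SECTION_WEIGHTS, SECTION_COUNTS)
print(results)
```

Because the weights are relative, the same call works whether the percentages are stored as 5/65/30 or as 0.05/0.65/0.30.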
Addendum:
It was asked how one might go about reshaping a CSV like:
SECTION,CODE,Final_Per
B1,800,5
B1,801,65
B1,802,30
B2,900,30
B2,901,70
B3,600,50
B3,601,50
into our SECTION_DISTRIBUTIONS. Let’s take a look at that. The way I would initially tackle this is to keep track of some information from the prior row of the CSV as we iterate through the rows:
import csv
import json

with open("in.csv", "r") as file_in:
    SECTION_DISTRIBUTIONS = {}
    prior_to = 0
    prior_section = ""
    for current_row in csv.DictReader(file_in):
        curr_section = current_row["SECTION"]
        # A new section restarts the running total at 0.
        if prior_section != curr_section:
            prior_section = curr_section
            prior_to = 0
        curr_code = current_row["CODE"]
        # This row's range starts right after the previous row's range ends.
        curr_from = prior_to + 1
        curr_to = prior_to + int(current_row["Final_Per"])
        target = SECTION_DISTRIBUTIONS.setdefault(curr_section, [])
        target.append({"code": curr_code, "from": curr_from, "to": curr_to})
        prior_to = curr_to

print(json.dumps(SECTION_DISTRIBUTIONS, indent=4))
That should result in:
{
    "B1": [
        {
            "code": "800",
            "from": 1,
            "to": 5
        },
        {
            "code": "801",
            "from": 6,
            "to": 70
        },
        {
            "code": "802",
            "from": 71,
            "to": 100
        }
    ],
    "B2": [
        {
            "code": "900",
            "from": 1,
            "to": 30
        },
        {
            "code": "901",
            "from": 31,
            "to": 100
        }
    ],
    "B3": [
        {
            "code": "600",
            "from": 1,
            "to": 50
        },
        {
            "code": "601",
            "from": 51,
            "to": 100
        }
    ]
}
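The same reshaping could also be sketched with itertools.groupby and itertools.accumulate, which removes the manual bookkeeping of the prior row. This is an illustration under stated assumptions: the CSV is inlined through io.StringIO so the sketch runs without a file, the build_distributions helper is a made-up name, and rows of the same section are assumed to be adjacent, as in the sample.

```python
import csv
import io
from itertools import accumulate, groupby
from operator import itemgetter

# Inline copy of the sample CSV so the sketch runs without a file on disk.
CSV_TEXT = """SECTION,CODE,Final_Per
B1,800,5
B1,801,65
B1,802,30
B2,900,30
B2,901,70
B3,600,50
B3,601,50
"""


def build_distributions(lines):
    """Return {section: [{"code", "from", "to"}, ...]} from CSV lines.

    Assumes rows that share a SECTION are adjacent in the input.
    """
    distributions = {}
    reader = csv.DictReader(lines)
    for section, rows in groupby(reader, key=itemgetter("SECTION")):
        rows = list(rows)
        # Running totals give each row's upper bound: 5, 70, 100 for B1.
        uppers = list(accumulate(int(row["Final_Per"]) for row in rows))
        # Each lower bound is one past the previous row's upper bound.
        lowers = [1] + [upper + 1 for upper in uppers[:-1]]
        distributions[section] = [
            {"code": row["CODE"], "from": low, "to": high}
            for row, low, high in zip(rows, lowers, uppers)
        ]
    return distributions


distributions = build_distributions(io.StringIO(CSV_TEXT))
print(distributions["B2"])
```

As in the loop above, the codes stay strings because csv.DictReader does not convert types; wrapping row["CODE"] in int() would give integer codes if that is preferred.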
Another way of doing it is to load the CSV files into DataFrames, merge them, and use .apply (this assumes Final_Per still carries its trailing % sign, as in the question):
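import pandas as pd
from numpy.random import choice

# Replace the placeholder paths with the real CSV locations.
df1 = pd.read_csv("/path/to/csv1")  # SECTION, CODE, Final_Per
df2 = pd.read_csv("/path/to/csv2")  # SECTION, NUMBER

def calculate_distribution(mini_df):
    # Drop the trailing "%" and convert the percentages to probabilities.
    prob = mini_df.Final_Per.str[:-1].astype(float) / 100
    # Draw NUMBER codes for this section, weighted by prob.
    return choice(mini_df.CODE.values, mini_df.NUMBER.values[0], p=prob)

distributions = df1.merge(df2, on='SECTION').groupby('SECTION').apply(calculate_distribution)
print(distributions)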
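To try the merge-based idea without any files, the two tables can be built inline. The following is a rough, self-contained sketch rather than a definitive version: the weights_df, counts_df, and samples names are made up for illustration, the seed is there only to make runs repeatable, and an explicit loop over the groups stands in for .apply.

```python
import numpy as np
import pandas as pd

# Inline stand-ins for the two CSV files (real data would come from pd.read_csv).
weights_df = pd.DataFrame({
    "SECTION": ["B1", "B1", "B1", "B2", "B2", "B3", "B3"],
    "CODE": [800, 801, 802, 900, 901, 600, 601],
    "Final_Per": ["5%", "65%", "30%", "30%", "70%", "50%", "50%"],
})
counts_df = pd.DataFrame({"SECTION": ["B1", "B2", "B3"], "NUMBER": [14, 25, 12]})

rng = np.random.default_rng(seed=0)  # seeded only so runs are repeatable

# Attach each section's NUMBER to its code rows.
merged = weights_df.merge(counts_df, on="SECTION")
# Strip the trailing "%" and keep the percentage as a numeric weight.
merged["weight"] = merged["Final_Per"].str.rstrip("%").astype(float)

samples = {}
for section, group in merged.groupby("SECTION"):
    # Normalise so the probabilities sum to 1 within the section.
    prob = (group["weight"] / group["weight"].sum()).to_numpy()
    size = int(group["NUMBER"].iloc[0])
    samples[section] = rng.choice(group["CODE"].to_numpy(), size=size, p=prob)

for section, codes in samples.items():
    print(section, codes.tolist())
```

Normalising inside each group means a section whose percentages do not add up to exactly 100 is still sampled in proportion instead of raising an error.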