How to create a loop that takes an existing df and creates a randomized new df
Question:
I am trying to build a tool that will essentially scramble a dataset while maintaining the same elements. For example, if I have the table below
1 2 3 4 5 6
0 ABC 1234 NL00 Paid VISA
1 BCD 2345 NL01 Unpaid AMEX
2 CDE 3456 NL02 Unpaid VISA
I want it to then go through each column, pick a random value, and paste that into a new df. An example output would be
1 2 3 4 5 6
2 BCD 2345 NL01 Unpaid VISA
0 BCD 1234 NL02 Unpaid VISA
0 CDE 3456 NL01 Paid VISA
I have managed to make it work with the code below, but for 24 columns the code was quite repetitive. I know a loop should be able to do this much quicker; I just have not been able to make it work.
import pandas as pd
import random
lst1 = df['1'].to_list()
lst2 = df['2'].to_list()
lst3 = df['3'].to_list()
lst4 = df['4'].to_list()
lst5 = df['5'].to_list()
lst6 = df['6'].to_list()
df_new = pd.DataFrame()
df_new['1'] = random.choices(lst1, k=2000)
df_new['2'] = random.choices(lst2, k=2000)
df_new['3'] = random.choices(lst3, k=2000)
df_new['4'] = random.choices(lst4, k=2000)
df_new['5'] = random.choices(lst5, k=2000)
df_new['6'] = random.choices(lst6, k=2000)
Answers:
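cols = list(df.columns)
df_new = pd.DataFrame()
for x in range(len(cols)):
    # Draw 2000 values, with replacement, from the current column.
    lst = df[cols[x]].to_list()
    # Names the new columns '1', '2', ... as in the question;
    # use cols[x] instead to keep the original column names.
    colname = str(x+1)
    df_new[colname] = random.choices(lst, k=2000)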
Here’s a loop that iterates through the column names. Something like this should work.
You can loop over the columns of the original dataframe and use sampling with replacement on each column to get the columns of the new dataframe.
df_new = pd.DataFrame()
for col_name in df.columns:
    df_new[col_name] = df[col_name].sample(n=2000, replace=True).tolist()
print(df_new)
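For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The toy data mirrors the table in the question. The function name `scramble_columns` and its `n_rows` and `seed` parameters are made up for illustration and are not part of pandas. It assumes that sharing one `np.random.RandomState` across columns is an acceptable way to make the result reproducible while still sampling each column independently.

```python
import numpy as np
import pandas as pd


def scramble_columns(df, n_rows, seed=None):
    """Sample every column independently, with replacement (hypothetical helper)."""
    # One shared generator: each column draws from it in turn, so the
    # columns get different row positions but a fixed seed repeats the result.
    rng = np.random.RandomState(seed)
    return pd.DataFrame({
        col: df[col].sample(n=n_rows, replace=True, random_state=rng).to_numpy()
        for col in df.columns
    })


# Toy data mirroring the table in the question.
df = pd.DataFrame({
    "1": [0, 1, 2],
    "2": ["ABC", "BCD", "CDE"],
    "3": [1234, 2345, 3456],
    "4": ["NL00", "NL01", "NL02"],
    "5": ["Paid", "Unpaid", "Unpaid"],
    "6": ["VISA", "AMEX", "VISA"],
})

df_new = scramble_columns(df, n_rows=2000, seed=42)
print(df_new.head())
```

Building the frame from a dict in one go also avoids inserting columns one at a time, which can matter when there are many columns.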
Here’s an easy solution:
df.apply(pd.Series.sample, replace=True, ignore_index=True, frac=1)
Output (potential):
1 2 3 4 5 6
0 2 CDE 3456 NL00 Paid VISA
1 2 BCD 3456 NL01 Paid VISA
2 0 CDE 3456 NL01 Paid VISA
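pd.DataFrame.apply applies the pd.Series.sample method to each column of the dataframe, sampling with replacement (replace=True) and returning as many rows as the original dataframe (frac=1). ignore_index=True resets each sampled column to a fresh 0..n-1 index so that the columns line up again.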
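Since the question asks to keep "the same elements", it may help to see what replace=True changes at frac=1. The sketch below uses toy data mirroring the question's table and assumes a pandas version where `Series.sample` accepts `ignore_index` (1.3 or later, as far as I know). The names `resampled` and `shuffled` are just illustrative.

```python
import pandas as pd

# Toy data mirroring the table in the question.
df = pd.DataFrame({
    "1": [0, 1, 2],
    "2": ["ABC", "BCD", "CDE"],
    "3": [1234, 2345, 3456],
    "4": ["NL00", "NL01", "NL02"],
    "5": ["Paid", "Unpaid", "Unpaid"],
    "6": ["VISA", "AMEX", "VISA"],
})

# replace=True: values are drawn with replacement, so within a column
# a value may show up several times or not at all.
resampled = df.apply(pd.Series.sample, replace=True, ignore_index=True, frac=1)

# replace=False: each column is only permuted, so every original value
# appears exactly as often as it did before.
shuffled = df.apply(pd.Series.sample, replace=False, ignore_index=True, frac=1)

print(resampled)
print(shuffled)
```

If each column must contain exactly its original values, just in a different order, the replace=False variant is probably the one to use. It cannot produce more rows than the original has, though, so the 2000-row case still needs sampling with replacement.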