How can I assign several consecutive values to a dataframe in a loop?
Question:
I am looping through a dataframe in which the "pers_function" column has several values in each cell (separated by a comma) describing people’s occupations. I want to duplicate each row and write only ONE profession to each cell in the "pers_function" column.
Unfortunately, the result has only the last value of each cell but displays this multiple times.
So if one row in the input file has "Assessor, Prokurator" in "pers_function", the output contains that row twice, both times with "Prokurator".
The code I am using is this:
df2 = df_unique
df_size = len(df2)
# find cells with commas in the pers_function column
list_to_append = []
try:
    for x in range(0, df_size):
        print(df_size - x)
        e_df = df2.iloc[[x]].fillna("n/a")  # virtual value to avoid issues with empty data frames
        if "," in e_df['pers_function'].values[0]:
            e_functions = e_df['pers_function'].values[0]
            function_list = e_functions.split(", ")
            for y in range(0, len(function_list)):
                function = function_list[y]
                print(function)
                e_df["pers_function"] = function
                e_df["factoid_ID"] = "split_factoid"
                #print(e_df)
                list_to_append.append(e_df)
        else:
            print("Only one value found.")
    print(len(list_to_append))
except Exception as e:
    print(e)
df_split = pd.concat(list_to_append, axis=0, ignore_index=True, sort=False)
display(df_split)
Repeatedly assigning new values in my loop does not work, but I do not know why. Looking at the values that were added to the list of dataframes, they are all correct. The problem only seems to occur when I write the list of dataframes to one new dataframe.
Answers:
Instead of your code, you should try this:
import pandas as pd
df = pd.DataFrame({
    'pers_function': ['a,b,c', 'a', 'a,b'],
    'feature': [1, 2, 3],
})
df['pers_function'] = df['pers_function'].str.split(',')
print(df.explode('pers_function'))
Result:
  pers_function  feature
0             a        1
0             b        1
0             c        1
1             a        2
2             a        3
2             b        3
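Applied to the data from the question, where the values are separated by a comma followed by a space, the same idea might look like the sketch below. The sample rows are made up for illustration; only the names df_unique, df_split, pers_function and factoid_ID come from the question. Unlike the loop, this version also keeps the rows that hold a single value.

```python
import pandas as pd

# Made-up sample rows; only the column names are taken from the question.
df_unique = pd.DataFrame({
    "pers_function": ["Assessor, Prokurator", "Richter", None],
    "factoid_ID": ["f1", "f2", "f3"],
})

df_split = df_unique.copy()

# Mark the rows that hold several values before they are split up.
multi = df_split["pers_function"].str.contains(",", na=False)
df_split.loc[multi, "factoid_ID"] = "split_factoid"

# Turn each cell into a list, then give every list element its own row.
df_split["pers_function"] = df_split["pers_function"].str.split(",")
df_split = df_split.explode("pers_function").reset_index(drop=True)

# Remove the space that followed each comma in the original cell.
df_split["pers_function"] = df_split["pers_function"].str.strip()

print(df_split)
```

Empty cells stay empty here, so the fillna("n/a") workaround from the loop should not be needed.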
In general, it is better not to iterate over a dataframe with a Python loop. Vectorized pandas methods such as str.split and explode are usually much faster.
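As for why the loop in the question repeats the last value: a likely cause is that e_df is created once per row, and that same object is then modified and appended on every pass of the inner loop. Every entry in the list therefore points to one dataframe, which holds whatever was assigned last. A minimal sketch of the effect, and of a .copy() fix, using a made-up sample row:

```python
import pandas as pd

# Made-up one-row frame standing in for e_df from the question.
row = pd.DataFrame({"pers_function": ["Assessor, Prokurator"]})
functions = row["pers_function"].values[0].split(", ")

aliased, copied = [], []
for function in functions:
    row["pers_function"] = function
    aliased.append(row)        # the same object is appended every time
    copied.append(row.copy())  # an independent snapshot of the current state

aliased_result = pd.concat(aliased, ignore_index=True)["pers_function"].tolist()
copied_result = pd.concat(copied, ignore_index=True)["pers_function"].tolist()

print(aliased_result)  # ['Prokurator', 'Prokurator']
print(copied_result)   # ['Assessor', 'Prokurator']
```

If the loop has to stay, appending e_df.copy() instead of e_df should give one row per profession.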