How can I assign several consecutive values to a dataframe in a loop?

Question:

I am looping through a dataframe in which the "pers_function" column has several values in each cell (separated by a comma) describing people’s occupations. I want to duplicate each row and write only ONE profession to each cell in the "pers_function" column.

Unfortunately, the result contains only the last value of each cell, repeated once for every value that was in the cell.

So if one row in the input file has Assessor, Prokurator in "pers_function", I get this as the output:

[Image: screenshot of the output]

The code I am using is this:

df2 = df_unique
df_size=len(df2)

# find cells with commas in the pers_function column
list_to_append=[]
try:
  for x in range(0, df_size):
      print(df_size - x)
      e_df=df2.iloc[[x]].fillna("n/a") # virtual value to avoid issues with empty data frames
  
      if "," in e_df['pers_function'].values[0]:
        e_functions=e_df['pers_function'].values[0]
        function_list=e_functions.split(", ")
        for y in range(0, len(function_list)):
          function=function_list[y]
          print(function)
          e_df["pers_function"]=function
          e_df["factoid_ID"]="split_factoid"
          #print(e_df)
          list_to_append.append(e_df)

      else:
        print("Only one value found.")
  
  print(len(list_to_append))
 
except Exception as e:
  print(e)



df_split = pd.concat(list_to_append, axis=0, ignore_index=True, sort=False)
display(df_split)

Repeatedly assigning new values in my loop does not work, but I do not know why. Looking at the values that were added to the list of dataframes, they are all correct. The problem only seems to occur when I write the list of dataframes to one new dataframe.

Asked By: OnceUponATime


Answers:

Instead of looping, you can split the column into lists and call explode, which creates one row per list element:

import pandas as pd

df = pd.DataFrame({
    'pers_function': ['a,b,c', 'a', 'a,b'],
    'feature': [1,2,3]}
)

df['pers_function'] = df['pers_function'].str.split(',')
print(df.explode('pers_function'))

Result:

  pers_function  feature
0             a        1
0             b        1
0             c        1
1             a        2
2             a        3
2             b        3
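
Applied to the data in the question, the same idea might look like the sketch below. The sample rows and the original factoid_ID values are made up for illustration. Stripping whitespace after the split is an assumption based on the ", " separator described in the question, and rows without a comma are kept unchanged here, whereas the original loop skipped them.

```python
import pandas as pd

# Hypothetical sample data modelled on the question's description
df_unique = pd.DataFrame({
    "pers_function": ["Assessor, Prokurator", "Richter", None],
    "factoid_ID": ["f1", "f2", "f3"],
})

df_split = df_unique.copy()

# Turn each cell into a list of professions; missing cells stay missing
df_split["pers_function"] = df_split["pers_function"].str.split(",")

# Mark rows that held more than one profession, as the loop in the question did
was_split = df_split["pers_function"].str.len() > 1
df_split.loc[was_split, "factoid_ID"] = "split_factoid"

# One row per profession, then remove the space left over after each comma
df_split = df_split.explode("pers_function").reset_index(drop=True)
df_split["pers_function"] = df_split["pers_function"].str.strip()

print(df_split)
```

Because missing values pass through str.split, str.len and explode untouched, the fillna("n/a") workaround from the question should not be needed with this approach.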

In general, it is not wise to iterate over the rows of a dataframe with a Python loop. Vectorized pandas methods such as str.split and explode are usually much faster and leave less room for mistakes.
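
As for why the loop in the question repeats the last value: the likely cause is that the inner loop appends the same e_df object to the list every time and only changes its contents in between. Every list entry therefore refers to one dataframe, which holds whatever was assigned last. Printing inside the loop looks correct because it shows the object at that moment. The sketch below reproduces the effect with a made-up single-row dataframe and shows that appending a copy avoids it; the variable names are illustrative and not taken from the question.

```python
import pandas as pd

functions = ["Assessor", "Prokurator"]

# Pattern from the question: one object is mutated and appended repeatedly
row = pd.DataFrame({"pers_function": ["Assessor, Prokurator"]})
shared = []
for function in functions:
    row["pers_function"] = function
    shared.append(row)  # appends a reference to the same dataframe each time
buggy = pd.concat(shared, ignore_index=True)

# Fix: work on an independent copy in every iteration
row = pd.DataFrame({"pers_function": ["Assessor, Prokurator"]})
copies = []
for function in functions:
    new_row = row.copy()
    new_row["pers_function"] = function
    copies.append(new_row)
fixed = pd.concat(copies, ignore_index=True)

print(buggy["pers_function"].tolist())  # ['Prokurator', 'Prokurator']
print(fixed["pers_function"].tolist())  # ['Assessor', 'Prokurator']
```

In the original code, the equivalent change would be to append e_df.copy() instead of e_df.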

Answered By: Lukas Hestermeyer