Dataframe list comprehension "zip(…)": loop through chosen df columns efficiently with just a list of column name strings
Question:
This is just a nitpicking syntactic question…
I have a dataframe, and I want to use list comprehension to evaluate a function using lots of columns.
I know I can do this
df['result_col'] = [some_func(*var) for var in zip(df['col_1'], df['col_2'],... ,df['col_n'])]
I would like to do something like this
df['result_col'] = [some_func(*var) for var in zip(df[['col_1', 'col_2',... ,'col_n']])]
i.e. not having to write df n times. I cannot for the life of me figure out the syntax.
Answers:
This should work, but honestly, the OP figured it out himself as well, so +1 OP 🙂
df['result_col'] = [some_func(*var) for var in zip(*[df[col] for col in ['col_1', 'col_2',... ,'col_n']])]
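A minimal runnable sketch of the line above. The sample data, the `cols` list, and the body of `some_func` are made up for illustration and are not part of the original question:

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "col_1": [1, 2, 3],
    "col_2": [10, 20, 30],
    "col_3": [100, 200, 300],
})

def some_func(a, b, c):
    # Placeholder for the real per-row computation.
    return a + b + c

# The column names are written once, as plain strings.
cols = ["col_1", "col_2", "col_3"]

# zip(*[...]) turns the list of columns into an iterator of row tuples.
df["result_col"] = [some_func(*var) for var in zip(*[df[col] for col in cols])]
```

The star in `zip(*[...])` is the piece missing from the attempt in the question: `zip` needs each column as a separate argument, not one DataFrame.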
As mentioned in the comments above, you should use apply instead:
df['result_col'] = df.apply(lambda x: some_func(*tuple(x.values)), axis=1)
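Note that calling apply on the whole frame passes every column to the function. A small sketch, assuming only some columns are wanted (the data, `cols`, and `some_func` are hypothetical), selects the subset first:

```python
import pandas as pd

# Hypothetical data; "other" is a column the function should not receive.
df = pd.DataFrame({"col_1": [1, 2], "col_2": [10, 20], "other": ["a", "b"]})

def some_func(a, b):
    # Placeholder for the real per-row computation.
    return a * b

cols = ["col_1", "col_2"]

# Restrict to the chosen columns, then unpack each row into the function.
df["result_col"] = df[cols].apply(lambda row: some_func(*row), axis=1)
```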
df.apply() is almost as slow as df.iterrows(), and neither is recommended. See "How to iterate over rows in a DataFrame in Pandas", search for "An Obvious Example" by @cs95a, and look at the comparison graph. Since the fastest approaches (vectorization, Cython routines) are not easy to implement, the third-best and therefore usually the most practical solution is a list comprehension:
# print 3rd col
def some_func(row):
    print(row[2])
# zip() over a 2-D array yields 1-tuples, so *row passes the whole row array
df['result_col'] = [some_func(*row) for row in zip(df[['col_1', 'col_2',... ,'col_n']].to_numpy())]
or
# print 3rd col
def some_func(row):
    print(row[2])
# row[0] is the row array inside the 1-tuple
df['result_col'] = [some_func(row[0]) for row in zip(df[['col_1', 'col_2',... ,'col_n']].to_numpy())]
or
# print 3rd col
def some_func(x):
    print(x)
# row[0][2] is the third value of the row array
df['result_col'] = [some_func(row[0][2]) for row in zip(df[['col_1', 'col_2',... ,'col_n']].to_numpy())]
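In the three variants above, zip() only wraps each row in a 1-tuple, and some_func prints rather than returns, so result_col ends up holding None. A sketch of the same idea without the zip wrapper, assuming a function that returns a value (the data, `cols`, and `some_func` are made up for illustration):

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6], "col_3": [7, 8, 9]})

def some_func(a, b, c):
    # Placeholder for the real per-row computation.
    return a + b * c

cols = ["col_1", "col_2", "col_3"]

# Iterating a 2-D NumPy array yields one row at a time.
from_numpy = [some_func(*row) for row in df[cols].to_numpy()]

# itertuples(index=False, name=None) yields plain tuples, one per row.
from_tuples = [some_func(*row) for row in df[cols].itertuples(index=False, name=None)]

df["result_col"] = from_numpy
```

One caveat: to_numpy() on columns of mixed types produces a single object array, so the itertuples form may be the safer choice when the selected columns do not share a dtype.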
Further reading:
- Memory efficient way for list comprehension of pandas dataframe using multiple columns
- list comprehension in pandas
- What is the most efficient way to loop through dataframes with pandas?
- Loop through dataframe one by one (pandas)
EDIT:
Please use df.iloc and df.loc instead of df[[...]], see Selecting multiple columns in a Pandas dataframe
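A short sketch of what that selection might look like, by label with df.loc and by position with df.iloc. The data and `some_func` are hypothetical:

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"col_1": [1, 2], "col_2": [3, 4], "col_3": [5, 6]})

def some_func(a, b):
    # Placeholder for the real per-row computation.
    return a - b

# Select columns by label ...
by_label = df.loc[:, ["col_1", "col_3"]]
# ... or by integer position; both pick the same two columns here.
by_position = df.iloc[:, [0, 2]]

df["result_col"] = [some_func(*row) for row in by_label.to_numpy()]
```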