list comprehension in pandas
Question:
I’m giving a toy example, but it will help me understand what’s going on for something else I’m trying to do. Let’s say I want a new column in a dataframe, ‘optimal_fruit’, that is apples * oranges - bananas.
I can do something like this to get it.
df2['optimal_fruit'] = df2['apples'] * df2['oranges'] - df2['bananas']
apples  oranges  bananas  optimal_fruit
     1        6       11             -5
     2        7       12              2
     3        8       13             11
     4        9       14             22
     5       10       15             35
What is happening if I try to do something like this? And how could I do this in a list comprehension?
df2['optimal_fruit'] = [x * y - z for x in df2['apples'] for y in df2['oranges'] for z in df2['bananas']]
I get an error of:
ValueError: Length of values does not match length of index
As always, thank you all so much for your help!
Answers:
Your new method doesn’t work because the list comprehension produces more values than there are rows in your dataframe. A quick fix would be something like:
[x * y - z for x,y,z in zip(df2['apples'], df2['oranges'], df2['bananas'])]
Essentially your list comprehension statement is a set of 3 nested loops. In code:
l = []
for x in df2['apples']:
    for y in df2['oranges']:
        for z in df2['bananas']:
            l.append(x * y - z)
The length of the resulting list is the cube of the length of your DataFrame (5 x 5 x 5 = 125), because every apples value is combined with every oranges value and every bananas value. Hence the error. To fix it, you need the equivalent of:
l = []
for x, y, z in zip(df2['apples'], df2['oranges'], df2['bananas']):
    l.append(x * y - z)
In terms of list comprehension:
[x * y - z for x, y, z in zip(df2['apples'], df2['oranges'], df2['bananas'])]
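To see the difference in lengths concretely, here is a minimal self-contained sketch. It assumes df2 holds the five rows shown in the question; the names nested and paired are only for illustration.

```python
import pandas as pd

# Assumed reconstruction of the DataFrame from the question.
df2 = pd.DataFrame({
    "apples": [1, 2, 3, 4, 5],
    "oranges": [6, 7, 8, 9, 10],
    "bananas": [11, 12, 13, 14, 15],
})

# Three `for` clauses behave like three nested loops:
# every combination of (x, y, z) is produced.
nested = [x * y - z
          for x in df2["apples"]
          for y in df2["oranges"]
          for z in df2["bananas"]]

# zip walks the three columns in lockstep: one (x, y, z) per row.
paired = [x * y - z
          for x, y, z in zip(df2["apples"], df2["oranges"], df2["bananas"])]

print(len(nested))  # 125
print(len(paired))  # 5

# Only the zipped version matches the index length, so it can be assigned.
df2["optimal_fruit"] = paired
print(df2["optimal_fruit"].tolist())  # [-5, 2, 11, 22, 35]
```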
If you do not want to repeat df2 for each column:
[row[0] * row[1] - row[2] for row in df2[['apples', 'oranges', 'bananas']].to_numpy()]
or
def func(row):
    # Return the value so the comprehension collects it (print would give a list of None).
    return row[0] * row[1] - row[2]

[func(row) for row in df2[['apples', 'oranges', 'bananas']].to_numpy()]
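As a runnable sketch of the helper-function variant, assuming the same five-row df2 as in the question (the name fruit_score is made up for illustration): iterating over the 2-D array returned by to_numpy() already yields one row at a time, so no zip is needed.

```python
import pandas as pd

# Assumed reconstruction of the DataFrame from the question.
df2 = pd.DataFrame({
    "apples": [1, 2, 3, 4, 5],
    "oranges": [6, 7, 8, 9, 10],
    "bananas": [11, 12, 13, 14, 15],
})

def fruit_score(row):
    """Hypothetical helper: apples * oranges - bananas for one row."""
    apples, oranges, bananas = row
    return apples * oranges - bananas

cols = ["apples", "oranges", "bananas"]

# Each `row` is a 1-D NumPy array holding the three selected values.
scores = [fruit_score(row) for row in df2[cols].to_numpy()]

print([int(s) for s in scores])  # [-5, 2, 11, 22, 35]
```

Selecting the columns by name first means the positions inside fruit_score do not depend on how the rest of the DataFrame is laid out.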
Further reading:
- Memory efficient way for list comprehension of pandas dataframe using multiple columns
- Dataframe list comprehension "zip(…)": loop through chosen df columns efficiently with just a list of column name strings
- What is the most efficient way to loop through dataframes with pandas?
- Loop through dataframe one by one (pandas)
EDIT:
Please use df.iloc and df.loc instead of df[[...]]; see Selecting multiple columns in a Pandas dataframe.
You can get all the values of each row as an array by using np.array() inside your list comprehension (with NumPy imported as np).
The following code solves your problem:
df2['optimal_fruit'] = [x[0] * x[1] - x[2] for x in np.array(df2)]
This avoids typing each column name in your list comprehension. Note that it relies on the first three columns being apples, oranges and bananas, in that order.
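Positional indexing like x[0], x[1], x[2] depends on column order. As an alternative sketch (not part of the answer above), DataFrame.itertuples also avoids repeating df2 for each column while still referring to columns by name. It assumes the same five-row df2 as in the question.

```python
import pandas as pd

# Assumed reconstruction of the DataFrame from the question.
df2 = pd.DataFrame({
    "apples": [1, 2, 3, 4, 5],
    "oranges": [6, 7, 8, 9, 10],
    "bananas": [11, 12, 13, 14, 15],
})

# itertuples yields one namedtuple per row; index=False leaves out the index,
# and each column is available as an attribute.
df2["optimal_fruit"] = [
    row.apples * row.oranges - row.bananas
    for row in df2.itertuples(index=False)
]

print(df2["optimal_fruit"].tolist())  # [-5, 2, 11, 22, 35]
```

For plain arithmetic like this, the vectorized expression from the question is still the simplest choice; a comprehension is mainly useful when the per-row logic cannot be written as column operations.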