pandas: faster method than df.at[x,y]?
Question:
I have a DataFrame df1:
import pandas as pd
df1 = pd.DataFrame({'x': [1, 2, 3, 5],
                    'y': [2, 3, 4, 6],
                    'value': [1.5, 2.0, 0.5, 3.0]})
df1
x y value
0 1 2 1.5
1 2 3 2.0
2 3 4 0.5
3 5 6 3.0
and I want to assign each value to the cell at its x and y coordinates in another DataFrame, df2:
df2 = pd.DataFrame(0.0, index=range(df1['x'].max() + 1), columns=range(df1['y'].max() + 1))
df2
0 1 2 3 4 5 6
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0
using this loop:
for x, y, value in zip(df1['x'], df1['y'], df1['value']):
    df2.at[x, y] = value
which gives:
0 1 2 3 4 5 6
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 1.5 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 2.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.5 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 3.0
However, it is a bit slow because I have a long df1.
Is there a faster method than df.at[x, y]?
Answers:
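You can avoid both creating the zero-filled df2 and using df.at by combining DataFrame.pivot, DataFrame.fillna and DataFrame.reindex:
# keyword arguments are used because recent pandas versions no longer accept them positionally
df2 = (df1.pivot(index='x', columns='y', values='value')
          .fillna(0)
          .reindex(index=range(df1['x'].max() + 1),
                   columns=range(df1['y'].max() + 1), fill_value=0))
print(df2)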
y 0 1 2 3 4 5 6
x
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 1.5 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 2.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.5 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 3.0
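One caveat with pivot: it raises a ValueError when the same (x, y) pair occurs more than once, whereas the loop in the question would simply keep the last value written. Below is a minimal sketch of one way around that, assuming "last write wins" is the behaviour you want. The frame df_dup is made-up data for illustration, not from the question.

```python
import pandas as pd

# Hypothetical input with a repeated (x, y) pair; not from the question.
df_dup = pd.DataFrame({'x': [1, 1, 2],
                       'y': [2, 2, 3],
                       'value': [1.5, 2.5, 2.0]})

# pivot() refuses to reshape because the pair (1, 2) appears twice.
try:
    df_dup.pivot(index='x', columns='y', values='value')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates duplicates instead; 'last' mimics the loop,
# where a later assignment overwrites an earlier one.
grid = (df_dup.pivot_table(index='x', columns='y', values='value', aggfunc='last')
              .fillna(0)
              .reindex(index=range(df_dup['x'].max() + 1),
                       columns=range(df_dup['y'].max() + 1), fill_value=0))
print(grid)
```

Other aggregations such as 'sum' or 'mean' can be swapped in if they fit the data better.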
Since your data is all numbers, you can use numpy; with a larger dataset, it might be faster than using pd.pivot:
import numpy as np
# create a flattened copy of the zero-filled df2 (a copy, so writing into it is safe)
temp = df2.to_numpy(copy=True).ravel()
# get indices for a flattened array, based on df1.x and df1.y
arr = np.ravel_multi_index((df1.x, df1.y), df2.shape)
# replace at the positions with df1.value
temp[arr] = df1.value
# reshape and create dataframe
temp = temp.reshape(df2.shape)
pd.DataFrame(temp, index=df2.index, columns=df2.columns)
0 1 2 3 4 5 6
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 1.5 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 2.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.5 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 3.0
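The flatten-and-reshape steps can probably be skipped altogether, because NumPy accepts a pair of integer arrays as a 2-D index. The following is a sketch under the assumption that x and y are non-negative integers that double as row and column positions, as they do in the question. The name grid is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3, 5],
                    'y': [2, 3, 4, 6],
                    'value': [1.5, 2.0, 0.5, 3.0]})

# Allocate the zero grid directly as a NumPy array.
shape = (int(df1['x'].max()) + 1, int(df1['y'].max()) + 1)
grid = np.zeros(shape)

# Integer-array indexing writes every (x, y) cell in a single call.
grid[df1['x'].to_numpy(), df1['y'].to_numpy()] = df1['value'].to_numpy()

# Wrap the result in a DataFrame only once, at the end.
df2 = pd.DataFrame(grid)
print(df2)
```

Building the array first and wrapping it afterwards also avoids creating the zero-filled df2 up front.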
Another neat way to do this (for numeric data) is to use a SciPy sparse matrix, since your data is already in sparse (coordinate) format:
from scipy.sparse import csr_matrix
df2_shape = df1['x'].max()+1, df1['y'].max()+1
sp_df1 = csr_matrix((df1['value'], (df1['x'], df1['y'])), shape=df2_shape)
pd.DataFrame.sparse.from_spmatrix(sp_df1)
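Two things are worth knowing about this route. The frame returned by from_spmatrix keeps sparse-dtype columns, and SciPy sums values that share the same (x, y) pair rather than keeping the last one. If a regular dense frame is needed, a sketch along these lines should work; the names sparse_df, dense_df and dense_alt are chosen here for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix

df1 = pd.DataFrame({'x': [1, 2, 3, 5],
                    'y': [2, 3, 4, 6],
                    'value': [1.5, 2.0, 0.5, 3.0]})

shape = (int(df1['x'].max()) + 1, int(df1['y'].max()) + 1)
sp = csr_matrix((df1['value'].to_numpy(),
                 (df1['x'].to_numpy(), df1['y'].to_numpy())),
                shape=shape)

# Columns of this frame have a sparse dtype.
sparse_df = pd.DataFrame.sparse.from_spmatrix(sp)

# Option 1: convert the sparse-backed frame to ordinary float columns.
dense_df = sparse_df.sparse.to_dense()

# Option 2: densify in SciPy first, then wrap the array.
dense_alt = pd.DataFrame(sp.toarray())
print(dense_df)
```

For a very large, mostly empty grid, staying sparse may be the better choice, since the dense version stores every zero explicitly.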
In terms of speed, the sparse approach is comparable with sammywemmy’s numpy method for large datasets, and the intent is very clear.
Both are much faster than jezrael’s pivot approach, but that approach will work with all pandas datatypes, not just numeric.
There’s also a neat pandas one-liner if you already have df2 set up (taken from another answer):
# this is an in-place operation, so there is no need to assign the result
df2.update(df1.pivot(index='x', columns='y', values='value'))
This is the slowest of these approaches, but performance may be acceptable if you like the style.
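The relative speeds will depend on the size and density of your data, so it may be worth measuring on your own machine. Here is a rough timing harness built on synthetic data. The grid size, row count and function names are assumptions made for illustration, and no particular ranking is implied by the sketch itself.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: n unique (x, y) cells on a size-by-size grid.
rng = np.random.default_rng(0)
size, n = 300, 5000
cells = rng.choice(size * size, size=n, replace=False)
xs, ys = np.unravel_index(cells, (size, size))
df1 = pd.DataFrame({'x': xs, 'y': ys, 'value': rng.random(n)})


def with_at():
    # The loop from the question.
    out = pd.DataFrame(0.0, index=range(size), columns=range(size))
    for x, y, value in zip(df1['x'], df1['y'], df1['value']):
        out.at[x, y] = value
    return out


def with_pivot():
    return (df1.pivot(index='x', columns='y', values='value')
               .fillna(0)
               .reindex(index=range(size), columns=range(size), fill_value=0))


def with_numpy():
    # Flat-index variant, following the ravel_multi_index idea.
    flat = np.zeros(size * size)
    idx = np.ravel_multi_index((df1['x'].to_numpy(), df1['y'].to_numpy()),
                               (size, size))
    flat[idx] = df1['value'].to_numpy()
    return pd.DataFrame(flat.reshape(size, size))


def with_update():
    out = pd.DataFrame(0.0, index=range(size), columns=range(size))
    out.update(df1.pivot(index='x', columns='y', values='value'))
    return out


repeats = 3
for func in (with_at, with_pivot, with_numpy, with_update):
    seconds = timeit.timeit(func, number=repeats)
    print(f"{func.__name__}: {seconds / repeats:.4f} s per call")
```

Scaling size and n towards your real data should give a more representative picture than the toy frame in the question.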