converting sparse dataframe to dense dataframe
Question:
I have sparse data stored in a dataframe:
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
a b data
0 1 2 -0.824022
1 3 5 0.503239
2 5 5 -0.540105
Since I care about the null data the actual data would look like this:
true_df
a b data
0 1 1 NaN
1 1 2 -0.824022
2 1 3 NaN
3 1 4 NaN
4 1 5 NaN
5 2 1 NaN
6 2 2 NaN
7 2 3 NaN
8 2 4 NaN
9 2 5 NaN
10 3 1 NaN
11 3 2 NaN
12 3 3 NaN
13 3 4 NaN
14 3 5 0.503239
15 4 1 NaN
16 4 2 NaN
17 4 3 NaN
18 4 4 NaN
19 4 5 NaN
20 5 1 NaN
21 5 2 NaN
22 5 3 NaN
23 5 4 NaN
24 5 5 -0.540105
My question is how do I construct true_df
? I was hoping there was some way to use pd.concat
or pd.merge
, that is, construct a dataframe that is the shape of the dense table and then join the two dataframes but that doesn’t join in the expected way (the columns are not combined). The ultimate goal is to pivot on a and b.
As a follow up because I think kinjo is correct, why does this only work for integers and not for floats? Using:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1.0,1.3,1.5], 'b':[1.2,1.5,1.5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b
newindex = [(b,a) for b in np.arange(1,df.b.max()+0.1, 0.1) for a in np.arange(1,df.a.max()+0.1,0.1)]
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex).reset_index()
Will return:
a b data
0 1.0 1.0 NaN
1 1.0 1.1 NaN
2 1.0 1.2 NaN
3 1.0 1.3 NaN
4 1.0 1.4 NaN
5 1.0 1.5 NaN
6 1.0 1.6 NaN
7 1.1 1.0 NaN
8 1.1 1.1 NaN
9 1.1 1.2 NaN
10 1.1 1.3 NaN
11 1.1 1.4 NaN
12 1.1 1.5 NaN
13 1.1 1.6 NaN
14 1.2 1.0 NaN
15 1.2 1.1 NaN
16 1.2 1.2 NaN
17 1.2 1.3 NaN
18 1.2 1.4 NaN
19 1.2 1.5 NaN
20 1.2 1.6 NaN
21 1.3 1.0 NaN
22 1.3 1.1 NaN
23 1.3 1.2 NaN
24 1.3 1.3 NaN
25 1.3 1.4 NaN
26 1.3 1.5 NaN
27 1.3 1.6 NaN
28 1.4 1.0 NaN
29 1.4 1.1 NaN
30 1.4 1.2 NaN
31 1.4 1.3 NaN
32 1.4 1.4 NaN
33 1.4 1.5 NaN
34 1.4 1.6 NaN
35 1.5 1.0 NaN
36 1.5 1.1 NaN
37 1.5 1.2 NaN
38 1.5 1.3 NaN
39 1.5 1.4 NaN
40 1.5 1.5 NaN
41 1.5 1.6 NaN
42 1.6 1.0 NaN
43 1.6 1.1 NaN
44 1.6 1.2 NaN
45 1.6 1.3 NaN
46 1.6 1.4 NaN
47 1.6 1.5 NaN
48 1.6 1.6 NaN
Answers:
Since you intend to pivot an a
and b
, you could obtain the pivoted result with
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
result = pd.DataFrame(np.nan, index=range(1,6), columns=range(1,6))
result.update(df.pivot(index='a', columns='b', values='data'))
print(result)
which yields
1 2 3 4 5
1 NaN 0.436389 NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN -1.066621
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN 0.328880
Reindex is a straightforward solution. Similar to @jezrael’s solution, but no need for merge.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b
newindex = [(b,a) for b in range(1,df.b.max()+1) for a in range(1,df.a.max()+1)]
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex)
You can then reset the index if you want the numeric count as your overall index.
In the case that your index is floats you should use linspace and not arange:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1.0,1.3,1.5], 'b':[1.2,1.5,1.5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b
newindex = [(b,a) for b in np.linspace(a_min, a_max, a_step, endpoint=False) for a in np.linspace(b_min, b_max, b_step, endpoint=False)]
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex).reset_index()
This is a nice fast approach for converting numeric data from sparse to dense, using SciPy’s sparse functionality. Works if your ultimate goal is the pivoted (i.e. dense) dataframe:
import pandas as pd
from scipy.sparse import csr_matrix
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
df_shape = df['a'].max()+1, df['b'].max()+1
sp_df = csr_matrix((df['data'], (df['a'], df['b'])), shape=df_shape)
df_dense = pd.DataFrame.sparse.from_spmatrix(sp_df)
I have sparse data stored in a dataframe:
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
a b data
0 1 2 -0.824022
1 3 5 0.503239
2 5 5 -0.540105
Since I care about the null data the actual data would look like this:
true_df
a b data
0 1 1 NaN
1 1 2 -0.824022
2 1 3 NaN
3 1 4 NaN
4 1 5 NaN
5 2 1 NaN
6 2 2 NaN
7 2 3 NaN
8 2 4 NaN
9 2 5 NaN
10 3 1 NaN
11 3 2 NaN
12 3 3 NaN
13 3 4 NaN
14 3 5 0.503239
15 4 1 NaN
16 4 2 NaN
17 4 3 NaN
18 4 4 NaN
19 4 5 NaN
20 5 1 NaN
21 5 2 NaN
22 5 3 NaN
23 5 4 NaN
24 5 5 -0.540105
My question is how do I construct true_df
? I was hoping there was some way to use pd.concat
or pd.merge
, that is, construct a dataframe that is the shape of the dense table and then join the two dataframes but that doesn’t join in the expected way (the columns are not combined). The ultimate goal is to pivot on a and b.
As a follow up because I think kinjo is correct, why does this only work for integers and not for floats? Using:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1.0,1.3,1.5], 'b':[1.2,1.5,1.5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b
newindex = [(b,a) for b in np.arange(1,df.b.max()+0.1, 0.1) for a in np.arange(1,df.a.max()+0.1,0.1)]
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex).reset_index()
Will return:
a b data
0 1.0 1.0 NaN
1 1.0 1.1 NaN
2 1.0 1.2 NaN
3 1.0 1.3 NaN
4 1.0 1.4 NaN
5 1.0 1.5 NaN
6 1.0 1.6 NaN
7 1.1 1.0 NaN
8 1.1 1.1 NaN
9 1.1 1.2 NaN
10 1.1 1.3 NaN
11 1.1 1.4 NaN
12 1.1 1.5 NaN
13 1.1 1.6 NaN
14 1.2 1.0 NaN
15 1.2 1.1 NaN
16 1.2 1.2 NaN
17 1.2 1.3 NaN
18 1.2 1.4 NaN
19 1.2 1.5 NaN
20 1.2 1.6 NaN
21 1.3 1.0 NaN
22 1.3 1.1 NaN
23 1.3 1.2 NaN
24 1.3 1.3 NaN
25 1.3 1.4 NaN
26 1.3 1.5 NaN
27 1.3 1.6 NaN
28 1.4 1.0 NaN
29 1.4 1.1 NaN
30 1.4 1.2 NaN
31 1.4 1.3 NaN
32 1.4 1.4 NaN
33 1.4 1.5 NaN
34 1.4 1.6 NaN
35 1.5 1.0 NaN
36 1.5 1.1 NaN
37 1.5 1.2 NaN
38 1.5 1.3 NaN
39 1.5 1.4 NaN
40 1.5 1.5 NaN
41 1.5 1.6 NaN
42 1.6 1.0 NaN
43 1.6 1.1 NaN
44 1.6 1.2 NaN
45 1.6 1.3 NaN
46 1.6 1.4 NaN
47 1.6 1.5 NaN
48 1.6 1.6 NaN
Since you intend to pivot an a
and b
, you could obtain the pivoted result with
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
result = pd.DataFrame(np.nan, index=range(1,6), columns=range(1,6))
result.update(df.pivot(index='a', columns='b', values='data'))
print(result)
which yields
1 2 3 4 5
1 NaN 0.436389 NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN -1.066621
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN 0.328880
Reindex is a straightforward solution. Similar to @jezrael’s solution, but no need for merge.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b
newindex = [(b,a) for b in range(1,df.b.max()+1) for a in range(1,df.a.max()+1)]
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex)
You can then reset the index if you want the numeric count as your overall index.
In the case that your index is floats you should use linspace and not arange:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1.0,1.3,1.5], 'b':[1.2,1.5,1.5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b
newindex = [(b,a) for b in np.linspace(a_min, a_max, a_step, endpoint=False) for a in np.linspace(b_min, b_max, b_step, endpoint=False)]
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex).reset_index()
This is a nice fast approach for converting numeric data from sparse to dense, using SciPy’s sparse functionality. Works if your ultimate goal is the pivoted (i.e. dense) dataframe:
import pandas as pd
from scipy.sparse import csr_matrix
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
df_shape = df['a'].max()+1, df['b'].max()+1
sp_df = csr_matrix((df['data'], (df['a'], df['b'])), shape=df_shape)
df_dense = pd.DataFrame.sparse.from_spmatrix(sp_df)