Multiply pandas data-frame to a fixed number of rows
Question:
I have a data-frame. I want to multiply it (essentially duplicate its rows) up to a fixed target number of rows.
df:
col1 col2 col3
A1 B1 C1
A13 B13 C13
A27 B27 C27
I want to duplicate this data-frame so that the result has 10 rows. Essentially, each row should be repeated three times, and the 10th row can be any one of the three rows.
Answers:
I think you need divmod: the quotient is how many times to repeat every row, and the remainder is how many extra copies of a single row are needed:
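import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': ['A1', 'A13', 'A27'], 'col2': ['B1', 'B13', 'B27'], 'col3': ['C1', 'C13', 'C27']})
N = 10
a, b = divmod(N, len(df))
print (a, b)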
3 1
Solution with numpy.repeat, if all columns have the same dtype:
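# repeat every row a times
c = np.repeat(df.values, a, axis=0)
# repeat the last row b more times; [[-1]] keeps it 2-D so whole rows are repeated, not single cells
d = np.repeat(df.values[[-1]], b, axis=0)
df = pd.DataFrame(np.vstack((c, d)), columns=df.columns)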
print (df)
col1 col2 col3
0 A1 B1 C1
1 A1 B1 C1
2 A1 B1 C1
3 A13 B13 C13
4 A13 B13 C13
5 A13 B13 C13
6 A27 B27 C27
7 A27 B27 C27
8 A27 B27 C27
9 A27 B27 C27
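Why the same-dtype condition matters can be seen in a small sketch. The frame `mixed` and its column names are made up for illustration and are not part of the question: going through `.values` forces a mixed frame into a single object array, so the numeric column does not come back as an integer column, while `concat` keeps it.

```python
import numpy as np
import pandas as pd

# hypothetical frame with one text column and one integer column
mixed = pd.DataFrame({"name": ["x", "y"], "value": [1, 2]})

# .values has to pick one dtype for the whole 2-D array, which is object here
rebuilt = pd.DataFrame(np.repeat(mixed.values, 2, axis=0), columns=mixed.columns)

# concat works column by column, so the integer dtype survives
kept = pd.concat([mixed] * 2, ignore_index=True)

print(mixed["value"].dtype)    # integer dtype
print(rebuilt["value"].dtype)  # object
print(kept["value"].dtype)     # integer dtype
```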
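Solutions if the columns may have different dtypes:
A pandas-only solution with concat (sort_values('col1') regroups the copies of each row, which works here because col1 sorts in the original row order):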
df = pd.concat([df] * a + [df.iloc[[-1]]] * b).sort_values('col1').reset_index(drop=True)
print (df)
col1 col2 col3
0 A1 B1 C1
1 A1 B1 C1
2 A1 B1 C1
3 A13 B13 C13
4 A13 B13 C13
5 A13 B13 C13
6 A27 B27 C27
7 A27 B27 C27
8 A27 B27 C27
9 A27 B27 C27
A solution that repeats only the index labels and then selects the rows with loc (the last row gets the b extra copies):
idx = np.hstack((np.repeat(df.index[:-1], a), np.repeat(df.index[-1], a + b)))
df = df.loc[idx].reset_index(drop=True)
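The same idea can be wrapped in a small helper. This is only a sketch: the function name `expand_to_n_rows` is made up here, it uses positions with `iloc` instead of labels with `loc` so that a non-unique index is not a problem, and it does not handle an empty data-frame (divmod would raise ZeroDivisionError).

```python
import numpy as np
import pandas as pd


def expand_to_n_rows(df, n):
    """Repeat the rows of df so that the result has exactly n rows."""
    a, b = divmod(n, len(df))
    # every row is repeated a times; the last row also takes the b leftover copies
    counts = np.full(len(df), a)
    counts[-1] += b
    # np.repeat accepts one repeat count per element
    positions = np.repeat(np.arange(len(df)), counts)
    return df.iloc[positions].reset_index(drop=True)


df = pd.DataFrame({"col1": ["A1", "A13", "A27"],
                   "col2": ["B1", "B13", "B27"],
                   "col3": ["C1", "C13", "C27"]})
out = expand_to_n_rows(df, 10)
print(out)
```

Because only row positions are repeated, each column keeps its own dtype.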
Another solution, which only partially answers your question but might be helpful for others (it repeats the whole data-frame N times, so the result has N * len(df) rows):
N = 200000
big_df = pd.DataFrame(df.to_dict(orient="records") * N)
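If whole-frame tiling is acceptable but an exact row count is still needed, one possible variation is to tile slightly too much and trim the excess. This is a sketch and not part of the answer above: it uses concat rather than to_dict, the names `target`, `copies` and `tiled` are made up for illustration, and the rows come out interleaved (A1, A13, A27, A1, ...) rather than grouped.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A1", "A13", "A27"],
                   "col2": ["B1", "B13", "B27"],
                   "col3": ["C1", "C13", "C27"]})

target = 10
# whole copies needed to reach at least `target` rows (ceiling division)
copies = -(-target // len(df))
# tile the frame, then keep only the first `target` rows
tiled = pd.concat([df] * copies, ignore_index=True).iloc[:target]
print(tiled)
```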