Randomly concat data frames by row
Question:
How can I randomly merge, join or concat pandas data frames by row? Suppose I have four data frames something like this (with a lot more rows):
df1 = pd.DataFrame({'col1':["1_1", "1_1"], 'col2':["1_2", "1_2"], 'col3':["1_3", "1_3"]})
df2 = pd.DataFrame({'col1':["2_1", "2_1"], 'col2':["2_2", "2_2"], 'col3':["2_3", "2_3"]})
df3 = pd.DataFrame({'col1':["3_1", "3_1"], 'col2':["3_2", "3_2"], 'col3':["3_3", "3_3"]})
df4 = pd.DataFrame({'col1':["4_1", "4_1"], 'col2':["4_2", "4_2"], 'col3':["4_3", "4_3"]})
How can I join these four data frames randomly so that the output looks something like this (they are randomly merged row by row):
col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3
0 1_1 1_2 1_3 4_1 4_2 4_3 2_1 2_2 2_3 3_1 3_2 3_3
1 2_1 2_2 2_3 1_1 1_2 1_3 3_1 3_2 3_3 4_1 4_2 4_3
I was thinking I could do something like this:
my_list = [df1,df2,df3,df4]
my_list = random.sample(my_list, len(my_list))
df = pd.DataFrame({'empty' : []})
for row in df:
    new_df = pd.concat(my_list, axis=1)
    print(new_df)
The for statement above will not work for more than the first row; every row after it (I have more) will just be the same, i.e. it only shuffles once:
col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3
0 4_1 4_2 4_3 1_1 1_2 1_3 2_1 2_2 2_3 3_1 3_2 3_3
1 4_1 4_2 4_3 1_1 1_2 1_3 2_1 2_2 2_3 3_1 3_2 3_3
Answers:
Maybe something like this?
import random
import numpy as np
dfs = [df1, df2, df3, df4]
n = sum(len(df.columns) for df in dfs)
pd.concat(dfs, axis=1).iloc[:, random.sample(range(n), n)]
Out[130]:
col1 col3 col1 col2 col1 col1 col2 col2 col3 col3 col3 col2
0 4_1 4_3 1_1 4_2 2_1 3_1 1_2 3_2 1_3 3_3 2_3 2_2
Or, if only the order of the data frames should be shuffled, you can do:
dfs = [df1, df2, df3, df4]
random.shuffle(dfs)
pd.concat(dfs, axis=1)
Out[133]:
col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3
0 4_1 4_2 4_3 2_1 2_2 2_3 1_1 1_2 1_3 3_1 3_2 3_3
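Both snippets above draw a single random order for the whole frame. A minimal sketch of a per-row shuffle with a plain Python loop follows, assuming all input frames have the same shape and index. The function name `shuffle_per_row_loop` and the `seed` parameter are hypothetical, introduced only for illustration, and a loop like this will likely be slow on large inputs.

```python
import random

import pandas as pd


def shuffle_per_row_loop(dfs, seed=None):
    """Lay the frames side by side, drawing a new frame order for every row."""
    rng = random.Random(seed)
    n = len(dfs)
    rows = []
    for i in range(len(dfs[0])):
        # Fresh random permutation of the frame positions for this row.
        order = rng.sample(range(n), n)
        # Take row i from each frame in that order and flatten the values.
        rows.append([value for k in order for value in dfs[k].iloc[i]])
    return pd.DataFrame(rows, index=dfs[0].index)


# Hypothetical sample data in the same shape as the question's frames.
frames = [
    pd.DataFrame({f"col{j}": [f"{k}_{j}"] * 2 for j in (1, 2, 3)})
    for k in (1, 2, 3, 4)
]
looped = shuffle_per_row_loop(frames, seed=1)
```

The result has integer column labels, because the original column names repeat across the input frames.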
UPDATE: a much better solution from @Divakar:
df1 = pd.DataFrame({'col1':["1_1", "1_1"], 'col2':["1_2", "1_2"], 'col3':["1_3", "1_3"], 'col4':["1_4", "1_4"]})
df2 = pd.DataFrame({'col1':["2_1", "2_1"], 'col2':["2_2", "2_2"], 'col3':["2_3", "2_3"], 'col4':["2_4", "2_4"]})
df3 = pd.DataFrame({'col1':["3_1", "3_1"], 'col2':["3_2", "3_2"], 'col3':["3_3", "3_3"], 'col4':["3_4", "3_4"]})
df4 = pd.DataFrame({'col1':["4_1", "4_1"], 'col2':["4_2", "4_2"], 'col3':["4_3", "4_3"], 'col4':["4_4", "4_4"]})
dfs = [df1, df2, df3, df4]
n = len(dfs)
nrows = dfs[0].shape[0]
ncols = dfs[0].shape[1]
A = pd.concat(dfs, axis=1).values.reshape(nrows,-1,ncols)
sidx = np.random.rand(nrows,n).argsort(1)
out_arr = A[np.arange(nrows)[:,None],sidx,:].reshape(nrows,-1)
df = pd.DataFrame(out_arr)
Output:
In [203]: df
Out[203]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 3_1 3_2 3_3 3_4 1_1 1_2 1_3 1_4 4_1 4_2 4_3 4_4 2_1 2_2 2_3 2_4
1 4_1 4_2 4_3 4_4 2_1 2_2 2_3 2_4 3_1 3_2 3_3 3_4 1_1 1_2 1_3 1_4
Explanation: (c) Divakar
NumPy based solution
Let’s have a NumPy based vectorized solution and hopefully a fast one!
1) Let’s reshape the array of concatenated values into a 3D array, “cutting” each row into groups of ncols values, corresponding to the number of columns in each of the input dataframes –
A = pd.concat(dfs, axis=1).values.reshape(nrows,-1,ncols)
2) Next up, we trick np.argsort into giving us random unique indices ranging from 0 to N-1, where N is the number of input dataframes –
sidx = np.random.rand(nrows,n).argsort(1)
3) The final trick is NumPy’s fancy indexing together with some broadcasting to index into A with sidx to give us the output array –
out_arr = A[np.arange(nrows)[:,None],sidx,:].reshape(nrows,-1)
4) If needed, convert to dataframe –
df = pd.DataFrame(out_arr)
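The four steps above can be wrapped into one function, sketched here under the assumption that every input frame has the same shape and index. The name `shuffle_blocks_per_row` and the `seed` parameter are hypothetical, and `np.random.default_rng` stands in for `np.random.rand` only to make the run reproducible.

```python
import numpy as np
import pandas as pd


def shuffle_blocks_per_row(dfs, seed=None):
    """Concatenate equally shaped frames side by side, shuffling their order per row."""
    n = len(dfs)
    nrows, ncols = dfs[0].shape
    rng = np.random.default_rng(seed)
    # Step 1: shape (nrows, n, ncols); axis 1 indexes the source frame.
    A = pd.concat(dfs, axis=1).to_numpy().reshape(nrows, n, ncols)
    # Step 2: argsort of uniform random numbers yields one random
    # permutation of 0..n-1 per row.
    sidx = rng.random((nrows, n)).argsort(axis=1)
    # Step 3: broadcasting pairs row i with its own permutation sidx[i].
    out = A[np.arange(nrows)[:, None], sidx, :].reshape(nrows, n * ncols)
    # Step 4: back to a dataframe (column labels become 0..n*ncols-1).
    return pd.DataFrame(out, index=dfs[0].index)


# Hypothetical sample data: four frames, three rows, three columns each.
frames = [
    pd.DataFrame({f"col{j}": [f"{k}_{j}"] * 3 for j in (1, 2, 3)})
    for k in (1, 2, 3, 4)
]
result = shuffle_blocks_per_row(frames, seed=0)
```

Each row of `result` holds all four blocks exactly once, with the values inside each block kept in their original column order.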
OLD answer:
IIUC you can do it this way:
dfs = [df1, df2, df3, df4]
n = len(dfs)
ncols = dfs[0].shape[1]
v = pd.concat(dfs, axis=1).values
a = np.arange(n * ncols).reshape(n, ncols)
df = pd.DataFrame(np.asarray([v[i, a[random.sample(range(n), n)].reshape(n * ncols,)] for i in dfs[0].index]))
Output
In [150]: df
Out[150]:
0 1 2 3 4 5 6 7 8 9 10 11
0 1_1 1_2 1_3 3_1 3_2 3_3 4_1 4_2 4_3 2_1 2_2 2_3
1 2_1 2_2 2_3 1_1 1_2 1_3 3_1 3_2 3_3 4_1 4_2 4_3
Explanation:
In [151]: v
Out[151]:
array([['1_1', '1_2', '1_3', '2_1', '2_2', '2_3', '3_1', '3_2', '3_3', '4_1', '4_2', '4_3'],
['1_1', '1_2', '1_3', '2_1', '2_2', '2_3', '3_1', '3_2', '3_3', '4_1', '4_2', '4_3']], dtype=object)
In [152]: a
Out[152]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
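To make the role of `a` concrete: each of its rows holds the column positions of one input frame inside `v`, so reordering the rows of `a` and flattening gives the column order for one output row. The sketch below uses a fixed, hand-picked order instead of `random.sample` so the result can be followed by eye. The names `order`, `col_idx`, `v_row` and `picked` are illustrative only.

```python
import numpy as np

n, ncols = 4, 3
a = np.arange(n * ncols).reshape(n, ncols)

# A fixed permutation of the frame positions; the answer draws this
# with random.sample(range(n), n) once per row.
order = [2, 0, 3, 1]

# Reorder the rows of `a`, then flatten into column positions.
col_idx = a[order].reshape(n * ncols)
print(col_idx)  # [ 6  7  8  0  1  2  9 10 11  3  4  5]

# One row of concatenated values, as in `v` above.
v_row = np.array([f"{k}_{j}" for k in (1, 2, 3, 4) for j in (1, 2, 3)], dtype=object)
picked = v_row[col_idx]
print(picked[:6])  # ['3_1' '3_2' '3_3' '1_1' '1_2' '1_3']
```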
I think this answer is easier, and it works for data frames of any dimensions:
df = pd.concat([df1, df2, df3, df4])
df = df.sample(frac=1)
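sample gives you a random sample of the rows of the df. If you ask for the complete df (frac=1), it returns all of the rows in random order. Note that this stacks the data frames vertically and shuffles whole rows; it does not reorder the column blocks within each row.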
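A small sketch of what `sample(frac=1)` does to the stacked frame, using two cut-down frames as hypothetical data. By default `DataFrame.sample` draws rows; passing `axis=1` draws columns instead. `random_state` is set only to make the run repeatable.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["1_1", "1_1"], "col2": ["1_2", "1_2"]})
df2 = pd.DataFrame({"col1": ["2_1", "2_1"], "col2": ["2_2", "2_2"]})

# Without axis=1, concat stacks the frames vertically: 4 rows, 2 columns.
stacked = pd.concat([df1, df2])

# Same rows in random order; the columns stay where they are.
shuffled_rows = stacked.sample(frac=1, random_state=0)

# Same columns in random order; the rows stay where they are.
shuffled_cols = stacked.sample(frac=1, axis=1, random_state=0)
```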