Appending a large amount of columns into 1 column
Question:
I have a dataset of ~700 columns. I’d like to join all columns into a single column.
Input:
id | A | B | C | D | E | F | ... | Z
0 | yes | no | yes | no | yes| no | ... | no
1 | no | no | yes | no | no | no | ... | no
2 | yes | yes| yes | yes| yes| no | ... | no
Output:
id | A | B | C | D | E | F | ... | Z | joined_column
0 | yes | no | yes | no | yes| no | ... | no | yes no yes no yes no ... no
1 | no | no | yes | no | no | no | ... | no | no no yes no no no ... no
2 | yes | yes| yes | yes| yes| no | ... | no | yes yes yes yes yes no ... no
I’ve used the following in the past. However, I’m looking for a way to scale this when you’re working with a large amount of columns.
def join(df):
list = []
for i in range(0, df.shape[0]):
list.append( str(df['A'][i]) + ' ' + str(df['B'][i]) + ' ' + str(df['C'][i]))
return list
Answers:
Perhaps you can use a list comprehension:
df['joined_column'] = [' '.join(item) for item in df.iloc[:,1:].values]
id A B C D E F Z joined_column
0 0 yes no yes no yes no no yes no yes no yes no no
1 1 no no yes no no no no no no yes no no no no
2 2 yes yes yes yes yes no no yes yes yes yes yes no no
You can use agg
to aggregate data over a specific axis using a specific operation:
df['joined_column'] = df.agg(' '.join, axis=1)
Potentially use
df['joined'] = df.iloc[:,1:].agg(' '.join, axis=1)
if you don’t want to join the first (or any other column).
I used ' '.join
to join the values with a space as separator.
axis
is set to 1
because you want to join columns and not rows.
Given df
:
A B C D E F
0 yes no yes no yes no
1 no no yes no no no
2 yes yes yes yes yes no
Doing:
cols = df.columns
# As a string:
df['joined_column_str'] = df[cols].agg(' '.join, axis=1)
# As a list:
df['joined_column_list'] = df[cols].agg(list, axis=1)
Output:
A B C D E F joined_column_str joined_column_list
0 yes no yes no yes no yes no yes no yes no [yes, no, yes, no, yes, no]
1 no no yes no no no no no yes no no no [no, no, yes, no, no, no]
2 yes yes yes yes yes no yes yes yes yes yes no [yes, yes, yes, yes, yes, no]
Considering the dataframe df
id A B ... Y Z joined_column
0 0 yes no ... yes no yes no yes no yes no yes no yes no yes no yes ...
1 1 no no ... no no no no yes no no no no no no no no no no no no ...
2 2 yes yes ... yes yes yes yes yes yes yes no yes yes yes yes yes yes...
One can use a custom lambda function inside pandas.DataFrame.apply
as follows
df['joined_column'] = df.apply(lambda x: ' '.join(x.astype(str)), axis=1)
[Out]:
id A B ... Y Z joined_column
0 0 yes no ... yes no 0 yes no yes no yes no yes no yes no yes no ye...
1 1 no no ... no no 1 no no yes no no no no no no no no no no no n...
2 2 yes yes ... yes yes 2 yes yes yes yes yes no yes yes yes yes yes y...
However, as one doesn’t want the id
to appear in the joined_column
, one can add x.drop('id')
as
df['joined_column'] = df.apply(lambda x: ' '.join(x.drop('id').astype(str)), axis=1)
[Out]:
id A B ... Y Z joined_column
0 0 yes no ... yes no yes no yes no yes no yes no yes no yes no yes ...
1 1 no no ... no no no no yes no no no no no no no no no no no no ...
2 2 yes yes ... yes yes yes yes yes yes yes no yes yes yes yes yes yes...
Alternatively, for this last case, one can also leverage Numpy, more specifically numpy.array
as follows
df['joined_column'] = np.array([' '.join(x.astype(str)) for x in df.drop('id', axis=1).values])
[Out]:
id A B ... Y Z joined_column
0 0 yes no ... yes no yes no yes no yes no yes no yes no yes no yes ...
1 1 no no ... no no no no yes no no no no no no no no no no no no ...
2 2 yes yes ... yes yes yes yes yes yes yes no yes yes yes yes yes yes...
A possible solution:
df['joined_column'] = df.iloc[:,1:].add(' ').sum(axis=1).str.rstrip()
Yet another possible solution, using functools.reduce
:
from functools import reduce
df['joined_column'] = reduce(lambda x,y: x+' '+y, [df[i] for i in df.columns[1:]])
Output:
id A B C D E F Z joined_column
0 0 yes no yes no yes no no yes no yes no yes no no
1 1 no no yes no no no no no no yes no no no no
2 2 yes yes yes yes yes no no yes yes yes yes yes no no
I have a dataset of ~700 columns. I’d like to join all columns into a single column.
Input:
id | A | B | C | D | E | F | ... | Z
0 | yes | no | yes | no | yes| no | ... | no
1 | no | no | yes | no | no | no | ... | no
2 | yes | yes| yes | yes| yes| no | ... | no
Output:
id | A | B | C | D | E | F | ... | Z | joined_column
0 | yes | no | yes | no | yes| no | ... | no | yes no yes no yes no ... no
1 | no | no | yes | no | no | no | ... | no | no no yes no no no ... no
2 | yes | yes| yes | yes| yes| no | ... | no | yes yes yes yes yes no ... no
I’ve used the following in the past. However, I’m looking for a way to scale this when you’re working with a large amount of columns.
def join(df):
list = []
for i in range(0, df.shape[0]):
list.append( str(df['A'][i]) + ' ' + str(df['B'][i]) + ' ' + str(df['C'][i]))
return list
Perhaps you can use a list comprehension:
df['joined_column'] = [' '.join(item) for item in df.iloc[:,1:].values]
id A B C D E F Z joined_column
0 0 yes no yes no yes no no yes no yes no yes no no
1 1 no no yes no no no no no no yes no no no no
2 2 yes yes yes yes yes no no yes yes yes yes yes no no
You can use agg
to aggregate data over a specific axis using a specific operation:
df['joined_column'] = df.agg(' '.join, axis=1)
Potentially use
df['joined'] = df.iloc[:,1:].agg(' '.join, axis=1)
if you don’t want to join the first (or any other column).
I used ' '.join
to join the values with a space as separator.
axis
is set to 1
because you want to join columns and not rows.
Given df
:
A B C D E F
0 yes no yes no yes no
1 no no yes no no no
2 yes yes yes yes yes no
Doing:
cols = df.columns
# As a string:
df['joined_column_str'] = df[cols].agg(' '.join, axis=1)
# As a list:
df['joined_column_list'] = df[cols].agg(list, axis=1)
Output:
A B C D E F joined_column_str joined_column_list
0 yes no yes no yes no yes no yes no yes no [yes, no, yes, no, yes, no]
1 no no yes no no no no no yes no no no [no, no, yes, no, no, no]
2 yes yes yes yes yes no yes yes yes yes yes no [yes, yes, yes, yes, yes, no]
Considering the dataframe df
id A B ... Y Z joined_column
0 0 yes no ... yes no yes no yes no yes no yes no yes no yes no yes ...
1 1 no no ... no no no no yes no no no no no no no no no no no no ...
2 2 yes yes ... yes yes yes yes yes yes yes no yes yes yes yes yes yes...
One can use a custom lambda function inside pandas.DataFrame.apply
as follows
df['joined_column'] = df.apply(lambda x: ' '.join(x.astype(str)), axis=1)
[Out]:
id A B ... Y Z joined_column
0 0 yes no ... yes no 0 yes no yes no yes no yes no yes no yes no ye...
1 1 no no ... no no 1 no no yes no no no no no no no no no no no n...
2 2 yes yes ... yes yes 2 yes yes yes yes yes no yes yes yes yes yes y...
However, as one doesn’t want the id
to appear in the joined_column
, one can add x.drop('id')
as
df['joined_column'] = df.apply(lambda x: ' '.join(x.drop('id').astype(str)), axis=1)
[Out]:
id A B ... Y Z joined_column
0 0 yes no ... yes no yes no yes no yes no yes no yes no yes no yes ...
1 1 no no ... no no no no yes no no no no no no no no no no no no ...
2 2 yes yes ... yes yes yes yes yes yes yes no yes yes yes yes yes yes...
Alternatively, for this last case, one can also leverage Numpy, more specifically numpy.array
as follows
df['joined_column'] = np.array([' '.join(x.astype(str)) for x in df.drop('id', axis=1).values])
[Out]:
id A B ... Y Z joined_column
0 0 yes no ... yes no yes no yes no yes no yes no yes no yes no yes ...
1 1 no no ... no no no no yes no no no no no no no no no no no no ...
2 2 yes yes ... yes yes yes yes yes yes yes no yes yes yes yes yes yes...
A possible solution:
df['joined_column'] = df.iloc[:,1:].add(' ').sum(axis=1).str.rstrip()
Yet another possible solution, using functools.reduce
:
from functools import reduce
df['joined_column'] = reduce(lambda x,y: x+' '+y, [df[i] for i in df.columns[1:]])
Output:
id A B C D E F Z joined_column
0 0 yes no yes no yes no no yes no yes no yes no no
1 1 no no yes no no no no no no yes no no no no
2 2 yes yes yes yes yes no no yes yes yes yes yes no no