Select column by regex in python
Question:
I have a dataframe with following columns
Name Company generic name generic name R&D Number (DC-) R&D Number (A) type
A AB 53654 5767 1111 3333 a
C CD 56767 56667
I would like to create a subset of this dataframe and combine columns with a pattern by aggregating the values by a ‘,’
The desired output is:
Name Company generic name R&D Number
A AB 53654, 5767 1111, 3333
C CD 56767 56667
I found there is a way to filter the columns based on regex like below
df.filter(regex=("R&D Number.*"))
But is there a way to join with the other columns and concatenate the value by ‘,’ to get the final output. Any help is highly appreciated
Answers:
You can normalize the names by removing the trailing (xx)
part (adapt the exact logic if needed), then groupby
on the columns to aggregate the values:
out = (df.groupby(df.columns.str.replace(r's*(.+)$', '', regex=True), axis=1)
.agg(lambda g: g.replace('', None).stack().astype(str)
.groupby(level=0).agg(','.join))
)
Or using a double transpose
trick:
out = (df.T
.groupby(df.columns.str.replace(r's*(.+)$', '', regex=True))
.agg(lambda g: ','.join(g.replace('', None).dropna().astype(str)))
.T
)
Output:
Company Name R&D Number generic name type
0 AB A 1111,3333 53654,5767 a
1 CD C 56667 56767 NaN
Reproducible input:
df = pd.DataFrame.from_dict({'index': [0, 1],
'columns': ['Name', 'Company', 'generic name', 'generic name', 'R&D Number (DC-)', 'R&D Number (A)', 'type'],
'data': [['A', 'AB', '53654', '5767', '1111', '3333', 'a'], ['C', 'CD', '56767', '', '56667', '', '']],
'index_names': [None],
'column_names': [None]},
orient='tight')
I have a dataframe with following columns
Name Company generic name generic name R&D Number (DC-) R&D Number (A) type
A AB 53654 5767 1111 3333 a
C CD 56767 56667
I would like to create a subset of this dataframe and combine columns with a pattern by aggregating the values by a ‘,’
The desired output is:
Name Company generic name R&D Number
A AB 53654, 5767 1111, 3333
C CD 56767 56667
I found there is a way to filter the columns based on regex like below
df.filter(regex=("R&D Number.*"))
But is there a way to join with the other columns and concatenate the value by ‘,’ to get the final output. Any help is highly appreciated
You can normalize the names by removing the trailing (xx)
part (adapt the exact logic if needed), then groupby
on the columns to aggregate the values:
out = (df.groupby(df.columns.str.replace(r's*(.+)$', '', regex=True), axis=1)
.agg(lambda g: g.replace('', None).stack().astype(str)
.groupby(level=0).agg(','.join))
)
Or using a double transpose
trick:
out = (df.T
.groupby(df.columns.str.replace(r's*(.+)$', '', regex=True))
.agg(lambda g: ','.join(g.replace('', None).dropna().astype(str)))
.T
)
Output:
Company Name R&D Number generic name type
0 AB A 1111,3333 53654,5767 a
1 CD C 56667 56767 NaN
Reproducible input:
df = pd.DataFrame.from_dict({'index': [0, 1],
'columns': ['Name', 'Company', 'generic name', 'generic name', 'R&D Number (DC-)', 'R&D Number (A)', 'type'],
'data': [['A', 'AB', '53654', '5767', '1111', '3333', 'a'], ['C', 'CD', '56767', '', '56667', '', '']],
'index_names': [None],
'column_names': [None]},
orient='tight')