How to organise DataFrame columns
Question:
I am trying to organise DataFrame columns according to specific rules, but I don’t know how.
For example, I have a DataFrame related to chemistry as shown below.
Each row shows the number of chemical bonds in a chemical compound.
   OH  HO  CaO  OCa  OO  NaMg  MgNa
0   2   3    2    0   1     1     1
1   0   2    3    4   5     2     0
2   1   2    3    0   0     0     0
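For anyone who wants to reproduce this, the example frame above can be built roughly as follows. The variable name df is only an assumption here, chosen because it is the name the answers use.

```python
import pandas as pd

# Example data copied from the table above; one row per compound,
# one column per bond label.
df = pd.DataFrame(
    [[2, 3, 2, 0, 1, 1, 1],
     [0, 2, 3, 4, 5, 2, 0],
     [1, 2, 3, 0, 0, 0, 0]],
    columns=["OH", "HO", "CaO", "OCa", "OO", "NaMg", "MgNa"],
)
print(df)
```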
In chemistry, an OH (Oxygen-Hydrogen) bond is equivalent to an HO (Hydrogen-Oxygen) bond, and a CaO (Calcium-Oxygen) bond is equivalent to an OCa (Oxygen-Calcium) bond. Thus, I’d like to merge such columns and organise the DataFrame as shown below.
   OH  CaO  OO  NaMg
0   5    2   1     2
1   2    7   5     2
2   3    3   0     0
I’m struggling because:
- there are many kinds of chemical bonds in my real DataFrame, so it is impossible to organise them one by one (there are more than 3,000 columns, and I don’t know which bonds exist or which are duplicates).
- the number of letters depends on the element symbol, and some symbols include a lowercase letter (e.g. Hydrogen: H, one uppercase letter; Calcium: Ca, one uppercase and one lowercase letter).
I looked for the same question online and tried writing code myself, but I could not find a way. I would like to know code that solves my problem.
Answers:
You can use str.findall to extract the individual element symbols, then sort them and store them as a tuple so that both orderings of a pair map to the same key. Using frozenset is not a good solution because for OO the second O would be lost.
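The difference between the two key types can be checked in plain Python. This is only a small illustration; the variable names are made up for the example.

```python
# Element symbols as they would be extracted from the column "OO".
symbols = ["O", "O"]

as_set = frozenset(symbols)        # duplicates collapse: only one "O" is kept
as_tuple = tuple(sorted(symbols))  # both symbols are kept

print(as_set == frozenset(["O"]))  # True
print(as_tuple)                    # ('O', 'O')

# Sorting still makes the order of a mixed pair irrelevant.
print(tuple(sorted(["O", "Ca"])) == tuple(sorted(["Ca", "O"])))  # True
```

With a frozenset key, an "OO" column would be indistinguishable from a column containing a single "O", which is why the sorted tuple is used instead.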
Now you can group by these keys and apply sum:
# Modified from https://www.johndcook.com/blog/2016/02/04/regular-expression-to-match-a-chemical-element/
pat = r'(A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|U(?:u[opst])?|V|W|Xe|Yb?|Z[nr])'
# Split each column name into element symbols and sort them into a canonical tuple
grp = df.columns.str.findall(pat).map(lambda x: tuple(sorted(x)))
# Sum the columns that share a key, then join each tuple back into a label
out = df.groupby(grp, axis=1).sum().rename(columns=''.join)
Output:
>>> out
   CaO  HO  MgNa  OO
0    2   5     2   1
1    7   2     2   5
2    3   3     0   0
Another approach using a regex and sorted:
import re

# Canonical name: split on capital letters, sort the symbols, join them again
sorter = lambda x: ''.join(sorted(re.findall('[A-Z][a-z]*', x)))
out = (df.groupby(df.columns.map(sorter), axis=1, sort=False)
         .sum()
      )
Output:
   HO  CaO  OO  MgNa
0   5    2   1     2
1   2    7   5     2
2   3    3   0     0
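Both snippets above pass axis=1 to groupby. As far as I know, that argument has been deprecated in recent pandas releases (since 2.1), so the following is a sketch of the same idea that groups the transposed frame instead. The helper name canonical and the rebuilt example frame are assumptions made for this illustration.

```python
import re
import pandas as pd

# Example data from the question.
df = pd.DataFrame(
    [[2, 3, 2, 0, 1, 1, 1],
     [0, 2, 3, 4, 5, 2, 0],
     [1, 2, 3, 0, 0, 0, 0]],
    columns=["OH", "HO", "CaO", "OCa", "OO", "NaMg", "MgNa"],
)

def canonical(name):
    """Split a label into element symbols, sort them, and join them again."""
    return "".join(sorted(re.findall(r"[A-Z][a-z]*", name)))

# One canonical key per column, in column order.
keys = df.columns.map(canonical)

# Transpose so the bond labels become the row index, group the rows by key,
# sum each group, and transpose back.
out = df.T.groupby(keys).sum().T
print(out)
```

The transposes copy the data, which may matter for very wide frames, but the grouping logic is unchanged.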
Another possible solution:
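import pandas as pd

# Relabel each column with the first name that has the same sorted characters
df.columns = (pd.DataFrame
              .from_records([[''.join(sorted(x)), x] for x in df.columns])
              .groupby(0)[1].transform('first').to_list())
# Sum the now-duplicated labels row by row
df.stack().groupby(level=[0, 1]).sum().unstack()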
Output:
   CaO  NaMg  OH  OO
0    2     2   5   1
1    7     2   2   5
2    3     0   3   0
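One thing to keep in mind with this last approach is that it sorts individual characters rather than element symbols. That works for the example data, but with 3,000+ columns two different bonds might end up with the same key. A small check, using the made-up labels "NaC" and "CaN" purely for illustration:

```python
import re

def char_key(name):
    # Key used by the character-sorting approach.
    return "".join(sorted(name))

def element_key(name):
    # Key built from whole element symbols (capital letter plus lowercase letters).
    return "".join(sorted(re.findall(r"[A-Z][a-z]*", name)))

# "NaC" (sodium-carbon) and "CaN" (calcium-nitrogen) are different bonds,
# but they consist of the same three characters.
print(char_key("NaC") == char_key("CaN"))        # True
print(element_key("NaC") == element_key("CaN"))  # False
```

If such labels can occur in the real data, a key based on whole symbols is the safer choice.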