How to organise DataFrame columns

Question

I am trying to organise DataFrame columns based on the specific rules, but I don’t know the way.

For example, I have a DataFrame related to chemistry as shown below.
Each row shows the number of chemical bonds in a chemical compound.

   OH  HO  CaO  OCa  OO  NaMg  MgNa
0   2   3    2    0   1     1     1
1   0   2    3    4   5     2     0
2   1   2    3    0   0     0     0

In chemistry, OH (Oxygen-Hydrogen) bond is equal to HO (Hydrogen-Oxygen) bond and CaO (Calcium-Oxygen) bond is equal to OCa (Oxygen-Calcium) bond in the meaning. Thus, I’d like to organise the DataFrame as shown below.

   OH  CaO  OO  NaMg 
0   5    2   1     2
1   2    7   9     2
2   3    3   0     0

I’m struggling because:

there are a variety of chemical bonds in my real DataFrame, so it is impossible to organise the information one by one (The number of columns is more than 3,000 and I don’t know which kinds of chemical bonds exist and are duplicates.)
the number of letters depends on each element symbol and some symbols include lowercase
(e.g. Hydrogen: H (one letter and only uppercase), Calcium: Ca (Two letters and uppercase & lowercase)

I looked for the same question online and wrote codes by myself, but I was not able to find the way. I would like to know the codes which solve my problem.

Asked By: hiro

||

Source

Answer 1

You can use str.findall to extract individual element ~~and use frozenset~~ and sort individual elements to reorganize the pairs. Using frozenset is not a good solution because for OO, the second will be lost.

Now you can group by this sets and apply sum:

# Modified from https://www.johndcook.com/blog/2016/02/04/regular-expression-to-match-a-chemical-element/
pat = r'(A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|U(?:u[opst])?|V|W|Xe|Yb?|Z[nr])'

grp = df.columns.str.findall(pat).map(lambda x: tuple(sorted(x))))
out = df.groupby(grp, axis=1).sum().rename(columns=''.join)

Output:

>>> out
   CaO  HO  MgNa  OO
0    2   5     2   1
1    7   2     2   5
2    3   3     0   0

Answered By: Corralien

Answer 2

Another approach using a regex and sorted:

import re

sorter = lambda x: ''.join(sorted(re.findall('[A-Z][a-z]*', x)))

out = (df.groupby(df.columns.map(sorter), axis=1, sort=False)
         .sum()
       )

Output:

   HO  CaO  OO  MgNa
0   5    2   1     2
1   2    7   5     2
2   3    3   0     0

Answered By: mozway

Answer 3

Another possible solution:

df.columns = (pd.DataFrame
              .from_records([[''.join(sorted(x)), x] for x in df.columns])
              .groupby(0)[1].transform('first').to_list())
df.stack().groupby(level=[0,1]).sum().unstack()

Output:

   CaO  NaMg  OH  OO
0    2     2   5   1
1    7     2   2   5
2    3     0   3   0

Answered By: PaulS

How to organise DataFrame columns

Question:

Answers: