Pandas extract every nth column that start with same string prefix

Question

import pandas as pd

data = {'T1_sometext': [1,2,3,4,5,6],
        'T1_anothertext': [1,2,3,4,5,6],
        "T1_anothertext2": [1,2,3,4,5,6],
        "T2_anothertext2": [1,2,3,4,5,6],
        "T2_anothertext4": [1,2,3,4,5,6],
        "T2_anothertext5": [1,2,3,4,5,6],
        }
df = pd.DataFrame(data)

How do I extract the T1 columns together and T2 columns together, so that I could plot them separately?

Is there a regex-way to do it? Let’s say I have up to T20. I would like to automate it using python code.

In other words, I would like to extract (or select) every 3 columns with the same prefix, that is, starting with same string (T1, T2, T3, etc…)

Thank you!

Asked By: SevenSouls

||

Source

Answer 1

You can extract the IDs and use groupby to get subgroups:

names = df.columns.str.extract(r'([^_]+)', expand=False)

for name, d in df.groupby(names, axis=1):
    print(f'>>> {name}')
    print(d)

Output:

>>> T1
   T1_sometext  T1_anothertext  T1_anothertext2
0            1               1                1
1            2               2                2
2            3               3                
...
>>> T2
   T2_anothertext2  T2_anothertext4  T2_anothertext5
0                1                1                1
1                2                2                2
...

Or as dictionary:

d = {name: d for name, d in df.groupby(names, axis=1)}

Output:

{'T1': subdf_with_T1
 'T2': subdf_with_T2
}

alternative using a `MultiIndex`:

df2 = df.set_axis(df.columns.str.split('_', 1, expand=True), axis=1)

for c in df2. columns.unique(level=0):
    print(f'>>> {c}')
    print(df2[c])

Output:

>>> T1
   sometext  anothertext  anothertext2
0         1            1             1
1         2            2             2
...
>>> T2
   anothertext2  anothertext4  anothertext5
0             1             1             1
1             2             2             2
...

df2:

        T1                                    T2                          
  sometext anothertext anothertext2 anothertext2 anothertext4 anothertext5
0        1           1            1            1            1            1
1        2           2            2            2            2            2
2        3           3            3            3            3            3
3        4           4            4            4            4            4
4        5           5            5            5            5            5
...

Answered By: mozway

Answer 2

You can split the keys using "_" and group the data.

import pprint

data = {'T1_sometext': [1, 2, 3, 4, 5, 6],
        'T1_anothertext': [1, 2, 3, 4, 5, 6],
        "T1_anothertext2": [1, 2, 3, 4, 5, 6],
        "T2_anothertext2": [1, 2, 3, 4, 5, 6],
        "T2_anothertext4": [1, 2, 3, 4, 5, 6],
        "T2_anothertext5": [1, 2, 3, 4, 5, 6],
        }

newdata = {}
for key in data:
    k = key.split("_")[0]
    if k not in newdata:
        newdata[k] = {}
    newdata[k][key] = data[key]

pprint.pprint(newdata)

Output:

{'T1': {'T1_anothertext': [1, 2, 3, 4, 5, 6],
        'T1_anothertext2': [1, 2, 3, 4, 5, 6],
        'T1_sometext': [1, 2, 3, 4, 5, 6]},
 'T2': {'T2_anothertext2': [1, 2, 3, 4, 5, 6],
        'T2_anothertext4': [1, 2, 3, 4, 5, 6],
        'T2_anothertext5': [1, 2, 3, 4, 5, 6]}}

Answered By: flyby

Answer 3

Here’s a way to do it by gathering all the columns into 1 dictionary. First I define a list of keys from the column name prefix as k_list. Then I iterate over the column names, and append the list with the relevant columns using dict.get().

import pandas as pd
import re

data = {'T1_sometext': [1,2,3,4,5,6],
        'T1_anothertext': [1,2,3,4,5,6],
        "T1_anothertext2": [1,2,3,4,5,6],
        "T2_anothertext2": [1,2,3,4,5,6],
        "T2_anothertext4": [1,2,3,4,5,6],
        "T2_anothertext5": [1,2,3,4,5,6],
        }
df = pd.DataFrame(data)
d = {}
k_list = re.findall('(Td+)_', ','.join(df.columns))

for i, col in enumerate(df.columns):
    d[k_list[i]] = d.get(k_list[i], list(df[col])) + list(df[col])

print(d)

Output:

{'T1': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
 'T2': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]}

You can also see how it looks like in a dataframe

df_new = pd.DataFrame(d)

    T1  T2
0    1   1
1    2   2
2    3   3
3    4   4
4    5   5
5    6   6
6    1   1
7    2   2
8    3   3
9    4   4
10   5   5
11   6   6
12   1   1
13   2   2
14   3   3
15   4   4
16   5   5
17   6   6

Answered By: perpetualstudent

Pandas extract every nth column that start with same string prefix

Question:

Answers:

alternative using a `MultiIndex`:

Pandas extract every nth column that start with same string prefix

Question:

Answers:

alternative using a MultiIndex:

alternative using a `MultiIndex`: