How to split one column into multiple columns and give a mark to each column?


There’s a dataframe below:

|    Value|
|X1A14    |
|X20P79   |
|A50B7P60 |

Items in the Value column do not have a fixed length. For example, X1A14 consists of two words, X1 and A14, while A50B7P60 consists of three: A50, B7 and P60.

I want to split at every letter, but I need to keep the letter with the digits that follow it, like this:

|    Value|  A|  B|  C| D|
|X1A14    |X1 |A14|   |  |
|X20P79   |X20|P79|   |  |
|A50B7P60 |A50|B7 |P60|  |
|G24C5C6B8|G24|C5 |C6 |B8|

Finally, I want to add a mark for every column. I do not know in advance how many columns there will be; here the longest item combines four words, so there are four columns to mark in this case.

Below is the final output:

|    Value|  A|mark1|  B|mark2|  C|mark3| D|mark4|
|X1A14    |X1 |    A|A14|    B|   |    C|  |    D|
|X20P79   |X20|    A|P79|    B|   |    C|  |    D|
|A50B7P60 |A50|    A|B7 |    B|P60|    C|  |    D|
|G24C5C6B8|G24|    A|C5 |    B|C6 |    C|B8|    D|

I tried the split function, but it does not keep the delimiter character.

Asked By: jasondesu



I suppose you are trying to match every substring that starts with an uppercase character and ends before the next uppercase character or the end of the string.
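One way to express that rule is the pattern [A-Z][0-9]+ (an uppercase letter followed by one or more digits). As a quick sanity check outside pandas, here is a minimal sketch using only the standard library; the names `items` and `tokens` are just for illustration:

```python
import re

# One uppercase letter followed by one or more digits.
pattern = re.compile(r"[A-Z][0-9]+")

items = ["X1A14", "X20P79", "A50B7P60", "G24C5C6B8"]

# findall returns every non-overlapping match, in order of appearance.
tokens = {item: pattern.findall(item) for item in items}

for item, parts in tokens.items():
    print(item, parts)
# "A50B7P60" gives ['A50', 'B7', 'P60']
```

Note that this assumes every word is exactly one letter plus digits; anything else in the string would be silently skipped.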

You can use extractall with regular expression pattern ([A-Z][0-9]+) as follows

import pandas as pd

# sample data
df = pd.DataFrame({
    'value': ['X1A14','X20P79','A50B7P60','G24C5C6B8']
})

# extract
extractions = df['value'].str.extractall('([A-Z][0-9]+)')

# reshape
extractions['mark'] = extractions.index.get_level_values(1).values
extractions = extractions.rename(columns={0: 'value'}).unstack().swaplevel(axis=1).sort_index(axis=1)
extractions.columns = [col[0] if col[1]=='group' else col[1]+str(col[0]) for col in extractions.columns.values]

# append to original data
pd.concat([df, extractions], axis=1)

which results in

       value  mark0 value0  mark1 value1  mark2 value2  mark3 value3
0      X1A14    0.0     X1    1.0    A14    NaN    NaN    NaN    NaN
1     X20P79    0.0    X20    1.0    P79    NaN    NaN    NaN    NaN
2   A50B7P60    0.0    A50    1.0     B7    2.0    P60    NaN    NaN
3  G24C5C6B8    0.0    G24    1.0     C5    2.0     C6    3.0     B8

This is slightly different to your expected result because it uses numeric identifiers instead of characters to identify each match. You did not specify whether you have a variable amount of substrings in column value (given that this is only an excerpt of your data), so this may be more robust.
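If you do want the letter marks from the question, the match number can be turned into a letter. The following is only a sketch along those lines, assuming no more than 26 words per item; the names `wide` and `letters` are made up for this example:

```python
import pandas as pd

df = pd.DataFrame({"value": ["X1A14", "X20P79", "A50B7P60", "G24C5C6B8"]})

# One row per match, indexed by (original row, match number).
matches = df["value"].str.extractall(r"([A-Z][0-9]+)")[0]

# Move the match number into the columns: 0, 1, 2, ...
wide = matches.unstack().fillna("")

# Name the value columns A, B, C, ... by position.
letters = [chr(ord("A") + i) for i in range(wide.shape[1])]
wide.columns = letters

# Put a constant mark column right after each value column.
for n, letter in enumerate(letters, start=1):
    wide.insert(wide.columns.get_loc(letter) + 1, f"mark{n}", letter)

result = pd.concat([df, wide], axis=1)
print(result)
```

The number of columns follows the longest item in the data, so nothing has to be hard-coded to four.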

Answered By: harryhaller

You can use str.split with expand=True and the regex (?!^)(?=\D+) to create the columns, then build mark_cols, and finally concat.

t = df["Value"].str.split(r"(?!^)(?=\D+)", expand=True).fillna("")
mark_cols = ["mark" + str(x + 1) for x in t.columns]
t.columns = t.columns.map(lambda x: chr(ord("A") + x))
# each mark column repeats its letter once per row
t[mark_cols] = pd.DataFrame(
    dict(zip(mark_cols, [[col] * len(t) for col in t.columns]))
)
out = pd.concat([df, t], axis=1)


       Value    A    B    C   D mark1 mark2 mark3 mark4
0      X1A14   X1  A14              A     B     C     D
1     X20P79  X20  P79              A     B     C     D
2   A50B7P60  A50   B7  P60         A     B     C     D
3  G24C5C6B8  G24   C5   C6  B8     A     B     C     D
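
To see why this split keeps the letters, it may help to try the same pattern with plain `re.split`. This is a small sketch, assuming Python 3.7 or later (where splitting on an empty match is supported); `boundary` and `parts` are illustrative names:

```python
import re

# (?!^)   : never split at the very start of the string
# (?=\D+) : split at a position that is followed by a non-digit
# Both parts are zero-width, so no character is consumed by the split.
boundary = re.compile(r"(?!^)(?=\D+)")

parts = {s: boundary.split(s) for s in ["X1A14", "A50B7P60", "G24C5C6B8"]}

for s, pieces in parts.items():
    print(s, pieces)
# "X1A14" gives ['X1', 'A14']
```

Because the pattern only looks ahead and matches nothing itself, the letter stays attached to the piece that follows it, which is what a plain split on the letter cannot do.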
Answered By: SomeDude