Extracting from Pandas Column with regex pattern

Question:

I have a Pandas Dataframe with the following structure:

pd.DataFrame([None, '1 RB, 2 TE, 2 WR', '1 RB, 1 TE, 3 WR', '1 RB, 3 TE, 1 WR', '1 RB, 0 TE, 4 WR', '2 RB, 1 TE, 2 WR', '2 RB, 2 TE, 1 WR', '1 RB, 2 TE, 1 WR,1 P,2 LB,1 LS,3 DB', '6 OL, 2 RB, 2 TE, 0 WR'])
RB
None
1 RB, 2 TE, 2 WR
1 RB, 1 TE, 3 WR
1 RB, 3 TE, 1 WR
1 RB, 0 TE, 4 WR

Ideally, I would prefer to split the column into the following format:

RB TE WR P LB LS DB OL
0 0 0 0 0 0 0 0
1 2 2 0 0 0 0 0
1 1 3 0 0 0 0 0
1 3 1 0 0 0 0 0
1 0 4 0 0 0 0 0

Where each of the original column values is parsed based on the label ("1 RB" would be the value 1 in the column "RB"). The pattern will always be [# position].

How would I accomplish this? Each value in the original dataframe column is one long string, not a list or array. Additionally, the values do not all follow the same order; i.e. there isn't a fixed sequence of RB, TE, WR. If a position has no value, the string simply omits it rather than including "0 WR", for example.

Asked By: jwald3


Answers:

Try this:

def make_dict(g: pd.DataFrame):
    res = dict(g.values[:,[-1,0]])
    return res

# each match captures (count, position), e.g. ('1', 'RB')
grouped = df[0].str.extractall(r'(\d+)\s(\w+)').groupby(level=0)
tmp = grouped.apply(make_dict)
result = pd.DataFrame([*tmp], index=tmp.index).reindex(df.index)
print(result)

>>>

    RB  TE  WR  P   LB  LS  DB  OL
0   NaN NaN NaN NaN NaN NaN NaN NaN
1   1   2   2   NaN NaN NaN NaN NaN
2   1   1   3   NaN NaN NaN NaN NaN
3   1   3   1   NaN NaN NaN NaN NaN
4   1   0   4   NaN NaN NaN NaN NaN
5   2   1   2   NaN NaN NaN NaN NaN
6   2   2   1   NaN NaN NaN NaN NaN
7   1   2   1   1   2   1   3   NaN
8   2   2   0   NaN NaN NaN NaN 6
Answered By: ziying35
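
The output above leaves missing positions as NaN and keeps the counts as strings, while the question asks for zeros. The following is a minimal sketch of one possible way to get the zero-filled layout, not part of the original answer. It assumes the same single-column frame `df`, that each position appears at most once per row (otherwise `unstack` raises an error), and that position labels are purely alphabetic. The group names `count` and `position` are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame([None, '1 RB, 2 TE, 2 WR', '1 RB, 1 TE, 3 WR', '1 RB, 3 TE, 1 WR',
                   '1 RB, 0 TE, 4 WR', '2 RB, 1 TE, 2 WR', '2 RB, 2 TE, 1 WR',
                   '1 RB, 2 TE, 1 WR,1 P,2 LB,1 LS,3 DB', '6 OL, 2 RB, 2 TE, 0 WR'])

# One row per "<count> <position>" match; the index is (original row, match number).
matches = df[0].str.extractall(r'(?P<count>\d+)\s+(?P<position>[A-Za-z]+)')
matches['count'] = matches['count'].astype(int)

# Drop the match-number level, then spread the positions into columns.
wide = (
    matches.reset_index(level='match', drop=True)
           .set_index('position', append=True)['count']
           .unstack(fill_value=0)
)

# Rows with no matches (such as None) are absent from `wide`; add them back as zeros.
result = wide.reindex(df.index, fill_value=0)
result.columns.name = None
print(result)
```

The columns come out in alphabetical order; selecting `result[['RB', 'TE', 'WR', 'P', 'LB', 'LS', 'DB', 'OL']]` would restore the order shown in the question.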

Here is a step-by-step way to do it, assuming the pattern is always # position, # position, ...

import pandas as pd
df = pd.DataFrame([None, '1 RB, 2 TE, 2 WR', '1 RB, 1 TE, 3 WR', '1 RB, 3 TE, 1 WR', '1 RB, 0 TE, 4 WR', '2 RB, 1 TE, 2 WR', '2 RB, 2 TE, 1 WR', '1 RB, 2 TE, 1 WR,1 P,2 LB,1 LS,3 DB', '6 OL, 2 RB, 2 TE, 0 WR'])

# create a list of dictionaries
rows = []
for i, r in df.iterrows():
    data = r[0]
    try:
        # assuming the items are comma separated
        items = data.split(',')
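    except AttributeError:
        # ignore data like None, which has no .split()
        continue

    row = {}
    for item in items:
        # pattern: # position
        value, key = item.strip().split()
        row[key] = value
    # append once per original row, after all of its items are parsed
    rows.append(row)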
 

# convert list of dictionaries to dataframe
new_df = pd.DataFrame(rows)
print(new_df)
Answered By: Sagun Shrestha
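
The loop-based approach above skips rows it cannot parse, so the result no longer lines up with the original index, and the counts stay as strings. The following is a hedged sketch of a variant that keeps one output row per input row and fills gaps with 0. The helper name `parse_cell` and the compiled pattern `PAIR` are hypothetical names introduced here, and the regex assumes position labels are purely alphabetic.

```python
import re
import pandas as pd

df = pd.DataFrame([None, '1 RB, 2 TE, 2 WR', '6 OL, 2 RB, 2 TE, 0 WR'])

# "<count> <position>" pairs, e.g. "2 TE"
PAIR = re.compile(r'(\d+)\s+([A-Za-z]+)')

def parse_cell(cell):
    """Return {position: count} for one cell; non-strings (None, NaN) give {}."""
    if not isinstance(cell, str):
        return {}
    return {position: int(count) for count, position in PAIR.findall(cell)}

# An empty dict becomes an all-NaN row, which fillna(0) then turns into zeros.
parsed = pd.DataFrame([parse_cell(c) for c in df[0]], index=df.index)
parsed = parsed.fillna(0).astype(int)
print(parsed)
```

Because the pattern matches pairs directly, it does not depend on the separator, so a cell such as '1 WR,1 P' with no space after the comma is handled the same way.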