Pandas transpose raw data with inconsistent row pattern

Question:

I have the following raw data export:

import pandas as pd
df = pd.DataFrame({ 
     'Data': ['A15','abc','A25',50,'abc','A30',20,'def','A4','hijk','A',10,'mnop'],

    })
df

enter image description here

I am trying to transpose this raw data to a table with 3 columns: Name, Number and Text

I would like a row for every time A appears as this is the consistent pattern. After that there is always a text but only sometimes there is a number. If this number appears it is always the direct row after the A. My expected output is this:

enter image description here

Any ideas on how i can approach this? Thanks very much!

Asked By: SOK

||

Answers:

import re
import pandas as pd
import numpy as np

df = pd.DataFrame({'Data': ['A15', 'abc', 'A25', 50, 'abc', 'A30', 20, 'def', 'A4', 'hijk', 'A', 10, 'mnop']})

convert to list:

flat_list = df['Data'].tolist()

iterate over list, if element matches Ad+ add new sublist otherwise append to last sublist:

nested_list = []
while len(flat_list) > 0:
    element = flat_list.pop(0)
    if re.fullmatch("Ad*", str(element)):
        nested_list.append([element])
    else:
        nested_list[-1].append(element)

to list of dicts where ‘Number’ is np.NaN if a sublist has only two items:

as_records = [
    {'Name': l[0], 'Number': l[1], 'text': l[2]} if len(l) == 3 else {'Name': l[0], 'Number': np.NaN, 'text': l[1]} 
    for l in nested_list]

convert to DataFrame:

df_out = pd.DataFrame(as_records)

Which returns:

  Name  Number  text
0  A15     NaN   abc
1  A25    50.0   abc
2  A30    20.0   def
3   A4     NaN  hijk
4    A    10.0  mnop
Answered By: ewz93

You can define masks and use a pivot:

m1 = df['Data'].str.startswith('A')
m2 = m1.isna() #OR: pd.to_numeric(df['Data'], errors='coerce').notna()
m1 = m1.fillna(False)

df2 = (df
 .assign(index=m1.cumsum(),
         col=np.select([m1, m2], ['Name', 'Number'], 'Text')
        )
 .pivot(index='index', columns='col', values='Data')
)

output:

col   Name Number  Text
index                  
1      A15    NaN   abc
2      A25     50   abc
3      A30     20   def
4       A4    NaN  hijk
5        A     10  mnop

intermediates:

    Data     m1  m1(v2)     m2  m1(cumsum)
0    A15   True    True  False           1
1    abc  False   False  False           1
2    A25   True    True  False           2
3     50    NaN   False   True           2
4    abc  False   False  False           2
5    A30   True    True  False           3
6     20    NaN   False   True           3
7    def  False   False  False           3
8     A4   True    True  False           4
9   hijk  False   False  False           4
10     A   True    True  False           5
11    10    NaN   False   True           5
12  mnop  False   False  False           5
Answered By: mozway
def function1(dd:pd.DataFrame):
    ss1=dd.loc[df.Data.astype(str).str.isdigit(),"Data"]
    ss2=pd.Index(dd.iloc[1:].Data).difference(ss1)
    return pd.Series(dict(Name=dd.head(1).squeeze(),Number=ss1.max(),text=ss2.max()),name=dd.name)

col1=df.Data.astype(str).str.contains("A").cumsum()
df.groupby(col1).apply(function1).reset_index(drop=True)

out

  Name  Number  text
Data                   
0     A15     NaN   abc
1     A25    50.0   abc
2     A30    20.0   def
3      A4     NaN  hijk
4       A    10.0  mnop
Answered By: G.G
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.