How to expand the key=value pairs stored inside one CSV column into new dataframe columns?

Question:

As recommended, I reduced the amount of text and optimized the question.

I have a .CSV with multiple columns, which is normal, but one of the columns has multiple columns inside it. I’m trying to pull those out and add them to the dataframe as new columns so I can work with them.

To illustrate, this is the format of the table coming from the CSV:

[Image: the table as loaded from the CSV; note the second column]

Those keys need to be new columns and the values new rows.

In the end, I would like something like this:

[Image: the second column is now integrated into the DF as new columns]

  • This is the first row of the CSV, which contains the column names:

time,"labels_stats","lockers","keys","panels","glass","std_glass","avg_glass","sand","std_sand","avg_sand","gas","std_gas","avg_gas","temperature","std_temperature","avg_temperature","cracks","std_cracks","avg_cracks","cracks_forjed"


  • Here is the second row, where the values begin (the data runs from row 2 to row 8500):

If you look closely, you will see that each value in the second column contains further columns of its own, with their own values. To be clear, they all sit inside one pair of quotation marks, beginning at "color=123(…) and ending at (..),std_box=101".

  • Highlighting the column that has multiple columns inside:

column1,"color=123,brightness=16,rowling=9,rowling_gone=5,clipper=304,avg_clipper=19,std_clipper=51.917883880861964,billedclipper=152,avg_billedclipper=9.5,std_intensity=25.958941940430982,billedbox=2,avg_billedbox=0.125,std_billedbox=0.3415650255319866,box=4,avg_box=0.25,std_box=0.6831300510639732
color=1251,brightness=33,rowling=2,rowling_gone=13,clipper=0,avg_clipper=0,std_clipper=0,billedclipper=0,avg_billedclipper=0,std_intensity=0,billedbox=0,avg_billedbox=0,std_billedbox=0,box=0,avg_box=0,std_box=101"
,"column3","column4","column5",column6,columnN
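
For reference, here is a minimal sketch of how such a file loads, assuming the default pandas CSV parser and a made-up three-column miniature of the data (the values and the `raw` string are illustrative, not the real file). Because the whole second field is wrapped in one pair of quotation marks, the line break inside it stays within a single cell:

```python
import io

import pandas as pd

# Hypothetical miniature of the CSV: the quoted second field spans two lines.
raw = (
    'time,"labels_stats","lockers"\n'
    '03/24/2018 00:00:00,"color=123,brightness=16\n'
    'color=1251,brightness=33",7\n'
)

df = pd.read_csv(io.StringIO(raw))

# One data row, three columns: the embedded newline did not start a new row.
print(df.shape)

# Splitting the cell on whitespace yields one "key=value,..." group per color.
groups = df.loc[0, "labels_stats"].split()
print(groups)
```

Each element of `groups` is then an ordinary comma-separated list of `key=value` pairs.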


EDIT-1
I tried the first solution suggested by lil-solbs:

attrs_df = pd.DataFrame(
    df["labels"].apply(
        lambda x: {
            y.split("=")[0]: y.split('=')[1] for y in x.split(',')
        }
    ).to_list()
)
attrs_df

It’s almost there: it retrieves only the first color’s attributes and creates a new dataframe, but it doesn’t extract the other colors and their respective attributes. Also, the column names should end with the color value (everything in a single dataframe), like:

"color123","brightness123","rowling123","rowling_gone123"(…),"color1251","brightness1251","rowling1251","rowling_gone1251"(…)
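
One way to picture that naming scheme is a plain-Python sketch, assuming every group starts with its `color` pair and groups are separated by whitespace. The helper name `parse_labels` and the sample cell are hypothetical:

```python
import pandas as pd


def parse_labels(cell):
    """Map one labels_stats cell to {'color123': '123', 'brightness123': '16', ...}."""
    out = {}
    for group in cell.split():            # one whitespace-separated group per color
        suffix = ""
        for pair in group.split(","):
            key, _, val = pair.partition("=")
            if key == "color":
                suffix = val              # later keys in this group inherit the color id
            out[key + suffix] = val
    return out


cell = "color=123,brightness=16\ncolor=1251,brightness=33"
parsed = parse_labels(cell)
print(parsed)

# Applied to a whole column, then joined back onto the remaining columns.
df = pd.DataFrame({"time": [10], "labels_stats": [cell]})
wide = pd.DataFrame(df["labels_stats"].map(parse_labels).tolist(), index=df.index)
out = df.drop(columns="labels_stats").join(wide)
print(out)
```

The values stay as strings here; `pd.to_numeric` could convert them afterwards if numbers are needed.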



EDIT-2

The suggestion made by Corralien worked! I took the code and edited it following the suggestion in his comments. Thank you very much, @Corralien!
Thank you too, @lil-solbs!

The solution:

# Extract labels_stats Series and flatten it
df1 = (df.pop('labels_stats').str.split().explode()
         .str.extractall(r'(?P<key>[^=]+)=(?P<val>[^,]+),?')
         .droplevel('match'))

# Add the numeric id (123, 1251, etc)
df1['key'] += df1['val'].where(df1['key'] == 'color').ffill().astype(str)

# Reshape the dataframe as the original one
df1 = df1.pivot_table(index=df1.index, columns='key', values='val',
                      sort=False, aggfunc='first')

# Get the expected output
out = pd.concat([df, df1], axis=1)
Asked By: Pdr

||

Answers:

This is not elegant, but starting with a dummy df like yours:

import pandas as pd

df = pd.DataFrame([
    {"time": 10, "labels": "color=123,brightness=456"},
    {"time": 10, "labels": "color=234,brightness=567"}
])

You can apply a lambda to split the labels column and make a new DataFrame like this:

attrs_df = pd.DataFrame(
    df["labels"].apply(
        lambda x: {
            y.split("=")[0]: y.split('=')[1] for y in x.split(',')
        }
    ).to_list()
)
attrs_df

attrs_df:

  color brightness
0   123        456
1   234        567

From there you can join the two DataFrames. If I had more time I’d make this more elegant, maybe by using cross-sections for the first time, if that is even possible.
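
A minimal sketch of that join step, assuming the same dummy frame as above and that both frames share the default row index (the `out` name is just for illustration):

```python
import pandas as pd

df = pd.DataFrame([
    {"time": 10, "labels": "color=123,brightness=456"},
    {"time": 10, "labels": "color=234,brightness=567"},
])

# Build one dict per row from the "key=value" pairs, keeping the row index.
attrs_df = pd.DataFrame(
    [dict(pair.split("=", 1) for pair in cell.split(",")) for cell in df["labels"]],
    index=df.index,
)

# Drop the packed column and attach the extracted ones side by side.
out = df.drop(columns="labels").join(attrs_df)
print(out)
```

`DataFrame.join` aligns on the index, so this only works as intended when `attrs_df` keeps the index of `df`.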

EDIT: Reading your question more carefully (lots of text), I see you have multiple colors and want each color=value pair to become a column, with the value in the rows. It’s a small tweak to the above: instead of splitting the key on =, attrs_df uses the whole pair as the key:

pd.DataFrame(
    df["labels"].apply(
        lambda x: {
            y: y.split('=')[1] for y in x.split(',')
        }
    ).to_list()
)
Answered By: lil-solbs

Try to process your column separately:

# Extract labels_stats Series and flatten it
df1 = (df.pop('labels_stats').str.split().explode()
         .str.extractall(r'(?P<key>[^=]+)=(?P<val>[^,]+),?')
         .droplevel('match'))

# Add the numeric id (123, 1251, etc)
df1['key'] += df1['val'].where(df1['key'] == 'color').ffill().astype(str)

# Reshape the dataframe as the original one
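# aggfunc='first' follows the asker's EDIT-2; the default mean can fail on string values in recent pandas
df1 = df1.pivot_table(index=df1.index, columns='key', values='val', sort=False, aggfunc='first')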

# Get the expected output
out = pd.concat([df, df1], axis=1)

Output:

                  time       lockers       keys     panels       glass     std_glass   avg_glass           sand    std_sand  ...  std_clipper80  billedclipper80  avg_billedclipper80  billedbox80  avg_billedbox80  std_billedbox80   box80  avg_box80   std_box80
0  03/24/2018 00:00:00  77787.172081  97.857143  33.714286  686.967347  35284.503611  317.196164  937679.620975  181.610782  ...   89590.988446         293454.0          3811.090909       2879.0         37.38961       297.097457  7217.0  93.727273  765.390606

[1 rows x 124 columns]
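
To see the steps on something small, here is a sketch of the same pipeline on a made-up two-row frame. It assumes `aggfunc='first'` (as in the asker's EDIT-2) because the extracted values are strings, and adds a `pd.to_numeric` conversion that is not part of the original answer:

```python
import pandas as pd

# Hypothetical miniature: two rows, each holding two color groups.
df = pd.DataFrame({
    "time": [10, 20],
    "labels_stats": [
        "color=123,brightness=16\ncolor=1251,brightness=33",
        "color=123,brightness=7\ncolor=1251,brightness=8",
    ],
})

# One row per key=value pair, indexed by the original row number.
df1 = (df.pop("labels_stats").str.split().explode()
         .str.extractall(r"(?P<key>[^=]+)=(?P<val>[^,]+),?")
         .droplevel("match"))

# Suffix each key with the color id of the group it belongs to.
df1["key"] += df1["val"].where(df1["key"] == "color").ffill().astype(str)

# Back to one row per original record; 'first' keeps the single string per cell.
df1 = df1.pivot_table(index=df1.index, columns="key", values="val",
                      sort=False, aggfunc="first")

# Added step (assumption): turn the extracted strings into numbers.
df1 = df1.apply(pd.to_numeric)

out = pd.concat([df, df1], axis=1)
print(out)
```

The suffixing step relies on `color` being the first pair in every group, since `ffill` carries the most recent color id forward.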
Answered By: Corralien