How to add more columns to a dataframe by extracting the columns nested inside one column that came from a CSV?
Question:
As recommended, I reduced the amount of text and optimized the question.
I have a .CSV with multiple columns, which is normal, but one of the columns has multiple columns inside it, and I’m trying to get those columns out and add them as new columns to the dataframe so I can work with them.
Just to illustrate, this is the format of the table coming from the CSV:
Those keys need to be new columns and the values new rows.
In the end, I would like something like this:
- This is the first row of the CSV; it contains the column names:
time,"labels_stats","lockers","keys","panels","glass","std_glass","avg_glass","sand","std_sand","avg_sand","gas","std_gas","avg_gas","temperature","std_temperature","avg_temperature","cracks","std_cracks","avg_cracks","cracks_forjed"
- Here is the second row, where the values begin (the data starts in the second row and continues until row 8500):
If you look closely, you will see that the values in the second column contain further columns with their own values. To be clear, they sit inside one big pair of quotation marks, beginning at "color=123(…) and ending at (..),std_box=101".
- Highlighting the column that has multiple columns inside:
column1,"color=123,brightness=16,rowling=9,rowling_gone=5,clipper=304,avg_clipper=19,std_clipper=51.917883880861964,billedclipper=152,avg_billedclipper=9.5,std_intensity=25.958941940430982,billedbox=2,avg_billedbox=0.125,std_billedbox=0.3415650255319866,box=4,avg_box=0.25,std_box=0.6831300510639732
color=1251,brightness=33,rowling=2,rowling_gone=13,clipper=0,avg_clipper=0,std_clipper=0,billedclipper=0,avg_billedclipper=0,std_intensity=0,billedbox=0,avg_billedbox=0,std_billedbox=0,box=0,avg_box=0,std_box=101","column3","column4","column5",column6,columnN
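For reference, a quoted field that spans several lines like this is still read as a single cell by pd.read_csv. Below is a minimal sketch on a made-up three-column sample (the column subset and values are shortened assumptions, not the real file), showing that each color block ends up on its own line inside the cell:

```python
import io
import pandas as pd

# Hypothetical, shortened sample mimicking the layout above: the quoted
# labels_stats field contains a line break between the two color blocks.
csv_text = (
    'time,"labels_stats","lockers"\n'
    '03/24/2018 00:00:00,"color=123,brightness=16,box=4\n'
    'color=1251,brightness=33,box=0","77787.17"\n'
)

df = pd.read_csv(io.StringIO(csv_text))

# The whole quoted field is one cell; splitting on whitespace
# yields one string per color block.
blocks = df.loc[0, "labels_stats"].split()
print(df.shape)
print(blocks)
```

So the parsing problem is only about the contents of that one cell, not about reading the file itself.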
EDIT-1
Tried the first solution suggested by lil-solbs:
attrs_df = pd.DataFrame(
    df["labels"].apply(
        lambda x: {
            y.split("=")[0]: y.split('=')[1] for y in x.split(',')
        }
    ).to_list()
)
attrs_df
It’s almost there: it retrieves only the first color’s attributes and creates a new dataframe, but it doesn’t extract the other colors and their respective attributes. Also, the column names should end with the color value (everything in a single dataframe), like:
"color123","brightness123","rowling123","rowling_gone123"(…),"color1251","brightness1251","rowling1251","rowling_gone1251"(…)
EDIT-2
The suggestion made by Corralien worked! I took the code and edited it following the suggestion in his comments. Thank you very, very much @Corralien!
Thank you too @lil-solbs!
The solution:
# Extract labels_stats Series and flatten it
df1 = (df.pop('labels_stats').str.split().explode()
         .str.extractall(r'(?P<key>[^=]+)=(?P<val>[^,]+),?')
         .droplevel('match'))
# Add the numeric id (123, 1251, etc)
df1['key'] += df1['val'].where(df1['key'] == 'color').ffill().astype(str)
# Reshape the dataframe as the original one
df1 = df1.pivot_table(index=df1.index, columns='key', values='val',
                      sort=False, aggfunc='first')
# Get the expected output
out = pd.concat([df, df1], axis=1)
Answers:
This is not elegant, but starting with a dummy df like yours:
df = pd.DataFrame([
    {"time": 10, "labels": "color=123,brightness=456"},
    {"time": 10, "labels": "color=234,brightness=567"}
])
You can apply a lambda to split the labels column and make a new DataFrame like this:
attrs_df = pd.DataFrame(
    df["labels"].apply(
        lambda x: {
            y.split("=")[0]: y.split('=')[1] for y in x.split(',')
        }
    ).to_list()
)
attrs_df
attrs_df:
   color brightness
0    123        456
1    234        567
From there you can join the two DataFrames. If I had more time I’d make that more elegant. Maybe I’d use cross-sections for the first time, if that’s even possible.
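The join step could look roughly like this. It is only a sketch on the same dummy data, and it assumes both frames still share the default 0..n-1 index, which holds here because attrs_df is built row by row from df:

```python
import pandas as pd

df = pd.DataFrame([
    {"time": 10, "labels": "color=123,brightness=456"},
    {"time": 10, "labels": "color=234,brightness=567"},
])

# Same idea as the lambda above: one dict of key -> value per row
attrs_df = pd.DataFrame(
    df["labels"]
    .apply(lambda x: dict(pair.split("=", 1) for pair in x.split(",")))
    .to_list()
)

# Both frames share the default integer index, so an index join lines them up.
out = df.drop(columns="labels").join(attrs_df)
print(out)
```

Note that the extracted values are still strings at this point; pd.to_numeric can convert them if numbers are needed.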
EDIT: Reading your question better (lots of text) I see you have multiple colors and want the color=value to have value in the rows. It’s a small tweak to above, but attrs_df would not split the key on =, but just be the key:
pd.DataFrame(
    df["labels"].apply(
        lambda x: {
            y: y.split('=')[1] for y in x.split(',')
        }
    ).to_list()
)
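If the goal is column names that end with the color value (color123, brightness123, and so on), the dict-per-row idea can be stretched to cover several color blocks in one cell. The following is a rough sketch, not a tested solution for the full file: parse_blocks is a hypothetical helper, and it assumes blocks are separated by whitespace and that every block contains a color key.

```python
import pandas as pd

def parse_blocks(cell):
    # Flatten a cell holding several "color=...,key=value,..." blocks
    # (one block per line) into a single dict with suffixed keys.
    row = {}
    for block in cell.split():  # assumption: whitespace separates blocks
        pairs = [p.split("=", 1) for p in block.split(",") if "=" in p]
        # The value of this block's 'color' key becomes the suffix
        suffix = dict(pairs).get("color", "")
        for key, val in pairs:
            row[f"{key}{suffix}"] = val
    return row

# Hypothetical, shortened sample cell with two color blocks
df = pd.DataFrame({
    "time": [10],
    "labels_stats": ["color=123,brightness=16,box=4\ncolor=1251,brightness=33,box=0"],
})

attrs_df = pd.DataFrame(df["labels_stats"].apply(parse_blocks).to_list())
print(attrs_df.columns.tolist())
```

Rows that lack a given color simply get NaN in that color's columns, because pd.DataFrame aligns the dicts by key.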
Try to process your column separately:
# Extract labels_stats Series and flatten it
df1 = (df.pop('labels_stats').str.split().explode()
         .str.extractall(r'(?P<key>[^=]+)=(?P<val>[^,]+),?')
         .droplevel('match'))
# Add the numeric id (123, 1251, etc)
df1['key'] += df1['val'].where(df1['key'] == 'color').ffill().astype(str)
# Reshape the dataframe as the original one
# Note: 'val' holds strings; if the default aggregation (mean) raises on them,
# pass aggfunc='first' as in the question's EDIT-2
df1 = df1.pivot_table(index=df1.index, columns='key', values='val', sort=False)
# Get the expected output
out = pd.concat([df, df1], axis=1)
Output:
time lockers keys panels glass std_glass avg_glass sand std_sand ... std_clipper80 billedclipper80 avg_billedclipper80 billedbox80 avg_billedbox80 std_billedbox80 box80 avg_box80 std_box80
0 03/24/2018 00:00:00 77787.172081 97.857143 33.714286 686.967347 35284.503611 317.196164 937679.620975 181.610782 ... 89590.988446 293454.0 3811.090909 2879.0 37.38961 297.097457 7217.0 93.727273 765.390606
[1 rows x 124 columns]
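To make the steps easier to follow, here is the same pipeline on a tiny made-up frame (two rows, two color blocks each; the data is hypothetical). It uses aggfunc='first' as in the question's EDIT-2 and then converts the extracted strings with pd.to_numeric, which is an extra step assumed here rather than part of the original answer:

```python
import pandas as pd

# Hypothetical miniature of the real data: the color blocks inside each
# cell are separated by a line break.
df = pd.DataFrame({
    "time": ["03/24/2018 00:00:00", "03/24/2018 00:10:00"],
    "labels_stats": [
        "color=123,brightness=16,box=4\ncolor=1251,brightness=33,box=0",
        "color=123,brightness=20,box=1\ncolor=1251,brightness=40,box=2",
    ],
})

# One row per key=value pair, indexed by the original row number
df1 = (df.pop("labels_stats").str.split().explode()
         .str.extractall(r"(?P<key>[^=]+)=(?P<val>[^,]+),?")
         .droplevel("match"))

# Suffix every key with the color value of the block it belongs to
df1["key"] += df1["val"].where(df1["key"] == "color").ffill().astype(str)

# Back to one row per original row; 'first' keeps each string as it is
df1 = df1.pivot_table(index=df1.index, columns="key", values="val",
                      sort=False, aggfunc="first")

# The extracted values are still strings, so convert them explicitly
df1 = df1.apply(pd.to_numeric)

out = pd.concat([df, df1], axis=1)
print(out)
```

Each original row keeps its own index through explode and extractall, which is what lets pivot_table and concat line the new columns up with the remaining ones.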