Pandas string extract from a dataframe with strings resembling dictionaries
Question:
I am looking to use the Pandas string extract feature.
I have a dataframe like this:
import pandas as pd

lista = [
    "{'FIRST_id': 'awe', 'THIS_id': 'awec_20230222_1626_i0ov0w', 'NOTTHIS_id': 'awep_20230222_1628_p8f5hd52u3oknc24'}",
    "{'FIRST_id': 'awe', 'THIS_id': 'awec_20230222_1626_i0ov0w', 'NOTTHIS_id': 'awep_20230222_1641_jwjajtals49wc88p'}",
]
dfpack = pd.DataFrame(lista, columns=["awesome_config"])
print(dfpack)
So the column "awesome_config" holds strings that look like dictionaries:
awesome_config
0 {'FIRST_id': 'awe', 'THIS_id': 'awec_20230222...
1 {'FIRST_id': 'awe', 'THIS_id': 'awec_20230222...
I want to extract only the "THIS_id" value into its own column.
The result I want is a dataframe with:
THIS_id
awec_20230222_1626_i0ov0w
awec_20230222_1626_i0ov0w
I have been trying something like:
#dd=dfpack['awesome_config'].str.extract(pat= "({'FIRST_id':'awe', 'THIS_id':).")
dd=dfpack['awesome_config'].str.extract(pat= "({'FIRST_id':'awe').")
print(dd)
But both attempts give me a dataframe of NaNs.
How can I use extract correctly here?
Edit
I have come up with this:
dd=dfpack['awesome_config'].str.extract(r"^({'FIRST_id': 'awe', 'THIS_id': )(?P<THIS_id>.*), 'NOTTHIS_id':(?P<restofit>).* ")
but now I get:
0 'awec_20230222_1626_i0ov0w'
1 'awec_20230222_1626_i0ov0w'
Name: THIS_id, dtype: object
So the quotation marks are still there; I need the values without them.
Answers:
You can use ast.literal_eval to evaluate each string into a dict, and then use str.get (or str[]) to pull out the desired key:
from ast import literal_eval
key = 'THIS_id'
dd=pd.DataFrame({key:dfpack['awesome_config'].apply(literal_eval).str[key]})
print(dd)
THIS_id
0 awec_20230222_1626_i0ov0w
1 awec_20230222_1626_i0ov0w
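If you would rather stay with str.extract, as the question asks, a minimal sketch is shown below. The original attempts most likely returned NaN because the patterns have no space after the colon ('FIRST_id':'awe') while the data does ('FIRST_id': 'awe'), so nothing matched. The pattern here is my own suggestion, not part of the answer above: it anchors on the quoted key 'THIS_id' (the leading quote keeps it from matching inside 'NOTTHIS_id') and puts the surrounding quotes outside the capture group, so the extracted value comes back without them. It assumes the values never contain a single quote.

```python
import pandas as pd

lista = [
    "{'FIRST_id': 'awe', 'THIS_id': 'awec_20230222_1626_i0ov0w', 'NOTTHIS_id': 'awep_20230222_1628_p8f5hd52u3oknc24'}",
    "{'FIRST_id': 'awe', 'THIS_id': 'awec_20230222_1626_i0ov0w', 'NOTTHIS_id': 'awep_20230222_1641_jwjajtals49wc88p'}",
]
dfpack = pd.DataFrame(lista, columns=["awesome_config"])

# Match the quoted key, optional whitespace after the colon, then capture
# everything between the value's quotes. The quotes sit outside the named
# group, so they are not part of the extracted text.
pattern = r"'THIS_id':\s*'(?P<THIS_id>[^']*)'"

# With a single named group, str.extract returns a DataFrame whose
# column is named after the group.
dd = dfpack["awesome_config"].str.extract(pattern)
print(dd)
```

The named group gives the resulting column its name, so no renaming is needed afterwards. If the strings are always valid Python literals, ast.literal_eval is the more robust choice; the regex is mainly useful when you only need one field.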