Extract 'url' value from Pandas Series
Question:
I have a Pandas DataFrame with the following column called "image_versions2.candidates":
df_myposts['image_versions2.candidates']
That gives me:
0 [{'width': 750, 'height': 498, 'url': 'https:/XXX'}]
1 NaN
2 [{'width': 750, 'height': 498, 'url': 'https:/YYY'}]
3 [{'width': 750, 'height': 498, 'url': 'https:/ZZZ'}]
I’m trying to extract the URL into a new column called, for example, ‘image_url’.
I can extract a single URL with the following code:
df_myposts['image_versions2.candidates'][0][0]['url']
'https:/XXX'
But for the second row it gives me the following error because of the NaN value:
df_myposts['image_versions2.candidates'][1][0]['url']
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-64-3f0532195cb7> in <module>
----> 1 df_myposts['image_versions2.candidates'][1][0]['url']
TypeError: 'float' object is not subscriptable
I’ve tried a loop with an if condition, but I get similar error messages:
for i in df_myposts['image_versions2.candidates']:
    if type(i[0]) == 'list':
What would be the best way to do this without dropping the NaN rows?
I have another column with the Id, so I want to keep the relation id <-> url intact.
Thanks
Answers:
Use:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[[{'width': 750, 'height': 498, 'url': 'https:/XXX'}], [{'width': 750, 'height': 498, 'url': 'https:/YYY'}], None]})
# df.dropna(inplace=True)  # this would drop the rows with null values
# to preserve rows with NaN, first replace the missing values with a placeholder string
df.fillna('null', inplace=True)
# take the url of the first dict when the cell is a list, otherwise keep the placeholder
df['c'] = df['b'].apply(lambda x: x[0]['url'] if isinstance(x, list) else 'null')
# Output:
   a                                                  b           c
0  1  [{'width': 750, 'height': 498, 'url': 'https:/...  https:/XXX
1  2  [{'width': 750, 'height': 498, 'url': 'https:/...  https:/YYY
2  3                                               null        null
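A variation on the same idea, offered as a sketch rather than as the answerer's code: a small helper (hypothetically named first_url here) checks the cell type before indexing, so missing rows stay NaN and no placeholder string is needed. The loop in the question fails because i[0] is evaluated before any type check, and because type(...) is compared with the string 'list' instead of the list type. The 'id' column and its values are invented for illustration, and each candidate is assumed to be a dict.

```python
import numpy as np
import pandas as pd

# Illustrative data shaped like the question's column; the 'id' values are invented.
df_myposts = pd.DataFrame({
    'id': [101, 102, 103, 104],
    'image_versions2.candidates': [
        [{'width': 750, 'height': 498, 'url': 'https:/XXX'}],
        np.nan,
        [{'width': 750, 'height': 498, 'url': 'https:/YYY'}],
        [{'width': 750, 'height': 498, 'url': 'https:/ZZZ'}],
    ],
})


def first_url(cell):
    """Return the 'url' of the first candidate, or NaN if the cell is not a non-empty list."""
    # Check the type first: NaN is a float and cannot be indexed.
    if isinstance(cell, list) and cell:
        # Assumes each candidate is a dict; a missing 'url' key also yields NaN.
        return cell[0].get('url', np.nan)
    return np.nan


# apply keeps the original index, so every id stays paired with its own url.
df_myposts['image_url'] = df_myposts['image_versions2.candidates'].apply(first_url)
print(df_myposts[['id', 'image_url']])
```

Because nothing is dropped or filled, the new column can later be tested with isna() like any other column.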
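We can use a list comprehension with iterrows here to extract the url value. Note that this version assumes each cell holds a single dict rather than a list of dicts, as in the output shown; for the list-of-dicts data in the question, index with [0] before ['url']: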
df.fillna('None', inplace=True)
df['image_url'] = [
    d['image_versions2.candidates']['url'] if d['image_versions2.candidates'] != 'None' else 'None'
    for idx, d in df.iterrows()
]
print(df)
                          image_versions2.candidates   image_url
0  {'width': 750, 'height': 498, 'url': 'https:/X...  https:/XXX
1                                               None        None
2  {'width': 750, 'height': 498, 'url': 'https:/Y...  https:/YYY
3  {'width': 750, 'height': 498, 'url': 'https:/Z...  https:/ZZZ
Using @amanb’s setup dataframe
df = pd.DataFrame({
    'a':[1,2,3],
    'b':[
        [{'width': 750, 'height': 498, 'url': 'https:/XXX'}],
        [{'width': 750, 'height': 498, 'url': 'https:/YYY'}],
        None
    ]
})
You can use the str accessor of a pandas.Series to grab the first element of each list. Then use to_dict and from_dict:
pd.DataFrame.from_dict(df.b.dropna().str[0].to_dict(), orient='index')
This gives:
   width  height         url
0    750     498  https:/XXX
1    750     498  https:/YYY
You can use join to add these columns to df:
df.join(pd.DataFrame.from_dict(df.b.dropna().str[0].to_dict(), orient='index'))
   a                                                  b  width  height         url
0  1  [{'width': 750, 'height': 498, 'url': 'https:/...  750.0   498.0  https:/XXX
1  2  [{'width': 750, 'height': 498, 'url': 'https:/...  750.0   498.0  https:/YYY
2  3                                               None    NaN     NaN         NaN
Or you can replace the column:
df.assign(b=pd.DataFrame.from_dict(df.b.dropna().str[0].to_dict(), orient='index').url)
   a           b
0  1  https:/XXX
1  2  https:/YYY
2  3         NaN
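If only the url is needed, the two lookups can also be chained through the str accessor. Missing values should pass through it untouched, so neither dropna nor the intermediate DataFrame is required. This is a sketch on the same setup frame, not part of the original answer; Series.str.get is documented to accept a dictionary key as well as a position, but it is worth checking against your pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [
        [{'width': 750, 'height': 498, 'url': 'https:/XXX'}],
        [{'width': 750, 'height': 498, 'url': 'https:/YYY'}],
        None,
    ],
})

# .str[0] takes the first element of each list;
# .str.get('url') then looks up the 'url' key in each resulting dict.
urls = df['b'].str[0].str.get('url')

# Replace column b with the extracted urls; the row with no list stays missing.
result = df.assign(b=urls)
print(result)
```

The result keeps the original index throughout, so the rows of column a stay paired with their own urls.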
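My actual recommendation
But my favorite is using pd.io.json.json_normalize in place of the dictionary magic (since pandas 1.0 the same function is available as pd.json_normalize, and the pd.io.json path is deprecated).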
df.assign(b=pd.io.json.json_normalize(df.b.dropna().str[0]).url)
   a           b
0  1  https:/XXX
1  2  https:/YYY
2  3         NaN
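One caveat when applying this to the data in the question: given a plain list of dicts, json_normalize returns a frame with a fresh 0..n-1 index, so assigning its url column back lines up with the original rows only when the missing values come last, as they do in the setup frame. The sketch below assumes pandas 1.0 or later (pd.json_normalize) and restores the original index before assigning. The column names 'id', 'candidates' and 'image_url' and the id values are illustrative.

```python
import numpy as np
import pandas as pd

# Data shaped like the question's column, with the missing value in the middle.
df = pd.DataFrame({
    'id': [10, 11, 12, 13],
    'candidates': [
        [{'width': 750, 'height': 498, 'url': 'https:/XXX'}],
        np.nan,
        [{'width': 750, 'height': 498, 'url': 'https:/YYY'}],
        [{'width': 750, 'height': 498, 'url': 'https:/ZZZ'}],
    ],
})

# First dict of every non-missing cell; this Series keeps the original labels 0, 2, 3.
first = df['candidates'].dropna().str[0]

# Flatten the dicts into columns, then put the original labels back on the rows.
flat = pd.json_normalize(first.tolist())
flat.index = first.index

# Assignment aligns on the index, so row 1 gets NaN and the other urls stay in place.
df['image_url'] = flat['url']
print(df[['id', 'image_url']])
```

Passing a list and resetting the index explicitly avoids depending on how a particular pandas version treats a Series argument.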