Error on Getting Title for URL in Dataframe (Pandas / Python)
Question:
I’m trying to get the webpage titles for a column of URLs in a dataframe.
Using:
from urllib.request import urlopen
from bs4 import BeautifulSoup
def geturl(x):
    return (BeautifulSoup(urlopen(x)).title.get_text())
geturl('https://msn.com')
Returns:
'MSN | Outlook, Office, Skype, Bing, Breaking News, and Latest Videos'
However, when actually working with a dataframe:
import pandas as pd
data = [['1001','https://msn.com'],['1002','https://google.com'],['1003','https://yahoo.com']]
df = pd.DataFrame(data, columns=['ID', 'URL'])
df
     ID                 URL
0  1001     https://msn.com
1  1002  https://google.com
2  1003   https://yahoo.com
df['title'] = df['url'].apply(geturl())
Results in an error. Any help would be greatly appreciated.
Answers:
When I run your script, I get the following error:
  File "C:\Users\user\PycharmProjects\test\test.py", line 235, in <module>
    df['title'] = df['url'].apply(geturl())
  File "C:\Users\user\PycharmProjects\test\venv\lib\site-packages\pandas\core\frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\user\PycharmProjects\test\venv\lib\site-packages\pandas\core\indexes\base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'url'
In your DataFrame you named the column URL, but the line below looks it up as df["url"]:
df['title'] = df['url'].apply(geturl())
Column labels are case-sensitive, so the lookup raises a KeyError.
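As a rough sketch of a corrected version: besides the column name, note that geturl() in the failing line calls the function with no argument, which would raise a TypeError once the KeyError is fixed. apply expects the function object itself, so it should be passed as geturl. The helper names title_from_html and add_titles, the 10-second timeout, and returning None on a failed request are assumptions made for illustration, not part of the original question.

```python
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup


def title_from_html(html):
    """Return the <title> text of an HTML document, or None if it has none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title is not None else None


def geturl(url):
    """Fetch a page and return its title, or None if the request fails."""
    try:
        # The timeout value is an assumption; adjust it as needed.
        with urlopen(url, timeout=10) as response:
            return title_from_html(response.read())
    except OSError:  # urllib's URLError and HTTPError are OSError subclasses
        return None


def add_titles(frame, fetch=geturl):
    """Return a copy of frame with a 'title' column built from the 'URL' column."""
    out = frame.copy()
    # "URL" must match the column label exactly, and fetch is passed
    # uncalled (no parentheses) so that apply can call it once per row.
    out["title"] = out["URL"].apply(fetch)
    return out


data = [["1001", "https://msn.com"],
        ["1002", "https://google.com"],
        ["1003", "https://yahoo.com"]]
df = pd.DataFrame(data, columns=["ID", "URL"])
```

Calling add_titles(df) then performs one HTTP request per row. Passing a different fetch function makes it possible to try the DataFrame logic without network access.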