How can I extract a domain name and insert it into a new Pandas column?
Question:
I have a Pandas dataframe with many columns, a subset of which is below:
df.info()
SQLDATE datetime64[ns]
SOURCEURL object
df[['SQLDATE', 'SOURCEURL']].sample()
SQLDATE SOURCEURL
2017-01-08 http://www.huffingtonpost.co.uk/a/abc
2018-09-25 http://www.taiwannews.com.tw/a/news/123
2016-03-19 https://www.theguardian.com/a/2016/a/1/ab-bc
2015-12-12 https://nz.news.yahoo.com/world/a/3/a/
2017-04-07 https://www.thelocal.fr/2122/jkl
2019-02-21 http://today.az/news/a/123.html
2018-05-13 The BBC World News Report
I’d like to extract the domain name from each URL into a new column, so that the result looks like this:
df.sample()
SQLDATE SOURCEURL DOMAINNAME
2017-01-08 http://www.huffingtonpost.co.uk/a/abc www.huffingtonpost.co.uk
2018-09-25 http://www.taiwannews.com.tw/a/news/123 www.taiwannews.com.tw
2016-03-19 https://www.theguardian.com/a... www.theguardian.com
2015-12-12 https://nz.news.yahoo.com/world/a/3/a/ nz.news.yahoo.com
2017-04-07 https://www.thelocal.fr/2122/jkl www.thelocal.fr
2019-02-21 http://today.az/news/a/123.html today.az
2018-05-13 The BBC World News Report The BBC World News Report
The dataframe is somewhat messy: a few of the SOURCEURL fields contain plain text rather than a URL. I’d like to simply copy those values over into the DOMAINNAME column. I’m not too familiar with regular expressions, but this might be a case where they apply.
Thanks for reviewing!
Answers:
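This expression
https?://(?:www\.)?([^/]+)
with this simple left boundary
https?://(?:www\.)?
and this capturing group
([^/]+)
might return our desired domain names. Note that the optional (?:www\.)? group sits outside the capturing group, so a leading www. is dropped from the result, unlike the expected output in the question.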
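As a rough illustration of how that expression could be applied, here is a minimal sketch using only the standard re module. The helper name domain_from_regex and the sample list are hypothetical, and the dot after www is escaped so that it matches a literal period. Values without a URL are returned unchanged, which is the fallback the question asks for.

```python
import re

# The answer's pattern, with the dot after "www" escaped to match a literal "."
DOMAIN_RE = re.compile(r"https?://(?:www\.)?([^/]+)")

def domain_from_regex(value):
    """Return the captured host, or the original text when no URL is found."""
    match = DOMAIN_RE.search(value)
    return match.group(1) if match else value

# Sample values taken from the question's SOURCEURL column
samples = [
    "http://www.huffingtonpost.co.uk/a/abc",
    "https://nz.news.yahoo.com/world/a/3/a/",
    "The BBC World News Report",
]
results = [domain_from_regex(s) for s in samples]
# With pandas (assumed usage): df["DOMAINNAME"] = df["SOURCEURL"].apply(domain_from_regex)
```

Because the www. prefix is matched outside the capturing group, the first sample comes back as huffingtonpost.co.uk rather than www.huffingtonpost.co.uk. Removing (?:www\.)? from the pattern would keep the prefix.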
Use urlparse:
from urllib.parse import urlparse
cell = df['SOURCEURL'].iloc[0]  # get a cell from the pandas df
domain = urlparse(cell).netloc
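To make it clearer what urlparse returns, the following small sketch splits one of the question's URLs into its parts and then tries a value that is not a URL. The variable names are only illustrative.

```python
from urllib.parse import urlparse

# Split a full URL into its components
parts = urlparse("https://www.theguardian.com/a/2016/a/1/ab-bc")
scheme = parts.scheme  # the protocol part before "://"
host = parts.netloc    # the network location, i.e. the domain name
path = parts.path      # everything after the domain

# Plain text has no "//" marker, so no network location is found
no_url = urlparse("The BBC World News Report").netloc
```

The empty netloc for plain text matters here: rows that hold text instead of a URL need separate handling if their values should be copied over.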
We can use a positive lookbehind (?<=...) and a positive lookahead (?=...) with regex to get everything between http:// or https:// and the first /:
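# Column 0 captures hosts after http://, column 1 captures hosts after https://
m = df['SOURCEURL'].str.extract('(?<=http://)(.*?)(?=/)|(?<=https://)(.*?)(?=/)')
# Merge the two capture columns, then fall back to the original text for non-URL rows
m = m[0].fillna(m[1]).fillna(df['SOURCEURL'])
df['DOMAINNAME'] = m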
SQLDATE SOURCEURL DOMAINNAME
0 2017-01-08 http://www.huffingtonpost.co.uk/a/abc www.huffingtonpost.co.uk
1 2018-09-25 http://www.taiwannews.com.tw/a/news/123 www.taiwannews.com.tw
2 2016-03-19 https://www.theguardian.com/a/2016/a/1/ab-bc www.theguardian.com
3 2015-12-12 https://nz.news.yahoo.com/world/a/3/a/ nz.news.yahoo.com
4 2017-04-07 https://www.thelocal.fr/2122/jkl www.thelocal.fr
5 2019-02-21 http://today.az/news/a/123.html today.az
6 2018-05-13 The BBC World News Report The BBC World News Report
Edit: to add to mrzasa’s answer, we can also use apply with urllib.parse.urlparse:
from urllib.parse import urlparse
df["DOMAIN"] = df["SOURCEURL"].apply(lambda row: urlparse(row).netloc)
SQLDATE SOURCEURL DOMAIN
0 2017-01-08 http://www.huffingtonpost.co.uk/a/abc www.huffingtonpost.co.uk
1 2018-09-25 http://www.taiwannews.com.tw/a/news/123 www.taiwannews.com.tw
2 2016-03-19 https://www.theguardian.com/a/2016/a/1/ab-bc www.theguardian.com
3 2015-12-12 https://nz.news.yahoo.com/world/a/3/a/ nz.news.yahoo.com
4 2017-04-07 https://www.thelocal.fr/2122/jkl www.thelocal.fr
5 2019-02-21 http://today.az/news/a/123.html today.az
6 2018-05-13 The BBC World News Report
Note that row 6, which contains no URL, ends up with an empty DOMAIN value, because urlparse finds no network location in plain text.
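The question asks for non-URL values to be copied over rather than left blank. One possible way to get that with urlparse is sketched below. The helper name domain_or_original is hypothetical, and the sketch assumes every SOURCEURL value is a string (no NaN or other missing values).

```python
from urllib.parse import urlparse

def domain_or_original(value):
    """Return the URL's network location, or the input unchanged if there is none."""
    netloc = urlparse(value).netloc
    return netloc if netloc else value

# Two values from the question: one URL and one plain-text entry
from_url = domain_or_original("http://www.taiwannews.com.tw/a/news/123")
from_text = domain_or_original("The BBC World News Report")
# With pandas (assumed usage): df["DOMAINNAME"] = df["SOURCEURL"].apply(domain_or_original)
```

Unlike the regex approaches, this keeps the www. prefix, which matches the expected DOMAINNAME column in the question.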