How can I extract a domain name and insert it into a new Pandas column?

Question:

I have a Pandas dataframe with many columns, a subset of which is below:

df.info()

SQLDATE                  datetime64[ns]
SOURCEURL                object

df[['SQLDATE', 'SOURCEURL']].sample()

SQLDATE    SOURCEURL
2017-01-08 http://www.huffingtonpost.co.uk/a/abc
2018-09-25 http://www.taiwannews.com.tw/a/news/123
2016-03-19 https://www.theguardian.com/a/2016/a/1/ab-bc
2015-12-12 https://nz.news.yahoo.com/world/a/3/a/
2017-04-07 https://www.thelocal.fr/2122/jkl
2019-02-21 http://today.az/news/a/123.html
2018-05-13 The BBC World News Report

I’d like to extract the domain name from each URL into a new column, so that the dataframe looks like this:

df.sample()

SQLDATE    SOURCEURL                               DOMAINNAME
2017-01-08 http://www.huffingtonpost.co.uk/a/abc   www.huffingtonpost.co.uk
2018-09-25 http://www.taiwannews.com.tw/a/news/123 www.taiwannews.com.tw
2016-03-19 https://www.theguardian.com/a...        www.theguardian.com
2015-12-12 https://nz.news.yahoo.com/world/a/3/a/  nz.news.yahoo.com
2017-04-07 https://www.thelocal.fr/2122/jkl        www.thelocal.fr
2019-02-21 http://today.az/news/a/123.html         today.az
2018-05-13 The BBC World News Report               The BBC World News Report

The dataframe is messy: a few of the SOURCEURL fields contain plain text rather than a URL. For those rows, I’d like to copy the value into the DOMAINNAME column unchanged. I’m not too familiar with regular expressions, but this might be a case where they apply.

Thanks for reviewing!

Asked By: Sepa


Answers:

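This expression

https?://(?:www\.)?([^/]+)

with this simple left boundary

https?://(?:www\.)?

and this capturing group

([^/]+)

might return our desired domain names. Note that the optional (?:www\.)? sits outside the capturing group, so a leading www. is dropped from the result; remove that part of the boundary if you want to keep it, as in the question's expected output.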
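
To show how this expression might be applied to the dataframe, here is a minimal sketch. The sample rows are a hypothetical stand-in for the question's data, and the fallback to the original text for non-URL rows is my assumption about what the asker wants, not part of the expression itself.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's SOURCEURL column.
df = pd.DataFrame({
    "SOURCEURL": [
        "http://www.huffingtonpost.co.uk/a/abc",
        "https://nz.news.yahoo.com/world/a/3/a/",
        "The BBC World News Report",
    ]
})

# Same expression as above; the dot after "www" is escaped to match a literal dot.
pattern = r"https?://(?:www\.)?([^/]+)"

# str.extract yields NaN where the pattern does not match,
# so those rows fall back to the original SOURCEURL text.
extracted = df["SOURCEURL"].str.extract(pattern, expand=False)
df["DOMAINNAME"] = extracted.fillna(df["SOURCEURL"])

print(df)
```

With this pattern the first row comes out without its www. prefix, since that prefix is consumed by the left boundary.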

Answered By: Emma

Use urlparse:

from urllib.parse import urlparse

cell = df["SOURCEURL"].iloc[0]  # any single URL string from the dataframe
domain = urlparse(cell).netloc
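
Since urlparse returns an empty netloc for text that is not a URL, one way to handle the messy rows is to wrap the call in a small helper. This is only a sketch: extract_domain is a hypothetical name, and returning the input unchanged is an assumption based on what the question asks for.

```python
from urllib.parse import urlparse


def extract_domain(cell):
    """Return the network location of a URL, or the input unchanged if it has none."""
    netloc = urlparse(cell).netloc
    # An empty netloc means the text had no "scheme://host" part.
    return netloc if netloc else cell


print(extract_domain("http://today.az/news/a/123.html"))
print(extract_domain("The BBC World News Report"))
```

The helper can then be applied to each value of the SOURCEURL column.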
Answered By: mrzasa

We can use positive lookbehind ?<= and positive lookahead ?= with regex, to get everything between http:// OR https:// and the first /:

m = df['SOURCEURL'].str.extract('(?<=http://)(.*?)(?=/)|(?<=https://)(.*?)(?=/)')
m = m[0].fillna(m[1]).fillna(df['SOURCEURL'])

df['DOMAINNAME'] = m
      SQLDATE                                     SOURCEURL                 DOMAINNAME
0  2017-01-08         http://www.huffingtonpost.co.uk/a/abc   www.huffingtonpost.co.uk
1  2018-09-25       http://www.taiwannews.com.tw/a/news/123      www.taiwannews.com.tw
2  2016-03-19  https://www.theguardian.com/a/2016/a/1/ab-bc        www.theguardian.com
3  2015-12-12        https://nz.news.yahoo.com/world/a/3/a/          nz.news.yahoo.com
4  2017-04-07              https://www.thelocal.fr/2122/jkl            www.thelocal.fr
5  2019-02-21               http://today.az/news/a/123.html                   today.az
6  2018-05-13                     The BBC World News Report  The BBC World News Report
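
To make the lookaround behaviour easier to see, here is the same pattern run on single strings with the standard re module. It is a sketch only; first_domain is a hypothetical helper name. One thing it illustrates is that the lookahead needs a / after the domain, so a bare URL with no path does not match.

```python
import re

# Lookbehind requires the scheme just before the match;
# lookahead requires a "/" just after it.
pattern = re.compile(r"(?<=http://)(.*?)(?=/)|(?<=https://)(.*?)(?=/)")


def first_domain(text):
    """Return the text between the scheme and the first "/", or None if no match."""
    match = pattern.search(text)
    if match is None:
        return None
    # Only one of the two groups takes part in any given match.
    return match.group(1) or match.group(2)


print(first_domain("https://www.thelocal.fr/2122/jkl"))
print(first_domain("http://today.az"))  # no "/" after the domain
```

In the dataframe version above, rows like the second one would fall through to the final fillna and keep the full SOURCEURL value.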

Edit: to add to mrzasa’s answer, we can also use apply with urllib.parse.urlparse. Note that netloc is empty for values that are not URLs, as in row 6 of the output:

from urllib.parse import urlparse

df["DOMAIN"] = df["SOURCEURL"].apply(lambda row: urlparse(row).netloc)
      SQLDATE                                     SOURCEURL                    DOMAIN
0  2017-01-08         http://www.huffingtonpost.co.uk/a/abc  www.huffingtonpost.co.uk
1  2018-09-25       http://www.taiwannews.com.tw/a/news/123     www.taiwannews.com.tw
2  2016-03-19  https://www.theguardian.com/a/2016/a/1/ab-bc       www.theguardian.com
3  2015-12-12        https://nz.news.yahoo.com/world/a/3/a/         nz.news.yahoo.com
4  2017-04-07              https://www.thelocal.fr/2122/jkl           www.thelocal.fr
5  2019-02-21               http://today.az/news/a/123.html                  today.az
6  2018-05-13                     The BBC World News Report
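
The question asks for non-URL values to be copied into the new column, so the blank entries can be filled afterwards. A minimal sketch, assuming a blank netloc always means "not a URL"; the two sample rows are hypothetical:

```python
import pandas as pd
from urllib.parse import urlparse

# Hypothetical two-row sample: one URL and one plain-text value.
df = pd.DataFrame({
    "SOURCEURL": [
        "https://www.thelocal.fr/2122/jkl",
        "The BBC World News Report",
    ]
})

domains = df["SOURCEURL"].apply(lambda url: urlparse(url).netloc)

# Keep the parsed domain where it is non-empty; otherwise use the original text.
df["DOMAINNAME"] = domains.where(domains != "", df["SOURCEURL"])

print(df)
```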
Answered By: Erfan