Scrape the absolute URL instead of a relative path in Python
Question:
I’m trying to get all the hrefs from an HTML page and store them in a list for future processing, like this:
Example URL: www.example-page-xl.com
<body>
<section>
<a href="/helloworld/index.php"> Hello World </a>
</section>
</body>
I’m using the following code to list the hrefs:
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://www.example-page-xl.com').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
section = soup.section

for url in section.find_all('a'):
    print(url.get('href'))
However, I would like to store the URL as www.example-page-xl.com/helloworld/index.php and not just the relative path /helloworld/index.php.
Naively appending/joining the URL with the relative path isn’t what I want, since the dynamic links may vary when I join the URL and the relative path.
In a nutshell, I would like to scrape the absolute URL, not the relative paths alone (and without plain string joining).
Answers:
urllib.parse.urljoin() might help. It does a join, but it is smart about it and handles both relative and absolute paths. Note that this is Python 3 code.
>>> import urllib.parse
>>> base = 'https://www.example-page-xl.com'
>>> urllib.parse.urljoin(base, '/helloworld/index.php')
'https://www.example-page-xl.com/helloworld/index.php'
>>> urllib.parse.urljoin(base, 'https://www.example-page-xl.com/helloworld/index.php')
'https://www.example-page-xl.com/helloworld/index.php'
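Putting this together with the question's loop, here is a minimal runnable sketch. It parses an inline HTML string instead of fetching the page (so the base URL is an assumption), and uses a small stdlib `HTMLParser` subclass as a stand-in for BeautifulSoup so the snippet has no third-party dependencies:

```python
import urllib.parse
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags (stdlib stand-in for BeautifulSoup)."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.hrefs.append(value)

# Assumed base URL; in practice, use the URL you actually fetched the page from.
base = 'https://www.example-page-xl.com'
html = '<body><section><a href="/helloworld/index.php"> Hello World </a></section></body>'

parser = LinkCollector()
parser.feed(html)

# urljoin resolves relative hrefs against the base and leaves absolute hrefs untouched.
links = [urllib.parse.urljoin(base, h) for h in parser.hrefs]
print(links)  # ['https://www.example-page-xl.com/helloworld/index.php']
```

With BeautifulSoup the loop is the same; only the href extraction changes.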
In this case urllib.parse.urljoin helps you. You should modify your code like this:
import bs4 as bs
import urllib.request
from urllib.parse import urljoin

web_url = 'https://www.example-page-xl.com'
sauce = urllib.request.urlopen(web_url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
section = soup.section

for url in section.find_all('a'):
    print(urljoin(web_url, url.get('href')))
Here urljoin handles both absolute and relative paths.
I see the solution mentioned here to be the most robust.
import urllib.parse

def base_url(url, with_path=False):
    parsed = urllib.parse.urlparse(url)
    path = '/'.join(parsed.path.split('/')[:-1]) if with_path else ''
    parsed = parsed._replace(path=path, params='', query='', fragment='')
    return parsed.geturl()
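A quick check of what base_url returns (the helper is restated here so the snippet runs on its own; the example URL is made up):

```python
import urllib.parse

def base_url(url, with_path=False):
    parsed = urllib.parse.urlparse(url)
    # Drop the last path segment when keeping the path, or the whole path otherwise,
    # and strip params, query, and fragment in one _replace call.
    path = '/'.join(parsed.path.split('/')[:-1]) if with_path else ''
    parsed = parsed._replace(path=path, params='', query='', fragment='')
    return parsed.geturl()

url = 'https://www.example-page-xl.com/helloworld/index.php?q=1#top'
print(base_url(url))                  # 'https://www.example-page-xl.com'
print(base_url(url, with_path=True))  # 'https://www.example-page-xl.com/helloworld'
```

With with_path=True the parent directory of the resource is kept, which is useful when relative links are resolved against the current directory rather than the site root.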
I think another option is to go with the _replace method of the result of urllib.parse.urlparse. Most of the time the base URL will change, so instead of declaring it with a fixed value, I use the URL from the source and change its path:
from urllib.parse import urlparse

old_link = "https://www.example-page-xl.com/old-path"
new_link = urlparse(old_link)._replace(path="new-path").geturl()
# "https://www.example-page-xl.com/new-path"
Here is the structure of a URL: scheme://netloc/path;parameters?query#fragment. See the urllib.parse documentation for details.
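The components named in that structure map directly onto the attributes of the urlparse result; a small demonstration with a made-up URL that exercises every component:

```python
from urllib.parse import urlparse

# Each piece of scheme://netloc/path;parameters?query#fragment
# lands in the correspondingly named attribute of the ParseResult.
parts = urlparse("https://www.example-page-xl.com/path;params?q=1#frag")
print(parts.scheme)    # 'https'
print(parts.netloc)    # 'www.example-page-xl.com'
print(parts.path)      # '/path'
print(parts.params)    # 'params'
print(parts.query)     # 'q=1'
print(parts.fragment)  # 'frag'
```

Any of these attributes can be swapped out with _replace() before calling geturl() to reassemble the URL.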