How to extract link to Package Sources from Arch User Repository (AUR) website
Question:
I’m using BeautifulSoup to extract this line:
<a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a>
from a webpage.
<div>
<ul id="pkgsrcslist">
<li>
<a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a>
</li>
</ul>
</div>
Specifically, I want this part: iwgtk-0.8.tar.gz
I’ve written this code:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import requests

url = "https://aur.archlinux.org/packages/iwgtk"
#url = sys.argv[1]
page = requests.get(url)
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'html.parser')
    urls = []
    # loop over the [li] tags
    for tag in soup.find_all('li'):
        atag = tag.find('a')
        try:
            if 'href' in atag.attrs:
                url = atag.get('href').contents[0]
                urls.append(url)
        except:
            pass
    # print all the urls stored in the urls list
    for url in urls:
        print(url)
and I assume it is this line
url = atag.get('href').contents[0]
that fails. I’ve tried
url = atag.get('a').contents[0]
but that failed too.
Answers:
Try to select your elements more specifically:
soup.find('ul', {'id': 'pkgsrcslist'}).find_all('a')
or, more conveniently, via a CSS selector:
soup.select('#pkgsrcslist a')
Note that your original code fails because atag.get('href') returns a string, and a string has no .contents attribute. Use get('href') to get the URL and get_text() (or .text) to get the link text, or use both and store them as key/value pairs in a dict:
...
soup = BeautifulSoup(page.text, 'html.parser')
pkgs = {}
for tag in soup.select('#pkgsrcslist a'):
    print('url: ' + tag.get('href'))
    print('text: ' + tag.text)
    # update a dict of package archive names and links
    pkgs.update({
        tag.text: tag.get('href')
    })
Example
from bs4 import BeautifulSoup
import requests

url = "https://aur.archlinux.org/packages/iwgtk"
page = requests.get(url)
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'html.parser')
    pkgs = {}
    for tag in soup.select('#pkgsrcslist a'):
        pkgs.update({
            tag.text: tag.get('href')
        })
    print(pkgs)
Output
{'iwgtk-0.8.tar.gz': 'https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz'}
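If you prefer to avoid the third-party dependency, the same extraction can be done with the standard library's html.parser. This is a minimal sketch matched to the exact HTML structure shown in the question (the class and attribute names are illustrative, not part of any library API):

```python
from html.parser import HTMLParser

# The snippet from the question, used as sample input
HTML = '''
<div>
<ul id="pkgsrcslist">
<li>
<a href="https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz">iwgtk-0.8.tar.gz</a>
</li>
</ul>
</div>
'''

class PkgSrcParser(HTMLParser):
    """Collect {link text: href} for <a> tags inside <ul id="pkgsrcslist">."""
    def __init__(self):
        super().__init__()
        self.in_list = False
        self.current_href = None
        self.pkgs = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'ul' and attrs.get('id') == 'pkgsrcslist':
            self.in_list = True
        elif tag == 'a' and self.in_list:
            self.current_href = attrs.get('href')

    def handle_endtag(self, tag):
        if tag == 'ul':
            self.in_list = False
        elif tag == 'a':
            self.current_href = None

    def handle_data(self, data):
        # Record the text between <a> and </a> against its href
        if self.current_href and data.strip():
            self.pkgs[data.strip()] = self.current_href

parser = PkgSrcParser()
parser.feed(HTML)
print(parser.pkgs)
# {'iwgtk-0.8.tar.gz': 'https://github.com/J-Lentz/iwgtk/archive/v0.8.tar.gz'}
```

The BeautifulSoup version is shorter and more robust, but this shows the same idea with no dependencies.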
You can also query the AUR using the official aurweb RPC interface, with type=search and a search term as arg:
https://aur.archlinux.org/rpc/?v=5&type=search&arg=iwgtk
It returns JSON by default:
{
  "resultcount": 1,
  "results": [
    {
      "Description": "Lightweight wireless network management GUI (front-end for iwd)",
      "FirstSubmitted": 1597306328,
      "ID": 1124939,
      "LastModified": 1660234078,
      "Maintainer": "J-Lentz",
      "Name": "iwgtk",
      "NumVotes": 19,
      "OutOfDate": null,
      "PackageBase": "iwgtk",
      "PackageBaseID": 156689,
      "Popularity": 1.748972,
      "URL": "https://github.com/J-Lentz/iwgtk",
      "URLPath": "/cgit/aur.git/snapshot/iwgtk.tar.gz",
      "Version": "0.8-2"
    }
  ],
  "type": "search",
  "version": 5
}
The information you want can be found at the JSON path .results[].URLPath.
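As a sketch of how that RPC response could be consumed, using the JSON shown above as sample data (prefixing URLPath with https://aur.archlinux.org to get a full snapshot URL is an assumption based on the path being host-relative):

```python
import json

# Abridged version of the RPC response shown above, used as sample data
rpc_response = '''
{
  "resultcount": 1,
  "results": [
    {
      "Name": "iwgtk",
      "URL": "https://github.com/J-Lentz/iwgtk",
      "URLPath": "/cgit/aur.git/snapshot/iwgtk.tar.gz",
      "Version": "0.8-2"
    }
  ],
  "type": "search",
  "version": 5
}
'''

data = json.loads(rpc_response)
for result in data["results"]:
    # URLPath is relative to the AUR host, so join it with the base URL
    snapshot_url = "https://aur.archlinux.org" + result["URLPath"]
    print(result["Name"], result["Version"], snapshot_url)
# iwgtk 0.8-2 https://aur.archlinux.org/cgit/aur.git/snapshot/iwgtk.tar.gz
```

In a real script you would fetch the response with requests.get(...).json() instead of hard-coding it.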