How to scrape the specific text from kworb and extract it as an excel file?
Question:
I’m trying to scrape the positions, the artists and the songs from a ranking list on kworb. For example: https://kworb.net/spotify/country/us_weekly.html
I used the following script:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://kworb.net/spotify/country/us_weekly.html")
content = response.content
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.get_text())
And here is the output:
ITUNES
WORLDWIDE
ARTISTS
CHARTS
DON'T PRAY
RADIO
SPOTIFY
YOUTUBE
TRENDING
HOME
CountriesArtistsListenersCities
Spotify Weekly Chart - United States - 2023/02/09 | Totals
PosP+Artist and TitleWksPk(x?)StreamsStreams+Total
1
+1
SZA - Kill Bill
9
1(x5)
15,560,813
+247,052
148,792,089
2
-1
Miley Cyrus - Flowers
4
1(x3)
13,934,413
-4,506,662
75,009,251
3
+20
Morgan Wallen - Last Night
2
3(x1)
11,560,741
+6,984,649
16,136,833
...
How do I only get the positions, the artists and the songs separately and store it as an excel first?
expected output:
Pos Artist Songs
1 SZA Kill Bill
2 Miley Cyrus Flowers
3 Morgan Wallen Last Night
...
Answers:
Best practice to scrape tables is using pandas.read_html()
it uses BeautifulSoup
under the hood for you.
import pandas as pd
#find table by id and select first index from list of dfs
df = pd.read_html('https://kworb.net/spotify/country/us_weekly.html', attrs={'id':'spotifyweekly'})[0]
#split the column by delimiter and creat your expected columns
df[['Artist','Song']]=df['Artist and Title'].str.split(' - ', n=1, expand=True)
#pick your columns and export to excel
df[['Pos','Artist','Song']].to_excel('yourfile.xlsx', index = False)
Alternative based on direct approach:
import requests
from bs4 import BeautifulSoup
import pandas as pd
soup = BeautifulSoup(requests.get("https://kworb.net/spotify/country/hk_weekly.html").content, 'html.parser')
data = []
for e in soup.select('#spotifyweekly tr:has(td)'):
data .append({
'Pos':e.td.text,
'Artist':e.a.text,
'Song':e.a.find_next_sibling('a').text
})
pd.DataFrame(data).to_excel('yourfile.xlsx', index = False)
Outputs
Pos
Artist
Song
1
SZA
Kill Bill
2
Miley Cyrus
Flowers
3
Morgan Wallen
Last Night
4
Metro Boomin
Creepin’
5
Lil Uzi Vert
Just Wanna Rock
6
Drake
Rich Flex
7
Metro Boomin
Superhero (Heroes & Villains) [with Future & Chris Brown]
8
Sam Smith
Unholy
…
I’m trying to scrape the positions, the artists and the songs from a ranking list on kworb. For example: https://kworb.net/spotify/country/us_weekly.html
I used the following script:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://kworb.net/spotify/country/us_weekly.html")
content = response.content
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.get_text())
And here is the output:
ITUNES
WORLDWIDE
ARTISTS
CHARTS
DON'T PRAY
RADIO
SPOTIFY
YOUTUBE
TRENDING
HOME
CountriesArtistsListenersCities
Spotify Weekly Chart - United States - 2023/02/09 | Totals
PosP+Artist and TitleWksPk(x?)StreamsStreams+Total
1
+1
SZA - Kill Bill
9
1(x5)
15,560,813
+247,052
148,792,089
2
-1
Miley Cyrus - Flowers
4
1(x3)
13,934,413
-4,506,662
75,009,251
3
+20
Morgan Wallen - Last Night
2
3(x1)
11,560,741
+6,984,649
16,136,833
...
How do I only get the positions, the artists and the songs separately and store it as an excel first?
expected output:
Pos Artist Songs
1 SZA Kill Bill
2 Miley Cyrus Flowers
3 Morgan Wallen Last Night
...
Best practice to scrape tables is using pandas.read_html()
it uses BeautifulSoup
under the hood for you.
import pandas as pd
#find table by id and select first index from list of dfs
df = pd.read_html('https://kworb.net/spotify/country/us_weekly.html', attrs={'id':'spotifyweekly'})[0]
#split the column by delimiter and creat your expected columns
df[['Artist','Song']]=df['Artist and Title'].str.split(' - ', n=1, expand=True)
#pick your columns and export to excel
df[['Pos','Artist','Song']].to_excel('yourfile.xlsx', index = False)
Alternative based on direct approach:
import requests
from bs4 import BeautifulSoup
import pandas as pd
soup = BeautifulSoup(requests.get("https://kworb.net/spotify/country/hk_weekly.html").content, 'html.parser')
data = []
for e in soup.select('#spotifyweekly tr:has(td)'):
data .append({
'Pos':e.td.text,
'Artist':e.a.text,
'Song':e.a.find_next_sibling('a').text
})
pd.DataFrame(data).to_excel('yourfile.xlsx', index = False)
Outputs
Pos | Artist | Song |
---|---|---|
1 | SZA | Kill Bill |
2 | Miley Cyrus | Flowers |
3 | Morgan Wallen | Last Night |
4 | Metro Boomin | Creepin’ |
5 | Lil Uzi Vert | Just Wanna Rock |
6 | Drake | Rich Flex |
7 | Metro Boomin | Superhero (Heroes & Villains) [with Future & Chris Brown] |
8 | Sam Smith | Unholy |
…