How to scrape the categories belonging to the datasets with BeautifulSoup?
Question:
I scraped a site with a URL like this: https://takipcimerkezi.net/services
I tried to get every piece of information from the table except "aciklama".
This is my code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
url = 'https://takipcimerkezi.net/services'
page = requests.get(url)
table = BeautifulSoup(page.content, 'html.parser')

max_sipariş = table.find_all(attrs={"data-label": "Maksimum Sipariş"})
maxsiparis = []
for i in max_sipariş:
    value = i.text
    maxsiparis.append(value)

min_sipariş = table.find_all(attrs={"data-label": "Minimum Sipariş"})
minsiparis = []
for i in min_sipariş:
    value = i.text
    minsiparis.append(value)

bin_adet_fiyati = table.find_all(attrs={"data-label": "1000 adet fiyatı "})
binadetfiyat = []
for i in bin_adet_fiyati:
    value = i.text.strip()
    binadetfiyat.append(value)

id = table.find_all(attrs={"data-label": "ID"})
idlist = []
for i in id:
    value = i.text
    idlist.append(value)

servis = table.find_all(attrs={"data-label": "Servis"})
servislist = []
for i in servis:
    value = i.text
    servislist.append(value)
Then I took the values and put them into an Excel sheet.
But the last thing I need is a new column that records which category each row belongs to.
E.g. the row with id "158" is in the "Önerilen Servisler" category, likewise ids "4", "1526", "1", "1494" and so on up to id "1537", whose row needs to be in the "Instagram %100 Gerçek Premium Servisler" category.
I hope I explained the problem well. How can I do this?
Answers:
To add a parent-category column to the dataframe you can use the following example:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://takipcimerkezi.net/services"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_data = []
for tr in soup.select("tr:not(:has(td[colspan], th))"):
    prev = tr.find_previous("td", attrs={"colspan": True})
    tds = [td.get_text(strip=True) for td in tr.select("td")]
    all_data.append([prev.get_text(strip=True), *tds[:5]])
df = pd.DataFrame(
    all_data,
    columns=["Parent", "ID", "Servis", "1000 adet fiyatı", "Minimum Sipariş", "Maksimum Sipariş"],
)
print(df.head())
df.to_csv("data.csv", index=False)
Prints:
Parent ID Servis 1000 adet fiyatı Minimum Sipariş Maksimum Sipariş
0 Önerilen Servisler 158 3613- Instagram Garantili Takipçi | Max 3M | Ömür Boyu Garantili | Düşüş Çok Az | Anlık Başlar | Günde 150K 13.17 TL 100 3000000
1 Önerilen Servisler 4 1495- Instagram Garantili Takipçi | Max 1M | 365 Gün Telafi Garantili | Hızlı Başlar | 30 Gün Telafi Butonu Aktif 12.07 TL 50 5000000
2 Önerilen Servisler 1526 4513- Instagram Takipçi | Max 500K | Yabancı Gerçek Kullanıcılar | Düşme Az | Anlık Başlar | Günde 250K 22.28 TL 10000 500000
3 Önerilen Servisler 1 3033- Instagram Türk Takipçi | Max 25K | %90 Türk | İptal Butonu Aktif | Anlık Başlar | Saatte 1K-2K 21.49 TL 10 25000
4 Önerilen Servisler 1494 991- Instagram Çekilişle Takipçi | %100 Organik Türk | Max 10K | Günlük İşleme Alınır | Günde 5K Atar ! 37.50 TL 1000 10000
and saves data.csv
EDIT: A brief explanation of the code above:
- First I select all data rows, i.e. rows that contain neither a table header (<th>) nor a cell with a colspan= attribute (the data in those colspan rows becomes our "Parent" column). This is done with the CSS selector "tr:not(:has(td[colspan], th))".
- When iterating over these data rows I need to know the "Parent". For this I use tr.find_previous("td", attrs={"colspan": True}), which selects the nearest preceding <td> with a colspan= attribute.
- I get all the text from the <td> tags in the row and store it in the all_data list.
- From this list I create a pandas DataFrame.
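The selector-plus-find_previous trick described above can be seen on a minimal, hypothetical table (the HTML below is made up to mimic the structure of the real page, not taken from it):

```python
from bs4 import BeautifulSoup

# Hypothetical table: a header row, category rows (colspan cells),
# and plain data rows, mimicking the structure of the real page.
html = """
<table>
  <tr><th>ID</th><th>Servis</th></tr>
  <tr><td colspan="2">Category A</td></tr>
  <tr><td>1</td><td>Service one</td></tr>
  <tr><td>2</td><td>Service two</td></tr>
  <tr><td colspan="2">Category B</td></tr>
  <tr><td>3</td><td>Service three</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
# Skip the header row (has <th>) and category rows (have td[colspan]) ...
for tr in soup.select("tr:not(:has(td[colspan], th))"):
    # ... then walk backwards in document order to the nearest
    # colspan cell: that is the row's category.
    parent = tr.find_previous("td", attrs={"colspan": True})
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    rows.append([parent.get_text(strip=True), *cells])

print(rows)
# [['Category A', '1', 'Service one'], ['Category A', '2', 'Service two'],
#  ['Category B', '3', 'Service three']]
```

Because find_previous searches backwards through the document, every data row picks up the most recent category row above it, which is exactly the grouping the page's layout implies.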
Simply adapt the approach from the previous post: scrape the categories first so you can map them while scraping the data:
categories = dict((e.get('data-filter-category-id'),e.get('data-filter-category-name')) for e in soup.select('.dropdown-menu a[data-filter-category-name]'))
Example
from bs4 import BeautifulSoup
import pandas as pd
import requests
url='https://takipcimerkezi.net/services'
soup = BeautifulSoup(
    requests.get(
        url,
        cookies={'user_currency':'27d210f1c3ff7fe5d18b5b41f9b8bb351dd29922d175e2a144af68924e3064d1a%3A2%3A%7Bi%3A0%3Bs%3A13%3A%22user_currency%22%3Bi%3A1%3Bs%3A3%3A%22EUR%22%3B%7D;'}
    ).text,
    'html.parser'
)
categories = dict((e.get('data-filter-category-id'),e.get('data-filter-category-name')) for e in soup.select('.dropdown-menu a[data-filter-category-name]'))
data = []
for e in soup.select('#service-tbody tr:has([data-label="Minimum Sipariş"])'):
    d = dict(zip(e.find_previous('thead').stripped_strings, e.stripped_strings))
    d['category'] = categories[e.get('data-filter-table-category-id')] if e.get('data-filter-table-category-id') else None
    data.append(d)
pd.DataFrame(data)[['ID', 'category', 'Servis', '1000 adet fiyatı', 'Minimum Sipariş','Maksimum Sipariş']]
Output
        ID               category  Servis                                                                                                             1000 adet fiyatı  Minimum Sipariş  Maksimum Sipariş
0      158     Önerilen Servisler  3613- Instagram Garantili Takipçi | Max 3M | Ömür Boyu Garantili | Düşüş Çok Az | Anlık Başlar | Günde 150K        ≈ 0.6573 €        100              3000000
1        4     Önerilen Servisler  1495- Instagram Garantili Takipçi | Max 1M | 365 Gün Telafi Garantili | Hızlı Başlar | 30 Gün Telafi Butonu Aktif  ≈ 0.6024 €        50               5000000
…
1326  1039  Spotify Türk Dinlenme  1833-⬆️ Spotify Premium Türk Dinlenme | 5K Tek Paket | Normal                                                      ≈ 4.9778 €        5000             5000
1327  1040  Spotify Türk Dinlenme  1834-⬆️ Spotify Premium Türk Dinlenme | 10K Tek Paket | Normal                                                     ≈ 4.9778 €        10000            10000
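The dict(zip(...)) pattern used in the answer above pairs each header cell with the matching data cell, so the column names come straight from the page's <thead>. A minimal sketch on made-up HTML (the table below is hypothetical, only shaped like the real page; the category map would normally be built from the dropdown menu as shown above):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking the page: a thead with column names and a
# tbody whose rows carry a category id attribute.
html = """
<table>
  <thead><tr><th>ID</th><th>Servis</th><th>Minimum Sipariş</th></tr></thead>
  <tbody id="service-tbody">
    <tr data-filter-table-category-id="7">
      <td data-label="ID">158</td>
      <td data-label="Servis">Example service</td>
      <td data-label="Minimum Sipariş">100</td>
    </tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Stand-in for the categories dict scraped from '.dropdown-menu a[...]'.
categories = {"7": "Önerilen Servisler"}

data = []
for tr in soup.select('#service-tbody tr:has([data-label="Minimum Sipariş"])'):
    # Zip header texts with this row's cell texts -> {'ID': '158', ...}
    d = dict(zip(tr.find_previous("thead").stripped_strings, tr.stripped_strings))
    # Look up the human-readable category name; None if the id is unknown.
    d["category"] = categories.get(tr.get("data-filter-table-category-id"))
    data.append(d)

print(data)
# [{'ID': '158', 'Servis': 'Example service', 'Minimum Sipariş': '100',
#   'category': 'Önerilen Servisler'}]
```

Because the keys come from the live <thead>, the resulting dicts (and the DataFrame built from them) automatically follow the site's column names if the table layout changes.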