How to scrape the categories belonging to the datasets with BeautifulSoup?

Question:

I scraped a site with this URL: https://takipcimerkezi.net/services

I tried to extract every column of the table except "aciklama".

This is my code:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

url='https://takipcimerkezi.net/services'
page= requests.get(url)
table=BeautifulSoup(page.content, 'html.parser')

max_sipariş = table.find_all(attrs={"data-label": "Maksimum Sipariş"})
maxsiparis = []
for i in max_sipariş:
    value = i.text
    maxsiparis.append(value)

min_sipariş = table.find_all(attrs={"data-label": "Minimum Sipariş"})
minsiparis = []
for i in min_sipariş:
    value = i.text
    minsiparis.append(value)
bin_adet_fiyati= table.find_all(attrs={"data-label":"1000 adet fiyatı "})
binadetfiyat=[]
for i in bin_adet_fiyati:
    value=i.text.strip()
    binadetfiyat.append(value)

id= table.find_all(attrs={"data-label":"ID"})
idlist=[]
for i in id:
    value=i.text
    idlist.append(value)

servis= table.find_all(attrs={"data-label":"Servis"})
servislist=[]
for i in servis:
    value=i.text
    servislist.append(value)
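As an aside, the five near-identical loops above can be collapsed into a single dictionary comprehension over the column labels. A minimal sketch, using an invented HTML snippet in place of the real page (the data-label values are taken from the code above):

```python
from bs4 import BeautifulSoup

# Invented snippet standing in for the real table.
html = """
<table><tbody>
<tr><td data-label="ID">158</td><td data-label="Servis">Takipçi</td></tr>
<tr><td data-label="ID">4</td><td data-label="Servis">Beğeni</td></tr>
</tbody></table>
"""
table = BeautifulSoup(html, "html.parser")

labels = ["ID", "Servis"]  # extend with the other data-label values as needed
columns = {
    label: [td.get_text(strip=True)
            for td in table.find_all(attrs={"data-label": label})]
    for label in labels
}
print(columns)
# {'ID': ['158', '4'], 'Servis': ['Takipçi', 'Beğeni']}
```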
 

Then I took the values and put them into an Excel sheet.

But the last thing I need is to add a new column indicating which category each row belongs to.

E.g. the row with ID "158" is in the "Önerilen Servisler" category; likewise IDs "4", "1526", "1", "1494", and so on up to ID "1537" belong in the "Instagram %100 Gerçek Premium Servisler" category.

I hope I explained the problem well. How can I do this?

Asked By: luthierz


Answers:

To add a parent-category column to the DataFrame, you can use the following example:

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = "https://takipcimerkezi.net/services"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for tr in soup.select("tr:not(:has(td[colspan], th))"):
    prev = tr.find_previous("td", attrs={"colspan": True})
    tds = [td.get_text(strip=True) for td in tr.select("td")]
    all_data.append([prev.get_text(strip=True), *tds[:5]])

df = pd.DataFrame(
    all_data,
    columns=["Parent", "ID", "Servis", "1000 adet fiyatı", "Minimum Sipariş", "Maksimum Sipariş"],
)
print(df.head())
df.to_csv("data.csv", index=False)

Prints:

               Parent    ID                                                                                                              Servis 1000 adet fiyatı Minimum Sipariş Maksimum Sipariş
0  Önerilen Servisler   158      3613-  Instagram Garantili Takipçi | Max 3M | Ömür Boyu Garantili | Düşüş Çok Az | Anlık Başlar | Günde 150K           13.17 TL             100          3000000
1  Önerilen Servisler     4  1495-  Instagram Garantili Takipçi | Max 1M | 365 Gün Telafi Garantili | Hızlı Başlar | 30 Gün Telafi Butonu Aktif         12.07 TL              50          5000000
2  Önerilen Servisler  1526            4513-  Instagram Takipçi | Max 500K | Yabancı Gerçek Kullanıcılar | Düşme Az | Anlık Başlar | Günde 250K         22.28 TL           10000           500000
3  Önerilen Servisler     1            3033-  Instagram Türk Takipçi | Max 25K | %90 Türk   | İptal Butonu Aktif | Anlık Başlar | Saatte 1K-2K         21.49 TL              10            25000
4  Önerilen Servisler  1494         991-  Instagram Çekilişle Takipçi | %100 Organik Türk   | Max 10K | Günlük İşleme Alınır | Günde 5K Atar !         37.50 TL            1000            10000

and saves data.csv.


EDIT: A little explanation of the code above:

  • First I select all data rows (rows that contain neither a table header nor a cell with a colspan= attribute; the text from those colspan cells will become our "Parent" column). This is done with the CSS selector "tr:not(:has(td[colspan], th))".

  • When iterating over these data rows, I need to know what the "Parent" is. For this I use tr.find_previous("td", attrs={"colspan": True}), which selects the closest preceding <td> with a colspan= attribute.

  • I get all the text from the <td> tags in each row and store it in the all_data list.

  • From this list I create a pandas DataFrame.
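The selection logic described above can be sketched on a small hand-made table (the HTML below is invented for illustration; only the structure is assumed to mirror the real page):

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><th>ID</th><th>Servis</th></tr>
<tr><td colspan="2">Önerilen Servisler</td></tr>
<tr><td>158</td><td>Takipçi A</td></tr>
<tr><td colspan="2">Spotify Türk Dinlenme</td></tr>
<tr><td>1039</td><td>Dinlenme B</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
# Keep only rows with neither a <th> nor a colspan cell, i.e. the data rows.
for tr in soup.select("tr:not(:has(td[colspan], th))"):
    # The nearest preceding colspan cell names the category ("Parent").
    parent = tr.find_previous("td", attrs={"colspan": True})
    rows.append([parent.get_text(strip=True),
                 *[td.get_text(strip=True) for td in tr.select("td")]])

print(rows)
# [['Önerilen Servisler', '158', 'Takipçi A'],
#  ['Spotify Türk Dinlenme', '1039', 'Dinlenme B']]
```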

Answered By: Andrej Kesely

Simply adapt the approach from the previous answer and scrape the categories first, so you can map them while scraping the data:

categories = dict((e.get('data-filter-category-id'),e.get('data-filter-category-name')) for e in soup.select('.dropdown-menu a[data-filter-category-name]'))
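On a minimal, invented snippet the category lookup works like this (the attribute names are taken from the one-liner above; the surrounding HTML is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Invented dropdown markup standing in for the real page.
html = """
<div class="dropdown-menu">
  <a data-filter-category-id="1" data-filter-category-name="Önerilen Servisler">link</a>
  <a data-filter-category-id="2" data-filter-category-name="Spotify Türk Dinlenme">link</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Map category id -> category name.
categories = {
    e.get("data-filter-category-id"): e.get("data-filter-category-name")
    for e in soup.select(".dropdown-menu a[data-filter-category-name]")
}
print(categories)
# {'1': 'Önerilen Servisler', '2': 'Spotify Türk Dinlenme'}
```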

Example

from bs4 import BeautifulSoup
import pandas as pd
import requests

url='https://takipcimerkezi.net/services'

soup = BeautifulSoup(
        requests.get(
            url,
            cookies={'user_currency':'27d210f1c3ff7fe5d18b5b41f9b8bb351dd29922d175e2a144af68924e3064d1a%3A2%3A%7Bi%3A0%3Bs%3A13%3A%22user_currency%22%3Bi%3A1%3Bs%3A3%3A%22EUR%22%3B%7D;'}
        ).text
       )

categories = dict((e.get('data-filter-category-id'),e.get('data-filter-category-name')) for e in soup.select('.dropdown-menu a[data-filter-category-name]'))

data =  []

for e in soup.select('#service-tbody tr:has([data-label="Minimum Sipariş"])'):
    d = dict(zip(e.find_previous('thead').stripped_strings,e.stripped_strings))
    d['category'] = categories[e.get('data-filter-table-category-id')] if e.get('data-filter-table-category-id') else None
    data.append(d)
 
pd.DataFrame(data)[['ID',  'category', 'Servis', '1000 adet fiyatı', 'Minimum Sipariş','Maksimum Sipariş']]

Output

        ID               category  Servis                                                                                                              1000 adet fiyatı Minimum Sipariş Maksimum Sipariş
0      158     Önerilen Servisler  3613- Instagram Garantili Takipçi | Max 3M | Ömür Boyu Garantili | Düşüş Çok Az | Anlık Başlar | Günde 150K               ≈ 0.6573 €             100          3000000
1        4     Önerilen Servisler  1495- Instagram Garantili Takipçi | Max 1M | 365 Gün Telafi Garantili | Hızlı Başlar | 30 Gün Telafi Butonu Aktif         ≈ 0.6024 €              50          5000000
...
1326  1039  Spotify Türk Dinlenme  1833-⬆️ Spotify Premium Türk Dinlenme | 5K Tek Paket | Normal                                                             ≈ 4.9778 €            5000             5000
1327  1040  Spotify Türk Dinlenme  1834-⬆️ Spotify Premium Türk Dinlenme | 10K Tek Paket | Normal                                                            ≈ 4.9778 €           10000            10000
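The key trick in this answer is dict(zip(...)): the header texts from the preceding <thead> become the keys, and the cell texts of the current row become the values. A minimal sketch on invented HTML (only the id, data-label, and attribute names are taken from the code above):

```python
from bs4 import BeautifulSoup

# Invented table standing in for the real page structure.
html = """
<table>
  <thead><tr><th>ID</th><th>Servis</th><th>Minimum Sipariş</th></tr></thead>
  <tbody id="service-tbody">
    <tr data-filter-table-category-id="1">
      <td>158</td><td>Takipçi</td><td data-label="Minimum Sipariş">100</td>
    </tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

row = soup.select_one('#service-tbody tr:has([data-label="Minimum Sipariş"])')
# Pair each header cell text with the corresponding data cell text.
d = dict(zip(row.find_previous("thead").stripped_strings, row.stripped_strings))
print(d)
# {'ID': '158', 'Servis': 'Takipçi', 'Minimum Sipariş': '100'}
```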
Answered By: HedgeHog