Working with BeautifulSoup: defining the entities for getting all the data of the target page (perhaps pandas would solve this even better)

Question:

I am in the middle of a task with BeautifulSoup, the awesome Python library for all things scraping. The aim: I want to get the data out of this page: https://schulfinder.kultus-bw.de. Note: it is a public page for finding all schools in a certain region.

A typical dataset looks like this:

Adresse Name
Adresse 2
Kategorie
Straße
PLZ und Ort
Tel 1
Tel 2
Mail 

With Python, I think I would proceed like this.

First, I send a request to the URL and get the page's HTML content:

url = 'https://schulfinder.kultus-bw.de'
response = requests.get(url)
html_content = response.content

Next, I create a BeautifulSoup object and find the HTML elements that contain the school names:

soup = BeautifulSoup(html_content, 'html.parser')
schools = soup.find_all('a', {'class': 'dropdown-item'})

Then I extract the school names from the HTML elements and store them in a list:

school_names = [school.text.strip() for school in schools]

Finally, I print the list of school names:

print(school_names)

The complete code would look like this:

import requests
from bs4 import BeautifulSoup

url = 'https://schulfinder.kultus-bw.de'
response = requests.get(url)
html_content = response.content

soup = BeautifulSoup(html_content, 'html.parser')
schools = soup.find_all('a', {'class': 'dropdown-item'})
school_names = [school.text.strip() for school in schools]

print(school_names)

But I need the complete dataset:

Adresse Name
Adresse 2
Kategorie
Straße
PLZ und Ort
Tel 1
Tel 2
Mail 

Ideally the output would be in CSV format. If I were a bit more familiar with Python, I would work with pandas; I guess pandas makes this kind of thing much easier.
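
As a rough illustration of the pandas route, the sketch below builds a DataFrame from a list of records and serializes it as CSV. The record values are invented placeholders and the column names simply mirror the field list above; none of this is taken from the real page.

```python
import pandas as pd

# Hypothetical record: the keys follow the field list in the question,
# the values are made-up placeholders, not real school data.
records = [
    {
        "Adresse Name": "Beispielschule",
        "Adresse 2": "",
        "Kategorie": "Grundschule",
        "Straße": "Musterstr. 1",
        "PLZ und Ort": "70000 Musterstadt",
        "Tel 1": "+49 000 000000",
        "Tel 2": "",
        "Mail": "info@example.org",
    },
]

df = pd.DataFrame(records)

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a file name instead, for example df.to_csv('schools.csv', index=False), writes the same content to disk.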

Update 2: I tried to run the code in Google Colab and got the errors shown further down.
Question: do I need to install some of these packages in Colab?

import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool
from string import ascii_lowercase as chars
from itertools import product

Do I need to take care of any preliminaries in Google Colab?

This is the error log that I got:
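
One way to approach the package question is to check which imports can be resolved in the current environment. The sketch below uses the standard-library importlib.util.find_spec; the helper name missing_packages and the list of package names are assumptions made for illustration. Since the error log in this question already shows pandas and tqdm running, missing packages are probably not the cause of the error.

```python
import importlib.util

# Third-party packages used by the script (assumed list);
# multiprocessing, string and itertools ship with Python itself.
required = ["pandas", "tqdm", "requests"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(required)
print("missing packages:", missing)
```

Any name that shows up in the list can be installed in a notebook cell with pip, for example !pip install tqdm.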

100%|██████████| 676/676 [00:00<00:00, 381711.03it/s]
0it [00:00, ?it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

5 frames
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'branches'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'branches'

End of the error log from Google Colab.

Below are the errors that I got from Anaconda at home:

100%|██████████| 676/676 [00:00<00:00, 9586.24it/s]
0it [00:00, ?it/s]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3628             try:
-> 3629                 return self._engine.get_loc(casted_key)
   3630             except KeyError as err:

~/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'branches'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_27106/2163647892.py in <module>
     36     df = pd.DataFrame(all_data)
     37 
---> 38     df = df.explode('branches')
     39     df = df.explode('trades')
     40     df = pd.concat([df, df.pop('branches').apply(pd.Series).add_prefix('branch_')], axis=1)

~/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in explode(self, column, ignore_index)
   8346         df = self.reset_index(drop=True)
   8347         if len(columns) == 1:
-> 8348             result = df[columns[0]].explode()
   8349         else:
   8350             mylen = lambda x: len(x) if is_list_like(x) else -1

~/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3503             if self.columns.nlevels > 1:
   3504                 return self._getitem_multilevel(key)
-> 3505             indexer = self.columns.get_loc(key)
   3506             if is_integer(indexer):
   3507                 indexer = [indexer]

~/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3629                 return self._engine.get_loc(casted_key)
   3630             except KeyError as err:
-> 3631                 raise KeyError(key) from err
   3632             except TypeError:
   3633                 # If we have a listlike key, _check_indexing_error will raise

KeyError: 'branches'

Conclusion: I am trying to find out more; I am eager to get more insights and to run the code.

Many thanks for all the help, and for the encouragement to dive into all things Python. This is awesome.
Have a great day.

Asked By: malaga

||

Answers:

You can try this: when you enter aa and click "Suchen", the server returns all items that contain "aa". So you can try all two-letter combinations (aa, ab, ac, …) to collect all school IDs and then fetch the details for each school:

import requests
import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool
from string import ascii_lowercase as chars
from itertools import product

api_url1 = 'https://schulfinder.kultus-bw.de/api/schools?distance=&outposts=1&owner=&school_kind=&term={term}&types=&work_schedule='
api_url2 = 'https://schulfinder.kultus-bw.de/api/school?uuid={uuid}'

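def get_school(term):
    # Return the list of schools matching the search term, or [] if the
    # request fails or the response is not valid JSON.
    try:
        return requests.get(api_url1.format(term=term)).json()
    except (requests.RequestException, ValueError):
        return []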

def get_school_detail(uuid):
    return requests.get(api_url2.format(uuid=uuid)).json()

if __name__ == '__main__':
    l = [''.join(t) for t in product(chars, chars)]
    # you can also try all 3-character combinations (this yields 4476 results, but the first step takes longer)
    # l = [''.join(t) for t in product(chars, chars, chars)]

    all_data = []
    all_uuids = set()

    with Pool(processes=8) as pool:
        for result in tqdm(pool.imap_unordered(get_school, l), total=len(l)):
            for item in result:
                all_uuids.add(item['uuid'])

    with Pool(processes=16) as pool:
        for r in tqdm(pool.imap_unordered(get_school_detail, all_uuids), total=len(all_uuids)):
            all_data.append(r)

    df = pd.DataFrame(all_data)

    df = df.explode('branches')
    df = df.explode('trades')
    df = pd.concat([df, df.pop('branches').apply(pd.Series).add_prefix('branch_')], axis=1)
    df = pd.concat([df, df.pop('trades').apply(pd.Series).add_prefix('trade_')], axis=1)

    print(df.head())

    df.to_csv('data.csv', index=False)

This gets info about all 4461 schools and saves the data to data.csv:

100%|██████████| 676/676 [00:38<00:00, 17.63it/s]
100%|██████████| 4461/4461 [00:22<00:00, 194.86it/s]
  outpost_number                                                 name               street house_number postcode        city            phone              fax                              email                                   website tablet_tranche tablet_platform tablet_branches tablet_trades       lat      lng  official  branch_branch_id branch_acronym branch_description_long  trade_0 trade_trade_id trade_description
0              0  Schule am Schlosspark Realschule und Werkrealschule  Schussenrieder Str.           25    88326   Aulendorf   +4975259238102   +4975259238104  [email protected]         http://www.schuleamschlosspark.de           None            None            None          None  47.95760  9.63881         0             15110             RS              Realschule      NaN            NaN               NaN
0              0  Schule am Schlosspark Realschule und Werkrealschule  Schussenrieder Str.           25    88326   Aulendorf   +4975259238102   +4975259238104  [email protected]         http://www.schuleamschlosspark.de           None            None            None          None  47.95760  9.63881         0             14210            WRS          Werkrealschule      NaN            NaN               NaN
1              0              Schauenburg-Schule Grundschule Urloffen      Schauenburgstr.            4    77767  Appenweier     +49780597236    +497805914396  [email protected]  http://www.schauenburgschule-urloffen.de           None            None            None          None  48.56460  7.97361         0             12110             GS             Grundschule      NaN            NaN               NaN
2              0                      Klosterwiesenschule Grundschule            Boschstr.            1    88255      Baindt  +49750294114132  +49750294114139  [email protected]               http://www.baindt.de/schule           None            None            None          None  47.84319  9.65829         0             12110             GS             Grundschule      NaN            NaN               NaN
3              0                       Montessori-Grundschule Nußdorf          Zum Laugele            7    88662  Überlingen     +49755165620             None  [email protected]        http://www.grundschule-nussdorf.de           None            None            None          None  47.75325  9.19516         0             12110             GS             Grundschule      NaN            NaN               NaN

...
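
The KeyError: 'branches' reported in the question comes with a second progress bar of 0it, which suggests that no school IDs were collected, so all_data was empty and the DataFrame had no columns at all. Why the first step returned nothing there is not visible from the log; failing requests swallowed by the except branch are one possibility, but that is an assumption. The sketch below shows a defensive version of the explode step: the helper name explode_nested is hypothetical, and the sample row is invented to mimic the shape of the API data.

```python
import pandas as pd

def explode_nested(df, column, prefix):
    """Explode a list-of-dicts column and expand the dicts into prefixed columns.

    Returns df unchanged when the column is missing, e.g. for an empty DataFrame.
    """
    if column not in df.columns:
        return df
    # One row per list element; a fresh index keeps the concat below aligned.
    df = df.explode(column).reset_index(drop=True)
    expanded = df.pop(column).apply(pd.Series).add_prefix(prefix)
    return pd.concat([df, expanded], axis=1)

# Empty input no longer raises KeyError.
empty = explode_nested(pd.DataFrame([]), "branches", "branch_")
print(empty.empty)

# Invented sample row with two branches.
sample = pd.DataFrame([
    {"name": "A", "branches": [{"branch_id": 1, "acronym": "GS"},
                               {"branch_id": 2, "acronym": "RS"}]},
])
result = explode_nested(sample, "branches", "branch_")
print(result)
```

Checking len(all_uuids) before the second step would reveal the same problem earlier, before any DataFrame is built.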

Answered By: Andrej Kesely