BeautifulSoup getting href of a list – need to simplify the script – in order to run in Colab

Question

I have the following soup:

next … From this I want to extract the href, "some_url"
and the whole list of the pages that are listed on this page:

https://www.catholic-hierarchy.org/diocese/laa.html
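
The markup itself is elided above, so the following is only a minimal sketch of pulling `href` values out of a soup. The HTML string and the file name `page2.html` are made up for illustration; only "some_url" and the page URL come from the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the elided snippet.
html = (
    '<ul>'
    '<li><a href="some_url">next</a></li>'
    '<li><a href="page2.html">another page</a></li>'
    '<li><a>anchor without href</a></li>'
    '</ul>'
)
base_url = "https://www.catholic-hierarchy.org/diocese/laa.html"

soup = BeautifulSoup(html, "html.parser")

# href of the first <a> element
first_href = soup.select_one("a")["href"]

# href of every <a> that actually has one
all_hrefs = [a["href"] for a in soup.select("a[href]")]

# relative links resolved against the page they were found on
absolute = [urljoin(base_url, h) for h in all_hrefs]
```

The `a[href]` selector skips anchors that have no `href` attribute, so indexing with `["href"]` does not raise a `KeyError`.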

Note: the page links to a large number of sub-pages, which I need to parse. At the moment I am trying to get all the data out of them: dioceses, URLs, descriptions, contact data, etc.

The following is one way of getting that information asynchronously (it should work in Colab notebooks). I got the diocese URLs from a different part of the site (Structured View, World Regions). I would expect the diocese count there to match the count from the letters list.

from httpx import Client, AsyncClient, Limits
from bs4 import BeautifulSoup as bs
import pandas as pd
import re
from datetime import datetime
import asyncio
import nest_asyncio

nest_asyncio.apply()

headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

big_df_list = []

def all_dioceses():
    dioceses = []
    root_links = [f'https://www.catholic-hierarchy.org/diocese/qview{x}.html' for x in range(1, 8)]
    with Client(headers=headers, timeout=60.0, follow_redirects=True) as client:
        for x in root_links:
            r = client.get(x)
            soup = bs(r.text)
            soup.select_one('ul#menu2').decompose()
            for link in soup.select('ul > li > a'):
                dioceses.append('https://www.catholic-hierarchy.org/diocese/' + link.get('href'))
    return dioceses
# print(all_dioceses())

async def get_diocese_info(url):
    async with AsyncClient(headers=headers, timeout=60.0, follow_redirects=True) as client:
        try:
            r = await client.get(url)
            soup = bs(r.text)
            d_name = soup.select_one('h1[align="center"]').get_text(strip=True)
            info_table = soup.select_one('div[id="d1"] > table')
            d_bishops = ' | '.join([x.get_text(strip=True) for x in info_table.select('td')[0].select('li')])
            d_extra_info = ' | '.join([x.get_text(strip=True) for x in info_table.select('td')[1].select('li')])
            big_df_list.append((d_name, d_bishops, d_extra_info, url))
            print('done', d_name)
        except Exception as e:
            print(url, e)

async def scrape_dioceses():
    start_time = datetime.now()
    tasks = asyncio.Queue()
    for x in all_dioceses():
        tasks.put_nowait(get_diocese_info(x))

    async def worker():
        while not tasks.empty():
            await tasks.get_nowait()
            
    await asyncio.gather(*[worker() for _ in range(100)])
    end_time = datetime.now()
    duration = end_time - start_time
    print('diocese scraping took', duration)

asyncio.run(scrape_dioceses())
df = pd.DataFrame(big_df_list, columns = ['Name', 'Bishops', 'Info', 'Url'])
print(df)
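
The script above starts 100 workers at once. As a hedged alternative, concurrency can be capped with an `asyncio.Semaphore`. The helper name `gather_limited` and the default limit of 10 are assumptions for illustration, not part of the original script.

```python
import asyncio


async def gather_limited(coros, limit=10):
    """Await all coroutines, running at most `limit` of them at a time.

    Results come back in the same order as `coros`.
    """
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        # Each coroutine waits here until one of the `limit` slots is free.
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

With the functions from the script, this could be called as `await gather_limited([get_diocese_info(u) for u in all_dioceses()])`. In a notebook cell, top-level `await` is normally available, which would also make `nest_asyncio` and `asyncio.run` unnecessary.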

This should lead to the following results:

done Eparchy of Mississauga (Syro-Malabar)
done Eparchy of Mar Addai of Toronto (Chaldean)
done Eparchy of Saint-Sauveur de Montréal (Melkite Greek)
done Diocese of Calgary
done Archdiocese of Winnipeg
[...]
diocese scraping took 0:03:02.366096

Name    Bishops Info    Url
0   Eparchy of Mississauga (Syro-Malabar)   JoseKalluvelil, Bishop  Type of Jurisdiction: Eparchy | Elevated:22 December2018 | Immediately Subject to the Holy See | Syro-Malabar Catholic Church of the Chaldean Tradition | Country:Canada | Mailing Address: Syro-Malabar Apostolic Exarchate, 6630 Turner Valley Rd., Mississauga, ON L5V 2P1, Canada | Telephone: (905)858-8200 | Fax: 858-8208    https://www.catholic-hierarchy.org/diocese/dmism.html
1   Eparchy of Mar Addai of Toronto (Chaldean)  Robert SaeedJarjis, Bishop | Bawai (Ashur)Soro, Bishop Emeritus Type of Jurisdiction: Eparchy | Erected:10 June2011 | Immediately Subject to the Holy See | Chaldean Catholic Church of the Chaldean Tradition | Country:Canada | Conference Region:Ontario | Mailing Address: 2 High Meadow Place, Toronto, ON M9L 2Z5, Canada | Telephone: (416)746-5816 | Fax: 746-5850  https://www.catholic-hierarchy.org/diocese/dtoch.html
2   Eparchy of Saint-Sauveur de Montréal (Melkite Greek)    MiladJawish, B.S., Bishop   Type of Jurisdiction: Eparchy | Elevated:1 September1984 | Immediately Subject to the Holy See | Melkite Greek Catholic Church of the Byzantine Tradition | Country:Canada | Conference Region:Quebec | Web Site:http://www.melkite.com/ | Mailing Address: 10025 boul. de l'Arcadie, Montreal, QC H4N 2S1, Canada | Telephone: (514)272.6430 | Fax: 202.1274   https://www.catholic-hierarchy.org/diocese/dmome.html

Note: I am not able to run this on Colab. How can I simplify the code so that it runs there?

I get the following error when running it in Colab:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-64bb145c85bf> in <module>
----> 1 from httpx import Client, AsyncClient, Limits
      2 from bs4 import BeautifulSoup as bs
      3 import pandas as pd
      4 import re
      5 from datetime import datetime
ModuleNotFoundError: No module named 'httpx'

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------


Mauro Martins suggested running the command below, but I am not a Pro user on Colab. So the question is: how can I simplify this so that it runs on an ordinary Colab account?

!pip install httpx nest_asyncio

Try running this code before your script.

Many thanks for the quick reply. I understand your approach, but I need a Pro account on Colab, which I do not have. So the question is: can I simplify the script so that it runs on a general Colab account without any issues?
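
One possible simplification, offered as a sketch rather than a tested replacement: drop `httpx` and `asyncio` and use `requests` with a small thread pool. The selectors are copied from the script above. The function names, the shortened User-Agent string, and the worker count of 8 are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}  # shortened, illustrative value


def parse_diocese(html, url):
    """Parse one diocese page into a (name, bishops, info, url) tuple."""
    soup = BeautifulSoup(html, "html.parser")
    name = soup.select_one('h1[align="center"]').get_text(strip=True)
    cells = soup.select_one('div[id="d1"] > table').select("td")
    bishops = " | ".join(li.get_text(strip=True) for li in cells[0].select("li"))
    info = " | ".join(li.get_text(strip=True) for li in cells[1].select("li"))
    return name, bishops, info, url


def fetch_diocese(url):
    """Download one page and parse it; return None if anything fails."""
    try:
        r = requests.get(url, headers=HEADERS, timeout=60)
        r.raise_for_status()
        return parse_diocese(r.text, url)
    except Exception as e:
        print(url, e)
        return None


def scrape(urls, max_workers=8):
    """Fetch all URLs with a small thread pool and keep the successful rows."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [row for row in pool.map(fetch_diocese, urls) if row is not None]
```

The rows returned by `scrape(...)` can be passed to `pd.DataFrame(rows, columns=['Name', 'Bishops', 'Info', 'Url'])` as in the original script. Keeping the parsing in `parse_diocese` separate from the download makes it possible to check the selectors against a saved page without any network access.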

Many thanks, Mauro Martins Júnior. This command helped:

!pip install httpx nest_asyncio

Update: thanks to Mauro Martins I have learned how to install packages in Colab:

How do I install Python packages in Google’s Colab?


In a project I have, for example, two different packages. How can I use
setup.py to install these two packages in Google's Colab so that
I can import them?

See the answer:

You can use !python setup.py install to do that. Colab is just like a Jupyter
notebook. Therefore, we can use the ! operator here to install any
package in Colab. What ! actually does is tell the notebook cell
that this line is not Python code but a command-line script. So, to
run any command-line script in Colab, just add a ! before the line.
For example: !pip install tensorflow. This will treat that line (here
pip install tensorflow) as a command-prompt line and not as Python
code. If you do this without the ! before the line,
it will throw an error saying "invalid syntax". Keep in mind that
you will have to upload the setup.py file to your drive before doing
this (preferably into the same folder as your notebook).

And also the conda environment:

conda environment in google colab [google-colaboratory]

I am trying to create a conda environment in a Google Colab notebook. I
successfully installed conda with the following commands:

!wget -c https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh

!chmod +x Anaconda3-5.1.0-Linux-x86_64.sh

!bash ./Anaconda3-5.1.0-Linux-x86_64.sh -b -f -p /usr/local

The default Python used by the system is now Python 3.6.4 :: Anaconda, Inc.

I am trying to create an environment in conda with conda env create -f environment.yml

Every package was installed successfully, but the problem now is that I am not able to activate this environment. I tried source activate myenv, but that did not work either.

After running conda env list I got two environments:

base * /usr/local

myenv /usr/local/envs/myenv

Asked By: thannen


Answer 1

It seems that some libraries are missing. I tried running the script in Colab, and after running the command below it worked fine.

!pip install httpx nest_asyncio

Try running this code before your script.

Note that the exclamation point is part of the command.
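
To see which of the two libraries are actually missing before installing anything, a small check along these lines may help. The helper name `missing_modules` is made up for this sketch; `!pip install` itself works on a free Colab account.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_modules(["httpx", "nest_asyncio"])
if missing:
    # Print the notebook command to run in its own cell.
    print("Install first with: !pip install " + " ".join(missing))
```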

Answered By: Mauro Martins Júnior
