How to get more records from Google Drive using BeautifulSoup?

Question:

from bs4 import BeautifulSoup, SoupStrainer
import pandas as pd
from urllib import request

websiteResponse = request.urlopen("https://drive.google.com/drive/folders/1N57pzcIWBbrJkze-6AILyegJ682PARYa")
folders = BeautifulSoup(websiteResponse, "html.parser", parse_only=SoupStrainer('div', attrs={'class':'WYuW0e RDfNAe Ss7qXc'}))
links = []
for a in folders:
    links.append("https://drive.google.com/drive/folders/"+a['data-id'])
    print("https://drive.google.com/drive/folders/"+a['data-id'])
        
df = pd.DataFrame({'Link': links})
df.to_csv('links.csv', index=False)

Hey everyone, I want to scrape data from Google Drive. There are around 500 folders, and inside each folder there are images; I only need the folder URLs. But when I run the code above, it fetches only 50 records.
There is no pagination on the Google Drive page: when I scroll to the end of the page, it loads more records.

Asked By: Jamshid Ali


Answers:

Google Drive renders the folder list with JavaScript and lazy-loads more rows as you scroll, so a plain urllib request only ever receives the first batch of about 50 entries. Driving a real browser with Selenium makes the remaining rows load. Run this code:

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pandas as pd

# Install a ChromeDriver that matches the local Chrome version
import chromedriver_autoinstaller as chromedriver
chromedriver.install()

# Launch a web browser
driver = webdriver.Chrome()
links = []

# Navigate to the website
driver.get("https://drive.google.com/drive/folders/1N57pzcIWBbrJkze-6AILyegJ682PARYa")

# Scroll to the end of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Wait for the page to load. Drive only fetches more rows as the list is
# scrolled, so during these 20 seconds scroll manually to the bottom of the
# browser window that this script opened.
time.sleep(20)

# Retrieve the updated HTML source code
html_source = driver.page_source

# Parse the HTML source code using Beautiful Soup
soup = BeautifulSoup(html_source, "html.parser")

# Extract all elements with a data-id attribute
elements = soup.find_all("div", attrs={"data-id": True})

# Print each data-id and build the folder links
for counter, element in enumerate(elements, start=1):
    print(counter, element.get("data-id"))
    links.append("https://drive.google.com/drive/folders/" + element.get("data-id"))

df = pd.DataFrame({'Links':links}) 
df.to_csv('test.csv', index=False)
# Shut down the browser and the ChromeDriver process
driver.quit()
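
A possible refinement, if you would rather not scroll by hand: keep scrolling programmatically until the number of rows stops growing. Drive renders the file list inside an inner scrollable container, so scrolling the window often does nothing, while scrolling the last visible row into view usually triggers the next batch. This is a sketch, not a verified recipe: the data-id attribute and the folder URL prefix come from the question, but the selector, the timings, and the stop condition are assumptions.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import pandas as pd

driver = webdriver.Chrome()
driver.get("https://drive.google.com/drive/folders/1N57pzcIWBbrJkze-6AILyegJ682PARYa")
time.sleep(5)  # let the first batch of rows render

seen = 0
rows = []
while True:
    # Assumption: every div carrying a data-id attribute is one folder row
    rows = driver.find_elements(By.CSS_SELECTOR, "div[data-id]")
    if len(rows) == seen:
        break  # no new rows appeared since the last pass; we are done
    seen = len(rows)
    # Scrolling the last row into view nudges Drive to lazy-load the next batch
    driver.execute_script("arguments[0].scrollIntoView();", rows[-1])
    time.sleep(2)  # give the lazy loader time to fetch more rows

links = ["https://drive.google.com/drive/folders/" + r.get_attribute("data-id")
         for r in rows]
pd.DataFrame({'Links': links}).to_csv('links.csv', index=False)
driver.quit()

Because the rows are read directly through Selenium, BeautifulSoup is not needed in this variant; the scraping and the scrolling share the same live DOM.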

Answered By: Jamshid Ali