web-crawler

Separating tag attributes as a dictionary

Question: My input (a string variable): <a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a> My expected output: { 'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki', } How can I do this with Python, without using the BeautifulSoup library? Please show how with the lxml library. Asked By: Sardar || Source …

Total answers: 3
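One way the question above can be answered with lxml alone is via an element's attrib mapping; a minimal sketch, with tag_to_dict as a hypothetical helper name:

```python
from lxml import html

def tag_to_dict(markup: str) -> dict:
    # Parse the fragment; with a single top-level tag this returns that element.
    element = html.fragment_fromstring(markup)
    # element.attrib is a dict-like mapping of the tag's attributes.
    result = dict(element.attrib)
    # Add the tag's inner text under its own key, as the asker expects.
    result["text"] = element.text
    return result

print(tag_to_dict('<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'))
# {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
```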

Scrapy extracting entire HTML element instead of following link

Question: I'm trying to access or follow every link that appears for commercial contractors on this website: https://lslbc.louisiana.gov/contractor-search/search-type-contractor/ and then extract the emails from the sites each link leads to, but when I run this script, Scrapy follows the base URL with the entire HTML element …

Total answers: 1

scrapy get tag a attribute values of rel

Question: Types of a tags: <a rel="sponsored" href="https://cheese.example.com/Appenzeller_cheese">Appenzeller</a> or <a rel="ugc" href="https://cheese.example.com/Appenzeller_cheese">Appenzeller</a> with one or more of the following values: rel="sponsored", rel="ugc", or rel="ugc nofollow noreferrer". Apparently, Scrapy only supports the following value (just "nofollow"): <a rel="nofollow" href="https://cheese.example.com/Appenzeller_cheese">Appenzeller</a> How can I get the other values (like: ugc, …

Total answers: 1

News Article is not being scraped in h2 class

Question: I'm working on a project where I've been assigned to scrape all news articles from a website: https://asia.nikkei.com/Spotlight/Podcast. It mainly has two classes to scrape news articles from: h2 (the giant card that says "Asia Stream: Shinzo Abe's Assassination and Legacy") and h4. With my code, I've …

Total answers: 1
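When articles live under two different heading tags, BeautifulSoup's find_all accepts a list of tag names, so both can be collected in one pass. A small sketch with made-up sample markup:

```python
from bs4 import BeautifulSoup

def headlines(html):
    soup = BeautifulSoup(html, "html.parser")
    # find_all with a list of names matches h2 and h4 cards alike,
    # in document order, instead of scraping only one heading level.
    return [h.get_text(strip=True) for h in soup.find_all(["h2", "h4"])]

sample = "<h2><a href='/a1'>Big card</a></h2><h4><a href='/a2'>Small card</a></h4>"
print(headlines(sample))  # ['Big card', 'Small card']
```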

Python when a list of dictionaries includes different keys, how to use if else? (key error)

Question: I have a list of dictionaries like this. Some entries contain both a first name and a last name, and some include only a first name: ['info': {'id': 'abc', 'age': 23, 'firstname': 'tom', 'lastname': 'don', 'phone': 1324}] ['info': {'id': 'cde', 'age': 24, 'firstname': 'sara', 'lastname': 'man', …

Total answers: 3
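The usual fix for this KeyError is dict.get, which returns a default instead of raising when a key is missing. A sketch assuming the data is a list of dicts (the bracket syntax in the excerpt is not valid Python as quoted), with full_names as a hypothetical helper:

```python
records = [
    {"info": {"id": "abc", "age": 23, "firstname": "tom", "lastname": "don", "phone": 1324}},
    {"info": {"id": "cde", "age": 24, "firstname": "sara", "phone": 5678}},
]

def full_names(records):
    names = []
    for record in records:
        info = record["info"]
        # .get returns None instead of raising KeyError when 'lastname' is absent.
        last = info.get("lastname")
        names.append(info["firstname"] + " " + last if last else info["firstname"])
    return names

print(full_names(records))  # ['tom don', 'sara']
```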

BeautifulSoup: how to get all article links from this link?

Question: I want to get all article links from "https://www.cnnindonesia.com/search?query=covid" Here is my code: links = [] base_url = requests.get(f"https://www.cnnindonesia.com/search?query=covid") soup = bs(base_url.text, 'html.parser') cont = soup.find_all('div', class_='container') for l in cont: l_cont = l.find_all('div', class_='l_content') for bf in l_cont: bf_cont = bf.find_all('div', class_='box feed') …

Total answers: 2
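One likely pitfall in the code above is 'box feed': that is two CSS classes, and per the BeautifulSoup docs a multi-class search is more reliably done with a CSS selector than with class_='box feed'. A sketch of the parsing step against made-up markup (the real page's structure is not verified here):

```python
from bs4 import BeautifulSoup

def extract_links(html):
    soup = BeautifulSoup(html, "html.parser")
    # .box.feed matches elements carrying both classes, regardless of
    # their order or of any extra classes on the element.
    return [a["href"] for a in soup.select("div.box.feed a[href]")]

sample = '<div class="box feed"><article><a href="https://example.com/article-1">t</a></article></div>'
print(extract_links(sample))  # ['https://example.com/article-1']
```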

Saving images while crawling website in Selenium

Question: I would like to download images like those that can be found on this page. I need to download all of the images, each one once. Here's the code I'm using: links = [] wait = WebDriverWait(driver, 5) all_images = wait.until( EC.presence_of_all_elements_located((By.XPATH, "//div[contains(@class,'swiper-button-next swiper-button-white')]"))) for image in …

Total answers: 1
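The XPath in the excerpt targets the swiper's navigation buttons rather than the images. A sketch of a different approach: locate the <img> elements and de-duplicate their src URLs before saving. The div.swiper-slide img selector and the gallery URL are assumptions, not taken from the actual page:

```python
def unique_sources(srcs):
    # Preserve order while dropping repeats so each image is saved once.
    seen = set()
    return [s for s in srcs if not (s in seen or seen.add(s))]

def save_gallery_images(url, timeout=5):
    # Selenium imports are local so unique_sources stays usable without a browser.
    import urllib.request
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Target the <img> elements themselves, not the swiper buttons.
        images = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.swiper-slide img"))
        )
        for src in unique_sources(img.get_attribute("src") for img in images):
            if src:
                urllib.request.urlretrieve(src, src.rsplit("/", 1)[-1])
    finally:
        driver.quit()
```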

How to do a task after scraping all the pages of website using Scrapy-Python

Question: I want to perform a task after my scraper has scraped all the anchors on the home page of a website, but the print statement executes before parse_details has processed all the pages. Any help would be appreciated. Thanks in …

Total answers: 1

Scrapy run crawl after another

Question: I'm quite new to web scraping. I'm trying to crawl a novel reader website to get the novel info and chapter content, so I do it by creating 2 spiders: one to fetch the novel information and another to fetch the content of each chapter import scrapy class …

Total answers: 1

scrapy returning an empty object

Question: I am using a CSS selector and continually get a response with empty values. Here is the code: import scrapy class WebSpider(scrapy.Spider): name = 'activities' start_urls = [ 'http://capetown.travel/events/' ] def parse(self, response): all_div_activities = response.css("div.tribe-events-content") #gdlr-core-pbf-column gdlr-core-column-60 gdlr-core-column-first title = all_div_activities.css("h2.tribe-events-list-event-title::text").extract() #gdlr-core-text-box-item-content price = all_div_activities.css(".span.ticket-cost::text").extract() details = all_div_activities.css(".p::text").extract() yield { 'title':title, …

Total answers: 3