Scrapy returning an empty object

Question:

I am using CSS selectors and keep getting a response with empty values. Here is the code:

import scrapy

class WebSpider(scrapy.Spider):
    name = 'activities'
    start_urls = [
        'http://capetown.travel/events/'
    ]

    def parse(self, response):
        all_div_activities = response.css("div.tribe-events-content")#gdlr-core-pbf-column gdlr-core-column-60 gdlr-core-column-first
        title = all_div_activities.css("h2.tribe-events-list-event-title::text").extract()#gdlr-core-text-box-item-content
        price = all_div_activities.css(".span.ticket-cost::text").extract()
        details = all_div_activities.css(".p::text").extract()
        yield {
            'title':title,
            'price':price,
            'details':details
        }
Asked By: N.King

||

Answers:

In your code you select all events at once, so response.css() returns a SelectorList covering every event on the page. Calling .css(...).extract() on that whole list gives you one flat list per field, so you can’t tell which title, price and details belong to the same event.

This is why you’re not getting the data you want. You need a for loop over each event on the page; in your case, that means looping over all_div_activities.

Code for Script

def parse(self, response):
    all_div_activities = response.css('div.tribe-events-event-content')
    for a in all_div_activities:
        title = a.css('a.tribe-event-url::text').get()

        # Some events have no price element, so fall back to a placeholder.
        if a.css('span.ticket-cost::text'):
            price = a.css('span.ticket-cost::text').get()
        else:
            price = 'No price'

        details = a.css('div[class*="tribe-events-list-event-description"] > p::text').get()

        yield {
            # get() returns None when nothing matches, so guard before strip().
            'title': title.strip() if title else title,
            'price': price,
            'details': details
        }

Notes

  1. An if statement is used for price because some events have no price element at all, so supplying a placeholder value is a good idea.
  2. strip() is called on title when yielding the dictionary because the title comes back with surrounding spaces and a newline (\n) attached.
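
The two notes above can be folded into one small helper. This is only a sketch: clean_text is a hypothetical name, not part of Scrapy, and it assumes the value comes from get(), which returns None when nothing matches.

```python
def clean_text(value, default=None):
    """Strip surrounding whitespace; fall back to `default` for None or blank text.

    `clean_text` is a hypothetical helper for illustration, not a Scrapy API.
    """
    if value is None:
        return default
    stripped = value.strip()
    return stripped if stripped else default


# A title as it might come back from the page, padded with newlines and spaces.
print(clean_text("\n    Some Event \n"))
# A missing price: get() returned None, so the placeholder is used instead.
print(clean_text(None, default="No price"))
```

With a helper like this, the if/else for price and the strip() on title could both be written as a single call per field.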

Advice

As a minor point, the Scrapy docs suggest using the get() and getall() methods rather than extract_first() and extract(). With extract() it’s not always obvious whether the output will be a list or a single string: it returns a list when called on a SelectorList and a string when called on a single Selector. In this case the output I got was a list. This is why the Scrapy docs suggest get() and getall() instead; they are also a bit more compact. get() always returns a single result (a string, or None if nothing matched), and getall() always returns a list. This also meant that I could strip the newlines and spaces from the title, as you can see in the code above.

Another tip: if the class attribute is quite long, use a *= (substring) attribute selector, as long as the partial value you pick uniquely identifies the data you want. See here for a bit more detail.

Using items instead of yielding a dictionary may be better in the long run, as you can set default values for data that is present for some events on the page you’re scraping and missing for others. You have to do this through a pipeline (again, if you don’t understand this, don’t worry). See the docs for items and here for a bit more on items.
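
As a rough sketch of the pipeline idea: an item pipeline in Scrapy is a plain class with a process_item(self, item, spider) method. The class name DefaultValuesPipeline and the default values below are assumptions made for illustration, and the example works on plain dictionary items like the ones yielded above rather than on Item classes.

```python
class DefaultValuesPipeline:
    """Fill in missing fields on dict items (hypothetical example)."""

    # Assumed defaults; adjust to whatever fields your spider yields.
    defaults = {"price": "No price", "details": ""}

    def process_item(self, item, spider):
        for field, fallback in self.defaults.items():
            # Treat both an absent key and an explicit None as "missing".
            if item.get(field) is None:
                item[field] = fallback
        return item


# Quick check outside Scrapy: the spider argument is unused here, so None is fine.
item = {"title": "Some Event", "price": None, "details": None}
result = DefaultValuesPipeline().process_item(item, spider=None)
print(result)
```

Inside a project, a pipeline like this would be enabled through the ITEM_PIPELINES setting in settings.py, using the import path of wherever you place the class.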

Answered By: AaronS

Here is my version. I hope it helps.

for item in response.css('div.tribe-events-event-content'): 
    print(item.css('a.tribe-event-url::text').get()) 
    print(item.css('span.ticket-cost::text').get()) 
    print(item.css('p::text').get()) 

Thanks.

Answered By: Samsul Islam

Here are some steps to fix your code:

  • A period before a name denotes an element’s class name, NOT the HTML tag itself. So change .span.ticket-cost::text –> span.ticket-cost::text
    Also .p::text –> p::text.
  • You are evidently trying to get a string, so use the get() method instead of extract(), which returns a list.
  • Make sure to use > when the desired text is inside a direct child of the element you’ve selected.
  • Finally, here is a CSS selector reference: https://www.w3schools.com/cssref/css_selectors.asp
Answered By: Khaled Badawy