My scraping code skips new lines – Scrapy

Question:

I have this code to scrape review text from IMDb. I want to retrieve the entire text of each review, but everything after the first line break is dropped. For example, given this review:

Saw an early screening tonight in Denver.

I don’t know where to begin. So I will start at the weakest link. The
acting. Still great, but any passable actor could have been given any
of the major roles and done a great job.

the code retrieves only:

Saw an early screening tonight in Denver.

Here is my code:

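# Imports assumed by this snippet (not shown in the original post)
import numpy as np
import pandas as pd
from scrapy.selector import Selector
from selenium.webdriver.common.by import By
from tqdm import tqdm

# driver is an existing Selenium WebDriver with the reviews page already loaded
reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')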
first_review = reviews[0]
sel2 = Selector(text = first_review.get_attribute('innerHTML'))

rating_list = []
review_date_list = []
review_title_list = []
author_list = []
review_list = []

error_url_list = []
error_msg_list = []
reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')

for d in tqdm(reviews):
    try:
        sel2 = Selector(text = d.get_attribute('innerHTML'))
        try:
            rating = sel2.css('.rating-other-user-rating span::text').extract_first()
        except Exception:
            rating = np.nan
        try:
            # .get() returns only the first matching text node
            review = sel2.css('.text.show-more__control::text').get()
        except Exception:
            review = np.nan
        try:
            review_date = sel2.css('.review-date::text').extract_first()
        except Exception:
            review_date = np.nan
        try:
            author = sel2.css('.display-name-link a::text').extract_first()
        except Exception:
            author = np.nan
        try:
            review_title = sel2.css('a.title::text').extract_first()
        except Exception:
            review_title = np.nan

        rating_list.append(rating)
        review_date_list.append(review_date)
        review_title_list.append(review_title)
        author_list.append(author)
        review_list.append(review)

    except Exception as e:
        # url is not defined in this snippet; driver.current_url is the page being scraped
        error_url_list.append(driver.current_url)
        error_msg_list.append(e)
review_df = pd.DataFrame({
    'review_date':review_date_list,
    'author':author_list,
    'rating':rating_list,
    'review_title':review_title_list,
    'review':review_list
    })

Asked By: krsnbcd

||

Answers:

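Use .extract() (or its alias .getall()) instead of .get(). .get() returns only the first matching text node, and the review's line breaks (most likely <br> tags) split the text into several text nodes, so everything after the first break is lost. .extract() returns all matching text nodes as a list, which you can then concatenate into a single string with .join().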

review = sel2.css('.text.show-more__control::text').extract()
review = ' '.join(review)

Output:

‘For a teenager today, Dunkirk must seem even more distant than the
Boer War did to my generation growing up just after WW2. For some,
Christopher Nolan’s film may be the most they will know about the
event. But it’s enough in some ways because even if it doesn’t show
everything that happened, maybe it goes as close as a film could to
letting you know how it felt. "Dunkirk" focuses on a number of
characters who are inside the event, living it ….’

Answered By: JayPeerachai