Web scraping with Python without loading the whole page

Question:

I just started a few web scraping projects with Python. I currently use the lxml, Beautiful Soup and requests modules to scrape web pages. I need to know if there is any method to get only the data we need from a website instead of loading the whole page. The requests module does a GET request and receives the whole page; bs4 and lxml only filter the data afterwards. I tried out Selenium, but that opens a browser, which is not suitable for an industrial project. I’m not familiar with Scrapy or Splash.

I’m also not looking for the API-key method, since that is not available for every site.

Asked By: AVDiv


Answers:

Reverse engineer the API calls.

You should analyze the Network tab for the incoming and outgoing requests and view the response for each request. Alternatively, you can copy a request as cURL and use Postman to analyze it. Postman has a feature that generates Python code for the requests and urllib libraries. Most sites return a JSON response, but sometimes you may get HTML instead. A minimal sketch of calling such an endpoint directly is shown below.
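Here is a minimal sketch of what this looks like once you have found a JSON endpoint in the Network tab. The URL, query parameters, headers and response fields below are hypothetical placeholders; replace them with whatever the site actually sends.

```python
import requests

# Hypothetical JSON endpoint discovered in the browser's Network tab;
# substitute the real URL, params and headers you observe for the site.
API_URL = "https://www.example.com/api/v1/products"

params = {"page": 1, "category": "books"}
headers = {
    # Some endpoints check these, so mirror what the browser sends.
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json",
}

response = requests.get(API_URL, params=params, headers=headers, timeout=10)
response.raise_for_status()

data = response.json()  # only the data you need, no HTML parsing required
for item in data.get("items", []):
    print(item.get("name"), item.get("price"))
```

This way requests fetches only the JSON payload behind the page rather than the full rendered HTML, which is usually much smaller and easier to parse.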

Some sites do not allow scraping.
Make sure to check robots.txt for the website you will be scraping. You can find it at www.sitename.com/robots.txt.
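A quick sketch of checking robots.txt programmatically with Python's standard library; the domain, path and user-agent string here are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Load the site's robots.txt ("www.example.com" is a placeholder).
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

url = "https://www.example.com/some/page"
if rp.can_fetch("MyScraperBot", url):
    print("Allowed to fetch:", url)
else:
    print("Disallowed by robots.txt:", url)
```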

For more info – https://www.youtube.com/watch?v=LPU08ZfP-II&list=PLL2hlSFBmWwwvFk4bBqaPRV4GP19CgZug

Answered By: Suyash Jawale