scrapy: convert html string to HtmlResponse object

Question:

I have a raw HTML string that I want to convert to a Scrapy HtmlResponse object so that I can use the css and xpath selectors on it, just as I would on a regular Scrapy response. How can I do it?

Asked By: yayu


Answers:

First of all, if it is for debugging or testing purposes, you can use the Scrapy shell:

$ cat index.html
<div id="test">
    Test text
</div>

$ scrapy shell index.html
>>> response.xpath('//div[@id="test"]/text()').extract()[0].strip()
u'Test text'

There are different objects available in the shell during the session, like response and request.


Or, you can instantiate the HtmlResponse class and provide the HTML string in body:

>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="my HTML string", body='<div id="test">Test text</div>', encoding='utf-8')
>>> response.xpath('//div[@id="test"]/text()').extract()[0].strip()
u'Test text'
Answered By: alecxe

alecxe's answer is right, but if you only need the selectors, Scrapy lets you instantiate a Selector directly from text:

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'
Answered By: Mohsen Mahmoodi

You can import Scrapy's own Selector class and pass the HTML string as its text argument to be parsed.

from scrapy.selector import Selector


def get_list_text_from_html_string(html_string):
    html_item = Selector(text=html_string)
    elements = [_li.get() for _li in html_item.css('ul > li::text')]
    return elements

list_html_string = '<ul class="teams">\n<li>Bayern M.</li>\n<li>Palmeiras</li>\n<li>Liverpool</li>\n<li>Flamengo</li></ul>'
print(get_list_text_from_html_string(list_html_string))
# ['Bayern M.', 'Palmeiras', 'Liverpool', 'Flamengo']
Answered By: Kenny Aires