scrapy: convert html string to HtmlResponse object

Question

I have a raw HTML string that I want to convert to a Scrapy HtmlResponse object so that I can use the css and xpath selectors, just as with Scrapy’s response. How can I do it?

Asked By: yayu

Answer 1

First of all, if it is for debugging or testing purposes, you can use the Scrapy shell:

$ cat index.html
<div id="test">
    Test text
</div>

$ scrapy shell index.html
>>> response.xpath('//div[@id="test"]/text()').extract()[0].strip()
u'Test text'

The shell session exposes several ready-made objects, such as response and request.

Or you can instantiate HtmlResponse directly and pass the HTML string as body. The url argument is required, but any placeholder URL works here because nothing is downloaded:

>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="http://example.com", body='<div id="test">Test text</div>', encoding='utf-8')
>>> response.xpath('//div[@id="test"]/text()').extract()[0].strip()
u'Test text'

Answered By: alecxe

Answer 2

alecxe's answer works, but if all you need is selectors, you can instantiate a Selector directly from the text:

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'

Answered By: Mohsen Mahmoodi

Answer 3

You can import Scrapy's native Selector class and pass the HTML string as its text argument to be parsed.

from scrapy.selector import Selector


def get_list_text_from_html_string(html_string):
    # Parse the raw HTML string; no Response object is needed.
    html_item = Selector(text=html_string)
    # getall() returns the text of every matching <li> as a list of strings.
    return html_item.css('ul > li::text').getall()

list_html_string = '<ul class="teams">\n<li>Bayern M.</li>\n<li>Palmeiras</li>\n<li>Liverpool</li>\n<li>Flamengo</li></ul>'
print(get_list_text_from_html_string(list_html_string))
# Output: ['Bayern M.', 'Palmeiras', 'Liverpool', 'Flamengo']

Answered By: Kenny Aires
