scrapy: convert html string to HtmlResponse object
Question:
I have a raw HTML string that I want to convert to a Scrapy HTML response object so that I can use the css and xpath selectors, similar to Scrapy's response. How can I do it?
Answers:
First of all, if it is for debugging or testing purposes, you can use the Scrapy shell:
$ cat index.html
<div id="test">
Test text
</div>
$ scrapy shell index.html
>>> response.xpath('//div[@id="test"]/text()').extract()[0].strip()
u'Test text'
Several objects are available in the shell during the session, such as response and request.
Or, you can instantiate the HtmlResponse class and pass the HTML string as body:
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="my HTML string", body='<div id="test">Test text</div>', encoding='utf-8')
>>> response.xpath('//div[@id="test"]/text()').extract()[0].strip()
u'Test text'
Alternatively, you can import Scrapy's own Selector class and pass the HTML string as the text argument to be parsed.
from scrapy.selector import Selector

def get_list_text_from_html_string(html_string):
    # Build a selector directly from the raw HTML string
    html_item = Selector(text=html_string)
    # Collect the text node of every <li> that is a direct child of a <ul>
    elements = [_li.get() for _li in html_item.css('ul > li::text')]
    return elements

list_html_string = '<ul class="teams">\n<li>Bayern M.</li>\n<li>Palmeiras</li>\n<li>Liverpool</li>\n<li>Flamengo</li></ul>'
print(get_list_text_from_html_string(list_html_string))
# Output: ['Bayern M.', 'Palmeiras', 'Liverpool', 'Flamengo']