Python: exclude outer wrapping element when getting content via css/xpath selector

Question:

I tried this code to get the HTML content of element div.entry-content:

response.css('div.entry-content').get()

However, it returns the wrapping element too:

<div class="entry-content">
    <p>**my content**</p>
    <p>more content</p>
</div>

But I want just the contents, so in my case: <p>**my content**</p><p>more content</p>

I also tried an xpath selector response.xpath('//div[@class="entry-content"]').get(), but with the same result as above.

Based on F.Hoque’s answer below I tried:

response.xpath('//article/div[@class="entry-content"]//p/text()').getall() and response.xpath('//article/div[@class="entry-content"]//p').getall()

These, however, return lists: the first gives the text of each p element found, and the second gives each p element itself. What I want is the HTML contents of the div.entry-content element as a single value, without the wrapping element itself.

I’ve tried Googling, but can’t find anything.

Asked By: Adam


Answers:

Your content is in the <p> tags, not the <div>:

response.css('div.entry-content p').get()

or

response.xpath('//div[@class="entry-content"]/p').get()
Answered By: Guy

As you said, your main div contains multiple p tags and you want the content of those p tags. //p selects all the p tags under the div, and getall() returns them as a list:

response.xpath('//div[@class="entry-content"]//p').getall()

The following expression joins that list into a single string:

p_tags = ''.join([x.get() for x in response.xpath('//article/div[@class="entry-content"]//p')])
Answered By: F.Hoque