python lxml.etree generating empty tree when given requests.get().text

Question:

I am trying to build a web scraper for TED Talks and am running into an issue when extracting transcripts. I am using Python 3.10.4 and lxml 4.9.2. First, I fetch the HTML response like this:

text = requests.get('https://www.ted.com/talks/ted_countdown_how_do_we_get_the_world_off_fossil_fuels_quickly_and_fairly/transcript', headers={}).text

Checking the value of text shows that it successfully collected HTML output. The response is too large to attach in full, but it begins with:

<!DOCTYPE html><html><head><link rel="preconnect" href="https://graphql.ted.com"/><link rel="dns-prefetch" href="https://graphql.ted.com"/><script src="https://cdn.cookielaw.org/consent/eb3a3101-85ef-45e5-a75f-dbd35e8d0b4d/OtAutoBlock.js"></script><script src="https://cdn.cookielaw.org/scripttemplates/otSDKStub.js" charSet="UTF-8" data-domain-script="eb3a3101-85ef-45e5-a75f-dbd35e8d0b4d"></script><script>function OptanonWrapper() {\n  const categoriesConsentedTo = window.OnetrustActiveGroups;\n  /* eslint-reason The underscores help keep this name unique and make the\n   * variable name ugly, to dissuade users from referencing it unless absolutely\n   * necessary (similar to React's dangerouslySetInnerHTML.__html prop) */\n\n  /* eslint-disable-next-line no-underscore-dangle */\n\n  window.__userHasConsentedToTargetingCookies = window.OnetrustActiveGroups.includes('C0004');\n  const activeGroupsLoaded = new Event('OnetrustActiveGroupsLoaded');\n  window.dispatchEvent(activeGroupsLoaded);\n  window.OneTrust.OnConsentChanged(() => {\n    if (categoriesConsentedTo === window.OnetrustActiveGroups) {\n      return;\n    }\n\n    window.location.reload();\n  });\n}</script><script type="text/javascript" id="vwoCode"> window._vwo_code=window._vwo_code || (function() { var account_id=613676, version=1.3, settings_tolerance=2000, library_tolerance=2500, use_existing_jquery=false, is_spa=1, hide_element='body',

This is how I generate the tree:

tree = lxml.etree.HTML(text)

Once I do this, tree.text is empty, and tree.getchildren() returns this:

[<Element head at 0x169068400>, <Element body at 0x1690ed780>]

For both of these elements, element.text is empty as well.

I was expecting a tree that I could run XPath searches on, specifically for
/html/body/div[1]/div/main/div/div/div/aside/div[2]/div[2]/div/div/div[1]/div[3]/div[NUMBER]/span
where NUMBER is the index of the transcript block to be read. Instead, any time I search for that XPath, lxml.etree returns an empty list. I have also tried generating the element object with lxml.html.fromstring, and it also has empty text and XPath searches that return nothing.

Here is a simple code block you can use to reproduce the issue.

import lxml.etree  # "import lxml" alone does not load the etree submodule
import requests
link = 'https://www.ted.com/talks/ted_countdown_how_do_we_get_the_world_off_fossil_fuels_quickly_and_fairly/transcript'
response = requests.get(link, headers={})
text = response.text
tree = lxml.etree.HTML(text)
print(tree.text)
Asked By: Sebewe


Answers:

You’re asking for tree.text, but tree is the outer HTML element:

>>> import lxml.etree
>>> import requests
>>> link = 'https://www.ted.com/talks/ted_countdown_how_do_we_get_the_world_off_fossil_fuels_quickly_and_fairly/transcript'
>>> response = requests.get(link, headers={})
>>> text = response.text
>>> tree = lxml.etree.HTML(text)
>>> tree
<Element html at 0x7f60d96ac140>

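The text attribute of an element only returns the text directly contained by that element, before its first child element. The html element directly contains only the head and body elements, so its text is empty. For example: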

>>> tree = lxml.etree.HTML('<html><body><p>This is some text</p></body></html>')
>>> tree.find('body').find('p').text
'This is some text'
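
To make this concrete, here is a minimal sketch on a small hand-written document (not the TED page; the markup and variable names are made up for illustration). It shows where lxml stores each piece of text: .text holds only what comes before the first child element, .tail holds what follows an element's closing tag, and itertext() walks every text node below an element.

```python
import lxml.etree

# Hand-written sample markup, used only for illustration.
doc = lxml.etree.HTML(
    "<html><body>"
    "<p>Hello <b>bold</b> world</p>"
    "<div><span>inner</span></div>"
    "</body></html>"
)

p = doc.find("body/p")
div = doc.find("body/div")

# .text stops at the first child element.
p_text = p.text        # only the text before <b>
div_text = div.text    # None: the div's text lives inside <span>

# Text that follows a child's closing tag is stored on that child's .tail.
b_tail = p.find("b").tail

# itertext() yields every text node under the element, in document order.
p_full = "".join(p.itertext())
div_full = "".join(div.itertext())
html_full = "".join(doc.itertext())

print(repr(p_text), repr(div_text), repr(b_tail))
print(repr(p_full), repr(div_full), repr(html_full))
```

The root html element behaves like the div here: its own .text is None, but joining itertext() still reaches the text nested further down.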

Given your document, we can ask for the text attribute of elements such as links:

>>> [x.text for x in tree.xpath('//a')]
['Skip to main content', 'Skip to search', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'climate change', 'environment', 'global issues', 'science', 'sustainability', 'technology', 'business', 'pollution', 'Countdown', 'fossil fuels', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'Privacy Policy', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
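
The None entries are most likely links whose visible text is wrapped in a child element, or that contain no text at all (for example an image). That is an assumption about the TED markup rather than something checked here. A small sketch with made-up links shows the difference between .text and the XPath string() function, which concatenates all descendant text:

```python
import lxml.etree

# Three hypothetical links: plain text, text wrapped in a <span>, and an image only.
doc = lxml.etree.HTML(
    "<body>"
    '<a href="/a">plain</a>'
    '<a href="/b"><span>wrapped</span></a>'
    '<a href="/c"><img src="x.png"/></a>'
    "</body>"
)

links = doc.xpath("//a")

# .text only sees text placed directly inside <a>, before any child element.
direct = [a.text for a in links]

# string() returns the concatenated text of the element and all its descendants.
full = [a.xpath("string()") for a in links]

print(direct)
print(full)
```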

Or we can ask for all elements that directly contain a non-empty text node:

>>> [x for x in tree.xpath('//*[text() != ""]')]
[<Element script at 0x7f60d8eba2c0>, ..., <Element div at 0x7f60d7d4fa80>, <Element script at 0x7f60d7d4fac0>, <Element script at 0x7f60d7d4fb00>]
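
As a rough illustration of what that predicate selects, here is the same query run against a tiny hand-written document (the markup is invented for this sketch). Only elements that directly own a text node match; their ancestors do not. The //text() query is an alternative that returns the text nodes themselves.

```python
import lxml.etree

# Invented sample markup: text sits in <p>, the first <li>, and <i>.
doc = lxml.etree.HTML(
    "<body>"
    "<div><p>kept</p></div>"
    "<ul><li>one</li><li><i>two</i></li></ul>"
    "</body>"
)

# Elements with at least one non-empty text node as a direct child.
with_text = [el.tag for el in doc.xpath('//*[text() != ""]')]

# Every text node in the document, in document order.
all_text = [str(t) for t in doc.xpath("//text()")]

print(with_text)
print(all_text)
```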
Answered By: larsks