How to discard innerText between certain elements using Selenium Python?

Question:

I am writing a webpage scraper to collate sentences in Japanese. My source uses furigana: small characters displayed above kanji to indicate their pronunciation. I do not want this furigana to appear in my scraped sentences.

The website’s source HTML looks something like this (source: https://www3.nhk.or.jp/news/easy/k10014010651000/k10014010651000.html):

<article class = "article-main">
<p>
<span class="colorB">16</span><span class="color4"><ruby>日<rt>にち</rt></ruby></span>
<span class="colorB">、</span><span class="colorL"><ruby>韓国<rt>かんこく</rt></ruby>
</span>
</p>
</article>

This renders the characters between the <rt> tags above the kanji: にち above 日 and かんこく above 韓国.

I currently scrape the article-main element and use get_attribute("innerText") to extract the article text, as follows:

element = browser.find_element(By.CLASS_NAME, "article-main")
article = element.get_attribute("innerText")
print(article)

However, this outputs the furigana after the kanji within the sentences, so I end up with output that looks like 16日にち、韓国かんこく instead of 16日、韓国. How can I remove the contents between the <rt> tags?

I have tried finding the elements with the tag name "rt" and replacing their text with "", as below:

element = browser.find_element(By.CLASS_NAME, "article-main")
html = element.get_attribute("innerHTML")
furigana = element.find_elements(By.TAG_NAME, "rt")
print(element.innerText.replace(furigana.innerText, ''))

However, the WebElement object has no innerText attribute. What approach can I take to isolate and remove the rt elements using Python?

Asked By: hpbristol

Answers:

You can use JavaScript to remove each of the <rt> elements.

furigana = element.find_elements(By.TAG_NAME, "rt")
browser.execute_script("for (const el of arguments[0]) el.remove();", furigana)

After that, you can read the innerText of the element.

article = element.get_attribute("innerText")
Answered By: Unmitigated
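
If modifying the live page is undesirable, a different approach is to leave the DOM alone and strip the furigana from the innerHTML string in Python. The following is a minimal sketch using only the standard-library html.parser module. The names RubyTextStripper and strip_furigana are hypothetical helpers introduced for illustration, not part of Selenium. It assumes every <rt> has an explicit closing tag, as in the markup from the question, and it also skips <rp> elements (the optional fallback parentheses in ruby markup).

```python
from html.parser import HTMLParser


class RubyTextStripper(HTMLParser):
    """Collects text content while skipping anything inside <rt> or <rp>."""

    SKIPPED_TAGS = {"rt", "rp"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <rt> or <rp>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_furigana(html):
    """Return the text of an HTML fragment without its furigana."""
    parser = RubyTextStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


# Markup adapted from the question, joined onto one line.
sample = (
    '<span class="colorB">16</span>'
    '<span class="color4"><ruby>日<rt>にち</rt></ruby></span>'
    '<span class="colorB">、</span>'
    '<span class="colorL"><ruby>韓国<rt>かんこく</rt></ruby></span>'
)
print(strip_furigana(sample))  # prints 16日、韓国
```

With Selenium this would be called as strip_furigana(element.get_attribute("innerHTML")). Unlike innerText, this simply concatenates the text nodes, so whitespace and line breaks from the source HTML are kept as they are and may need tidying afterwards.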