XPath match is incomplete when the content contains special characters

Question:

import requests
from lxml import etree

url = "https://music.163.com/discover/toplist?id=19723756"
headers = {
    'User-Agent': "PostmanRuntime/7.15.2",
    }
response = requests.request("GET", url, headers=headers)

# The page data contains special characters such as "<" and "&".
r = etree.HTML(response.text)

l = r.xpath("//textarea[@id='song-list-pre-data']/text()")

print(l)

The end of l as printed:
lLevel":"exhigh","pl":320000},"djid":0,"fee":0,"album":{"id":158052587,"name":"Sakana~( ˵>ㅿㅿn’]

The match is incomplete because of the special characters.
How can I solve this problem?

==============================

import requests
from bs4 import BeautifulSoup

url = "https://music.163.com/discover/toplist?id=3779629"
headers = {
    'user-agent': "PostmanRuntime/7.15.2"
    }
response = requests.request("GET", url, headers=headers)

soup = BeautifulSoup(response.text, "html.parser")
textarea = soup.find('textarea', attrs={'id': 'song-list-pre-data'}).get_text()


print(textarea)

I finally solved the problem.
Parse with BeautifulSoup, and pass "html.parser" as the parser instead of "lxml".
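
One way to confirm that the extracted text is no longer cut short is to try decoding it as JSON. The sketch below assumes the textarea holds a JSON array, as the printed fragment suggests. The helper name is_complete_json and the sample strings are made up for illustration and are not taken from the page.

```python
import json


def is_complete_json(text):
    """Return True if text parses as JSON (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Made-up sample data containing "<" and "&", standing in for the textarea text.
full = '[{"id": 1, "name": "demo <3 & more"}]'

# Simulate a parser that stops at the first "<".
truncated = full[:full.index("<")]

print(is_complete_json(full))       # True
print(is_complete_json(truncated))  # False
```

With the real page, the same check could be applied to the textarea variable before using the data.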

Asked By: rui0908


Answers:

import requests
from lxml import etree
from html import unescape

url = "https://music.163.com/discover/toplist?id=19723756"
headers = {
    'User-Agent': "PostmanRuntime/7.15.2",
    }
response = requests.request("GET", url, headers=headers)

html = unescape(response.text)
r = etree.HTML(html)

l = r.xpath("//textarea[@id='song-list-pre-data']/text()")

print(l)

This converts HTML entities such as ‘&lt;’ and ‘&amp;’ back into the characters ‘<’ and ‘&’ before the HTML is parsed with etree, on the assumption that the entities are what cuts the match short.
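
For reference, a minimal sketch of what html.unescape does to a string. The sample fragment is made up for illustration and is not taken from the page.

```python
from html import unescape

# Made-up fragment standing in for part of the textarea content.
escaped = '{"name":"Tom &amp; Jerry &lt;live&gt;"}'

# unescape() turns entities back into literal characters; it does not escape anything.
decoded = unescape(escaped)

print(decoded)  # {"name":"Tom & Jerry <live>"}
```

Note that the decoded string now contains a literal "<live>", which an HTML parser may read as a tag, so unescaping the whole page before parsing can introduce tag-like sequences as well as remove entities.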

Answered By: Charlotte Yu

For HTML input, I often use BeautifulSoup; it seems to be more robust with broken or unusual data.

import requests
from bs4 import BeautifulSoup

url = "https://music.163.com/discover/toplist?id=19723756"
headers = {
   'User-Agent': "PostmanRuntime/7.15.2",
}
response = requests.request("GET", url, headers=headers)

soup = BeautifulSoup(response.text, "lxml")
textarea = soup.find('textarea', attrs={'id': 'song-list-pre-data'}).get_text()

print(textarea)

If you need a "real" XPath Expression, try:

import requests
from bs4 import BeautifulSoup
from lxml import etree

url = "https://music.163.com/discover/toplist?id=19723756"
headers = {
   'User-Agent': "PostmanRuntime/7.15.2",
}
response = requests.request("GET", url, headers=headers)

soup = BeautifulSoup(response.text, "lxml")
dom = etree.HTML(str(soup))
textarea = dom.xpath("//textarea[@id='song-list-pre-data']/text()")

print(textarea)
Answered By: leu