(XPath) The matched text is incomplete because the page contains special characters
Question:
import requests
from lxml import etree
url = "https://music.163.com/discover/toplist?id=19723756"
headers = {
'User-Agent': "PostmanRuntime/7.15.2",
}
response = requests.request("GET", url, headers=headers)
'''
The textarea content contains special characters such as "<" and "&"
'''
r = etree.HTML(response.text)
l = r.xpath("//textarea[@id='song-list-pre-data']/text()")
print(l)
The end of l is cut off:
lLevel":"exhigh","pl":320000},"djid":0,"fee":0,"album":{"id":158052587,"name":"Sakana~( ˵>ㅿㅿn']
Incomplete Matching Information Due to Special Characters
How can I solve this problem?
==============================
import requests
from bs4 import BeautifulSoup
url = "https://music.163.com/discover/toplist?id=3779629"
headers = {
'user-agent': "PostmanRuntime/7.15.2"
}
response = requests.request("GET", url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
textarea = soup.find('textarea', attrs={'id': 'song-list-pre-data'}).get_text()
print(textarea)
I've finally solved the problem: parse with BeautifulSoup, and configure the html.parser parser instead of lxml.
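Once the full textarea text is extracted, it is a JSON array of song objects, so it can be decoded with the standard json module. A minimal sketch; the sample markup and field values below are illustrative stand-ins, not the real page data:

```python
import json
from bs4 import BeautifulSoup

# Illustrative stand-in for the page: html.parser extracts the textarea
# content as text, decoding entities such as &amp; along the way.
html = '<textarea id="song-list-pre-data">[{"name":"A &amp; B","id":158052587}]</textarea>'
raw = BeautifulSoup(html, "html.parser").find("textarea", id="song-list-pre-data").get_text()

songs = json.loads(raw)  # the textarea holds a JSON array of song objects
print(songs[0]["name"])  # A & B
print(songs[0]["id"])    # 158052587
```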
Answers:
import requests
from lxml import etree
from html import unescape
url = "https://music.163.com/discover/toplist?id=19723756"
headers = {
'User-Agent': "PostmanRuntime/7.15.2",
}
response = requests.request("GET", url, headers=headers)
html = unescape(response.text)
r = etree.HTML(html)
l = r.xpath("//textarea[@id='song-list-pre-data']/text()")
print(l)
This should convert HTML entities such as '&lt;' and '&amp;' back into the literal characters '<' and '&' before the HTML is parsed with etree, so they no longer cause issues during parsing.
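As a quick check of what unescape() actually does (a stdlib-only sketch with an illustrative input string):

```python
from html import unescape

# unescape() turns HTML entities back into the characters they encode.
print(unescape("a &lt; b &amp;&amp; c &gt; d"))  # a < b && c > d
```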
For HTML input, I often use BeautifulSoup; it seems to be more robust with broken or unusual data.
import requests
from bs4 import BeautifulSoup
url = "https://music.163.com/discover/toplist?id=19723756"
headers = {
'User-Agent': "PostmanRuntime/7.15.2",
}
response = requests.request("GET", url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
textarea = soup.find('textarea', attrs={'id': 'song-list-pre-data'}).get_text()
print(textarea)
If you need a "real" XPath expression, try:
import requests
from bs4 import BeautifulSoup
from lxml import etree
url = "https://music.163.com/discover/toplist?id=19723756"
headers = {
'User-Agent': "PostmanRuntime/7.15.2",
}
response = requests.request("GET", url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
dom = etree.HTML(str(soup))
textarea = dom.xpath("//textarea[@id='song-list-pre-data']/text()")
print(textarea)