Building XPath Query from Easylist.txt to count number of ads on webpage
Question:
I’m trying to write a script that counts the number of ads on a webpage. I’m basing the script on the following answer; however, I keep getting this error:
File "src\lxml\xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
File "src\lxml\xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: unknown error
This is the script:
```python
import lxml.etree
import lxml.html
import requests
import cssselect

translator = cssselect.HTMLTranslator()
rules = []
rules_file = "easylist.txt"

with open(rules_file, 'r',encoding="UTF-8") as f:
    for line in f:
        # elemhide rules are prefixed by ## in the adblock filter syntax
        if line[:3] == '##.':
            try:
                rules.append(translator.css_to_xpath(line[2:],prefix=""))
            except cssselect.SelectorError:
                # just skip bad selectors
                pass

query = "|".join(rules)
url = 'http://google.com' # replace it with a url you want to apply the rules to
html = requests.get(url).text
document = lxml.html.document_fromstring(html)
print(len(document.xpath(query)))
```
Any ideas on how to fix this error, or alternative ways to count the ads on a webpage, would be appreciated. This is my first time working with lxml, so I’m not sure what is causing the issue in the query. For reference, the EasyList I’m using is linked here.
I’m pretty sure the problem lies in the query built from EasyList, because the code works when I hardcode a simple XPath query.
Answers:
I appreciate this is an old post, so I’m answering for the benefit of anyone else who needs a similar function. Evaluating each rule on its own, instead of joining them all into one query, worked for me to count the ads:
```python
import lxml.etree
import lxml.html
import requests
import cssselect


def count_ads(url):
    print("Counting ads")
    translator = cssselect.HTMLTranslator()
    rules_file = "easylist.txt"  # path to a local copy of EasyList
    html = requests.get(url).text
    # Parse the page once instead of re-parsing it for every rule
    document = lxml.html.document_fromstring(html)
    count = 0
    with open(rules_file, 'r', encoding="UTF-8") as f:
        for line in f:
            # elemhide rules are prefixed by ## in the adblock filter syntax
            if line[:2] == '##':
                try:
                    # strip the trailing newline before translating the selector
                    rule = translator.css_to_xpath(line[2:].strip())
                    # evaluate each rule on its own rather than as one huge union
                    count += len(document.xpath(rule))
                except (cssselect.SelectorError, lxml.etree.XPathEvalError):
                    pass  # skip selectors that cannot be translated or evaluated
    return count
```