Building XPath Query from Easylist.txt to count number of ads on webpage

Question:

I’m trying to write a script that counts the number of ads on a webpage. I’m basing it on the following answer; however, I keep getting the following error:

File "src\lxml\xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
File "src\lxml\xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: unknown error

This is the script:

import lxml.etree
import lxml.html
import requests
import cssselect

translator = cssselect.HTMLTranslator()

rules = []
rules_file = "easylist.txt"

with open(rules_file, 'r',encoding="UTF-8") as f:
    for line in f:
        # elemhide rules are prefixed by ## in the adblock filter syntax
        if line[:3] == '##.':
            try:
                rules.append(translator.css_to_xpath(line[2:],prefix=""))
            except cssselect.SelectorError:
                # just skip bad selectors
                pass

query = "|".join(rules)

url = 'http://google.com'  # replace it with a url you want to apply the rules to  

html = requests.get(url).text
document = lxml.html.document_fromstring(html)

print(len(document.xpath(query)))

Any ideas on how to fix this error, or alternative ways to count the number of ads on a webpage, would be appreciated. This is my first time working with lxml, so I’m not sure what is likely to be causing the issue in the query. For your reference, the EasyList I’m using is linked here.

I’m pretty sure this is an issue with the query that’s being built from EasyList, because the code works when I hardcode a simple XPath query.

Asked By: Nathan Hoy


Answers:

I appreciate this is an old post, so I’m answering for the benefit of anyone else who needs a similar function. Evaluating each rule separately, instead of joining them all into one query, worked for me to count the ads:

import lxml.etree
import lxml.html
import requests
import cssselect

def count_ads(url):
    print("Counting ads")
    translator = cssselect.HTMLTranslator()
    rules_file = "easylist.txt"  # local copy of EasyList
    html = requests.get(url).text
    # Parse the page once instead of once per rule
    document = lxml.html.document_fromstring(html)
    count = 0
    with open(rules_file, 'r', encoding="UTF-8") as f:
        for line in f:
            # elemhide rules are prefixed by ## in the adblock filter syntax
            if line[:2] == '##':
                try:
                    # strip the trailing newline before translating the selector
                    rule = translator.css_to_xpath(line[2:].strip())
                    count += len(document.xpath(rule))
                except (cssselect.SelectorError, lxml.etree.XPathEvalError):
                    pass  # skip selectors that cannot be translated or evaluated
    return count
Answered By: Nathan Hoy