html-parsing | Page 3

Extracting an information from web page by machine learning

Question: I would like to extract a specific type of information from web pages in Python, for example postal addresses. An address can take thousands of forms, yet it is still somehow recognizable. Because there are so many forms, it would probably be very difficult to …

Total answers: 7

Difference between "findAll" and "find_all" in BeautifulSoup

Question: I would like to parse an HTML file with Python, and the module I am using is BeautifulSoup. It is said that the function find_all is the same as findAll. I’ve tried both of them, but I believe they are different: import urllib, urllib2, cookielib from BeautifulSoup …

Total answers: 2

Parsing HTML using Python

Question: I’m looking for an HTML parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects. If I have a document of the form: <html> <head>Heading</head> <body attr1='val1'> <div class='container'> <div id='class'>Something here</div> <div>Something else</div> </div> </body> </html> then it should give me a …

Total answers: 7

heavy regex – really time consuming

Question: I have the following regex to detect start and end script tags in an HTML file: <script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script> In short, it will catch: <script "NOT THIS</s" > "NOT THIS</s" </script> It works, but it takes a really long time to detect <script>, even minutes or hours for long strings. The …

Total answers: 3
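
A note on the excerpt above: the slowdown is typical of patterns that nest one quantifier inside another, such as (?:[^<]+|...)*, because a backtracking engine can split the same run of characters in very many ways before giving up. The following is only a minimal sketch of one alternative, assuming that each script body ends at the first closing script tag; SCRIPT_RE, find_scripts and the sample string are illustrative names and data, not taken from the original question or its answers.

```python
import re

# Assumption: the body of a script element ends at the first "</script>".
# The lazy ".*?" advances one character at a time and has no nested
# quantifier, so it avoids the exponential backtracking of the original.
SCRIPT_RE = re.compile(
    r"<script\b[^>]*>(.*?)</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def find_scripts(html):
    """Return the raw body of every script element found in html."""
    return [match.group(1) for match in SCRIPT_RE.finditer(html)]


# Hypothetical sample input, including a "</s" inside a script body.
sample = (
    '<p>a</p><script type="text/javascript">var x = "</s";</script>'
    "<script>y()</script>"
)
scripts = find_scripts(sample)
print(scripts)
```

A regex like this is still a heuristic; for real pages a proper HTML parser is usually the more robust choice.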

Iteratively parsing HTML (with lxml?)

Question: I’m currently trying to iteratively parse a very large HTML document (I know.. yuck) using lxml.etree.iterparse: "Incremental parser. Parses XML into a tree and generates tuples (event, element) in a SAX-like fashion." I am using an incremental/iterative/SAX approach to reduce the amount of memory used (I don’t want to …

Total answers: 5

Parse HTML/XML and find locations of elements in original document

Question: Is there a way to get the original location of an element in a document, i.e. the start and end character index, when parsing HTML/XML in Python? I’ve looked through the lxml documentation and couldn’t find anything. E.g. <a>1</a><b>2</b> … print tree.find('b').original_position # result: …

Total answers: 2

Web scraping – how to identify main content on a webpage

Question: Given a news article webpage (from any major news source such as the Times or Bloomberg), I want to identify the main article content on that page and throw out the other miscellaneous elements such as ads, menus, sidebars, and user comments. What’s a generic …

Total answers: 10

How to remove whitespace in BeautifulSoup

Question: I have a bunch of HTML I’m parsing with BeautifulSoup and it’s been going pretty well except for one minor snag. I want to save the output into a single-line string, with the following as my current output: <li><span class="plaincharacterwrap break"> Zazzafooky but one two three! </span></li> <li><span …

Total answers: 4

How can I use the python HTMLParser library to extract data from a specific div tag?

Question: I am trying to get a value out of an HTML page using the Python HTMLParser library. The value I want to get hold of is within this HTML element: … <div id="remository">20</div> … This is my HTMLParser …

Total answers: 4
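
The task described in the excerpt above can be sketched with the standard library alone. This is a minimal illustration, assuming Python 3, where the module is html.parser, and assuming the target div may contain nested div elements; DivTextParser and div_text are hypothetical names, not the asker's code or any of the posted answers.

```python
from html.parser import HTMLParser


class DivTextParser(HTMLParser):
    """Collect the text inside the first div whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target div
        self.done = False   # True once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            # A nested div inside the target: track it so the matching
            # end tag does not close the target too early.
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, target_id):
    """Return the stripped text content of the div with the given id."""
    parser = DivTextParser(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


value = div_text(
    '<html><body><div id="remository">20</div><div>other</div></body></html>',
    "remository",
)
print(value)
```

The depth counter is the design choice that matters here: without it, text from a nested div would be cut off at the first closing div tag.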

jquery-like HTML parsing in Python?

Question: Is there any way in Python that would allow me to parse an HTML document similar to what jQuery does? That is, I’d like to be able to use CSS selector syntax to grab an arbitrary set of nodes from the document, read their content/attributes, etc. Asked By: Roy Tang …

Total answers: 4