html-content-extraction

BeautifulSoup Grab Visible Webpage Text

How to scrape only visible webpage text with BeautifulSoup? Question: Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even a few tab names here and there. I …

Total answers: 11

Extract part of a regex match

Question: I want a regular expression to extract the title from an HTML page. Currently I have this: title = re.search('<title>.*</title>', html, re.IGNORECASE).group() if title: title = title.replace('<title>', '').replace('</title>', '') Is there a regular expression to extract just the contents of <title> so I don’t have to remove the …

Total answers: 11
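
One common way to get only the contents of <title> is a capture group, so that group(1) returns the text between the tags and no replace() calls are needed. The following is a minimal sketch, not taken from any of the listed answers; the sample html string is made up for illustration, and a regex like this assumes a simple, well-formed title element.

```python
import re

# Hypothetical sample input, not from the original question.
html = "<html><head><TITLE>Example Page</TITLE></head><body></body></html>"

# The parentheses form a capture group; .*? is non-greedy so the match
# stops at the first closing tag. DOTALL lets the title span several lines.
match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)

# re.search returns None when nothing matches, so check before calling group().
title = match.group(1).strip() if match else None
print(title)
```

Checking the match object before calling group() also avoids the AttributeError that an unguarded .group() call raises when the page has no title.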

Using BeautifulSoup to find a HTML tag that contains certain text

Question: I’m trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11} <h2> this is cool #12345678901 </h2> So, the previous would match by using: soup('h2', text=re.compile(r' #\S{11}')) And the results would be something like: [u'blahblah #223409823523', u'thisisinteresting …

Total answers: 3

Extracting text from HTML file using Python

Question: I’d like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad. I’d like something more robust than using regular expressions that may fail on …

Total answers: 35
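
As a rough illustration of a parser-based approach to this kind of question, the sketch below uses only the standard library's html.parser to collect text while skipping script, style, and head content. The class name VisibleTextParser, the helper visible_text, and the sample markup are all hypothetical; the sketch does not evaluate CSS, so elements hidden with display:none would still be included, and it is not the method from any of the listed answers.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes that are not inside non-rendered elements."""

    # Elements whose contents a browser does not show as page text.
    SKIPPED = {"script", "style", "head", "title", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the collected text nodes, one per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical sample document for illustration.
sample = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Heading</h1><p>Body text.</p>"
    "<script>var x = 1;</script><!-- a comment --></body></html>"
)
print(visible_text(sample))
```

HTML comments are dropped because HTMLParser routes them to handle_comment, which this class leaves at its default no-op.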