Extract single div from HTML and save in place. TypeError: expected a character buffer object

Question:

I am trying to write a simple Python script to continue with my learning of BeautifulSoup/Python.

The functionality I would like is simple: I’d like to extract a div from an HTML file and update the HTML file to contain only the contents of this div. For example, if my HTML, index.html, contained:

<head> 
  <title> Parsing HTML </title> 
</head>
<body>
  <h1> Title </h1>

  <div class="content"> 
    <p> This is the content </p> 
    <img src="img.jpg" />
  </div>
</body>

After my program runs, I’d like index.html to contain only

 <div class="content"> 
        <p> This is the content </p> 
        <img src="img.jpg" />
 </div>

So <div class="content"> would be used as a parameter to identify what to extract from the HTML.

I guess you need to use BeautifulSoup to write this. Here is my attempt (for the code above). I’ve also tried to make it recursive:

import os
from bs4 import BeautifulSoup

def CleanUpFolder(dir):
    directory = os.listdir(dir)
    files = []

    for file in directory:
        if file.endswith('.html'):
            files.insert(0, file)
        if os.path.isdir(file):
            CleanUpFolder(file)
        for fileName in files:
            file = open(dir + "\\" + fileName)
            content = file.read()
            file.close()
            soup = BeautifulSoup.BeautifulSoup(content)
            toWrite = soup.find("div", {"class": "main"})
            file = open(dir + "\\" + fileName, 'w')
            file.write(toWrite)
            file.close()


dir = "C:\\Users\\Folder\\Desktop\\testFolder"
CleanUpFolder(dir)

My errors are:

Traceback (most recent call last):
  File "C:/Users/Admin/PycharmProjects/Extract-Main-2.py", line 25, in <module>
    CleanUpFolder(dir)

Line 25 is the final line (CleanUpFolder(dir)).

I don’t understand what is causing this.

I am also getting:

File "C:/Users/Admin/PycharmProjects/Extract-Main-2.py", line 20, in CleanUpFolder
file.write(toWrite)
TypeError: expected a character buffer object

I took this from some sample code in the BeautifulSoup docs, so I don’t understand why it doesn’t work.

I am finding BeautifulSoup far more difficult to get my head around than I should! What can I try to resolve this?

Asked By: Simon Kiely

Answers:

You should either

import bs4
    ...
soup = bs4.BeautifulSoup(content)

or

from bs4 import BeautifulSoup
    ...
soup = BeautifulSoup(content)

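The issue is caused by running code written for BeautifulSoup 3, where the module itself was imported as BeautifulSoup, against BeautifulSoup 4, where the package is called bs4. Both modules contain a class called BeautifulSoup, and it is that class your code should call. After from bs4 import BeautifulSoup the name BeautifulSoup already refers to the class, so BeautifulSoup.BeautifulSoup(content) looks up an attribute that does not exist.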
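The TypeError in the question's second traceback looks like a separate problem: soup.find() returns a Tag object (or None when nothing matches), while file.write() expects text, so the tag would need converting with str() first. Below is a minimal sketch of the whole task under some assumptions: Python 3, the built-in "html.parser" backend, and UTF-8 files. The names extract_div and clean_up_folder are hypothetical helpers made up for this illustration, and os.walk is swapped in for the hand-written recursion.

```python
import os
from bs4 import BeautifulSoup


def extract_div(html, css_class):
    """Return the first <div> carrying css_class as a string, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", class_=css_class)
    # find() returns a Tag object (or None), not text; write() needs a string.
    return str(tag) if tag is not None else None


def clean_up_folder(root, css_class):
    """Rewrite every .html file under root to hold only the matching div.

    Hypothetical helper: returns the list of paths that were rewritten.
    """
    changed = []
    # os.walk descends into subfolders, so no manual recursion is needed.
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            if not name.endswith(".html"):
                continue
            # os.path.join picks the right separator for the platform.
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8") as handle:
                div = extract_div(handle.read(), css_class)
            if div is None:
                continue  # leave files without a matching div untouched
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(div)
            changed.append(path)
    return changed
```

Calling clean_up_folder with the folder path and "content" would cover the example in the question. Note that the question's code searches for class "main", which does not appear in the sample HTML, so find() would return None for that file.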

Answered By: holdenweb