Extract single div from HTML and save in place. TypeError: expected a character buffer object
Question:
I am trying to write a simple Python script to continue learning BeautifulSoup and Python.
The functionality I would like is simple: I’d like to extract a div from an HTML file and update the file so that it contains only this div. For example, if my HTML file, index.html, contained:
<head>
<title> Parsing HTML </title>
</head>
<body>
<h1> Title </h1>
<div class="content">
<p> This is the content </p>
<img src="img.jpg" />
</div>
</body>
After my program runs, I’d like index.html to contain only:
<div class="content">
<p> This is the content </p>
<img src="img.jpg" />
</div>
So <div class="content"> would be used as the parameter that identifies what to extract from the HTML.
I assume BeautifulSoup is the right tool for this. Here is my attempt (for the HTML above); I’ve also tried to make it recurse into subfolders:
import os
from bs4 import BeautifulSoup


def CleanUpFolder(dir):
    directory = os.listdir(dir)
    files = []
    for file in directory:
        if file.endswith('.html'):
            files.insert(0, file)
        if os.path.isdir(file):
            CleanUpFolder(file)
    for fileName in files:
        file = open(dir + "\\" + fileName)
        content = file.read()
        file.close()
        soup = BeautifulSoup.BeautifulSoup(content)
        toWrite = soup.find("div", {"class": "main"})
        file = open(dir + "\\" + fileName, 'w')
        file.write(toWrite)
        file.close()


dir = "C:\\Users\\Folder\\Desktop\\testFolder"
CleanUpFolder(dir)
My errors are:
Traceback (most recent call last):
File "C:/Users/Admin/PycharmProjects/Extract-Main-2.py", line 25, in <module>
CleanUpFolder(dir)
Line 25 is the final line, CleanUpFolder(dir). I don’t understand what is causing this.
I am also getting:
File "C:/Users/Admin/PycharmProjects/Extract-Main-2.py", line 20, in CleanUpFolder
file.write(toWrite)
TypeError: expected a character buffer object
I took this from sample code in the BeautifulSoup docs, so I don’t understand why it doesn’t work.
I am finding BeautifulSoup far more difficult to get my head around than I expected! What can I try to resolve this?
Answers:
You should either
import bs4
...
soup = bs4.BeautifulSoup(content)
or
from bs4 import BeautifulSoup
...
soup = BeautifulSoup(content)
The issue is caused by running code written for BeautifulSoup 3, where the module itself was imported as BeautifulSoup, on BeautifulSoup 4. Both modules contain a class called BeautifulSoup, and it is that class your code should call.
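The TypeError in the title is a separate problem that the import fix alone does not solve: soup.find() returns a Tag object (or None when nothing matches, which is what happens when searching for class "main" in HTML that only has class "content"), while file.write() expects a string. A minimal sketch of one way to put both fixes together is below. It assumes Python 3 and bs4; the names extract_div and clean_up_folder are made up for illustration, and os.walk() is swapped in for the hand-written recursion because os.path.isdir(file) in the question tests a bare file name rather than the full path.

```python
import os

from bs4 import BeautifulSoup


def extract_div(html, class_name):
    """Return the first <div> with the given class as a string, or None."""
    # "html.parser" is the parser bundled with Python; naming it avoids
    # depending on whichever third-party parser happens to be installed.
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", {"class": class_name})
    # find() returns a Tag (or None when nothing matches), not a string,
    # so convert it before handing it to file.write().
    return None if tag is None else str(tag)


def clean_up_folder(root, class_name="content"):
    """Rewrite every .html file under root so it holds only the matching div."""
    changed = []
    # os.walk() visits subfolders itself, so no manual recursion is needed.
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            if not name.endswith(".html"):
                continue
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8") as handle:
                extracted = extract_div(handle.read(), class_name)
            if extracted is None:
                continue  # leave files without the div untouched
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(extracted)
            changed.append(path)
    return changed
```

Calling clean_up_folder("testFolder") would rewrite the matching files in place and return the paths it changed. Files without the div are skipped rather than emptied, which is a design choice you may want to change.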