Extracting images from a .RTF file with Python
Question:
Does anyone know how to extract or copy images from a .rtf file?
I have tried to look for a solution, but from what I found, all of the libraries and articles people cite no longer exist or have non-existent documentation.
Answers:
Yes, it is possible, but you may have to use an older version of Python, something like 2.7:
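import base64
import pyth.plugins.rtf15.reader as reader

# Rtf15Reader.read is a classmethod that takes the open file
with open('Document.rtf', 'rb') as rtf_file:
    doc = reader.Rtf15Reader.read(rtf_file)

for element in doc.content:
    if element.__class__.__name__ == 'Image':
        image_data = base64.b64decode(element.binary_data)
        # plain filename instead of an f-string, which Python 2.7 does not support
        with open(element.filename, 'wb') as f:
            f.write(image_data)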
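If pyth is not an option, here is a minimal sketch of a different technique that needs only the standard library. It relies on the fact that RTF stores embedded pictures in {\pict ...} groups, usually as hexadecimal text. The function name extract_pict_images and the extension table are my own choices for illustration. It assumes hex-encoded pictures only, so it does not handle raw binary data written with \bin.

```python
import binascii
import re

# Map RTF picture-type control words to file extensions.
BLIP_EXTENSIONS = {
    "pngblip": "png",
    "jpegblip": "jpg",
    "emfblip": "emf",
    "wmetafile": "wmf",
}


def extract_pict_images(rtf_text):
    """Return a list of (extension, bytes) for hex-encoded \\pict groups."""
    images = []
    start = rtf_text.find("{\\pict")
    while start != -1:
        depth = 0
        top_level = []  # characters sitting directly inside the \pict group
        pos = start
        while pos < len(rtf_text):
            ch = rtf_text[pos]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    break  # reached the brace that closes the \pict group
            elif depth == 1:
                # Skip nested groups such as {\*\blipuid ...}, which also hold hex.
                top_level.append(ch)
            pos += 1
        body = "".join(top_level)
        # Control words look like \name or \name123; collect the names.
        words = re.findall(r"\\([a-z]+)-?\d*", body)
        ext = next((BLIP_EXTENSIONS[w] for w in words if w in BLIP_EXTENSIONS), "bin")
        # Drop the control words, then keep only the hex digits that remain.
        hex_data = re.sub(r"\\[a-z]+-?\d* ?", "", body)
        hex_data = re.sub(r"[^0-9a-fA-F]", "", hex_data)
        if hex_data and len(hex_data) % 2 == 0:
            images.append((ext, binascii.unhexlify(hex_data)))
        start = rtf_text.find("{\\pict", pos)
    return images
```

To use it, read the .rtf file as text (for example with encoding='latin-1'), call extract_pict_images on the contents, and write each returned byte string to a file with the returned extension.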
Since I didn’t find a straightforward solution to extract images from a .rtf file, I came up with a workaround.
I used the win32com library to open the file in Word and then saved it as a .docx:
import win32com.client

word = win32com.client.Dispatch('Word.Application')
doc = word.Documents.Open(RtfFilePath)
# FileFormat=16 is wdFormatDocumentDefault, i.e. .docx
doc.SaveAs(saveDocxPath, FileFormat=16)
doc.Close()
word.Quit()
This way you can use docx2txt and other libraries that extract images from Word files:
import docx2txt

# the second argument is the folder the images are written to
text = docx2txt.process("/path/your_word_doc.docx", '/home/example/img/')
Also, I found out that some images can be saved as .wmf; these files can’t be extracted this way. I found a workaround for this by unpacking the .docx with a command:
import subprocess

# passing the arguments as a list keeps paths with spaces intact
subprocess.run(["tar", "-x", "-f", FileToExtract, "-C", TargetFolder], check=True)
The extracted images will be located in TargetFolder\word\media.
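Since a .docx file is a ZIP archive, the same unpacking step can also be done without an external tar command. Here is a minimal sketch using Python's zipfile module. The function name extract_docx_media is my own, and it copies only the files under word/media/ into the target folder rather than unpacking the whole archive.

```python
import zipfile
from pathlib import Path


def extract_docx_media(docx_path, target_folder):
    """Copy every file under word/media/ in the .docx into target_folder."""
    target = Path(target_folder)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything outside word/media/.
            if name.startswith("word/media/") and not name.endswith("/"):
                out_path = target / Path(name).name
                out_path.write_bytes(archive.read(name))
                saved.append(out_path)
    return saved
```

Called as extract_docx_media(saveDocxPath, TargetFolder), it returns the list of paths it wrote, including any .wmf files.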
You can convert them into any other image type using the Pillow library with this code:
from PIL import Image
Image.open("image.wmf").save("image.png")