How to count the number of words in a .odt document?

Question:

I am trying to make a program that will go through all the folders and sub-folders, find all OpenOffice documents, open them and then count the words present in the file. The idea is to sum up the total later and output the total number of words found in a given folder.

I’m using the odfpy library for manipulating .odt files, but all the examples and documentation I can find are more concerned with adding things to the document, getting the style of a given element, replacing something, and so on. I can’t find any documentation or examples showing how to simply get the text in a document.

Edit: Thank you karatekraft, your answer was just what I needed. As posted, your code seemed to get the total number of characters rather than words, but fixing that was at least within my abilities!

The new count_words_in_file(file_list) looks like this! (Currently, it only checks the document set in the file_path variable, but it's too late at night to fix that now.)

from odf import teletype, text
from odf.opendocument import load

def count_words_in_file(file_list):
    # This function will open all found .odt files, count the words, and then sum the total
    # TODO: adjust so it goes through every file in file_list
    file_path = "test.odt"
    # Read document
    document_text = load(file_path)
    # Get all paragraphs in document
    all_paragraphs = document_text.getElementsByType(text.P)

    final_word_count = 0
    # For each paragraph, extract the text and count the number of words.
    for paragraph in all_paragraphs:
        paragraph_text = teletype.extractText(paragraph)
        # split() with no argument splits on any whitespace and never returns empty strings
        words = paragraph_text.split()
        print(words)
        final_word_count = final_word_count + len(words)

    print(f"Final word count: {final_word_count}")
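
A side note on the splitting step: splitting on a single space leaves empty strings wherever two spaces follow each other, and it does not split on tabs or line breaks. Calling split() with no argument treats any run of whitespace as one separator, so no cleanup loop is needed. A small sketch (count_words and sample are just illustrative names):

```python
def count_words(text):
    """Count words, treating any run of whitespace as one separator."""
    return len(text.split())


sample = "two  spaces\tand a tab"

# Splitting on " " keeps an empty string and leaves the tab inside a "word".
print(sample.split(" "))  # ['two', '', 'spaces\tand', 'a', 'tab']

# Splitting on any whitespace gives the five actual words.
print(sample.split())     # ['two', 'spaces', 'and', 'a', 'tab']
print(count_words(sample))
```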

Original program:

# This program will count the number of words and .odt docs
# in a folder and all its sub-folders. For ease of use it will check the folder above its
# current directory.

# Import the needed libraries
import os
from odf.opendocument import OpenDocumentText
from odf import text
# Make really sure the odfpy module is installed (pip install odfpy); getting it set up was a pain.

def main():

    # This variable is the current location of the script, obtained with os.path
    current_dir = os.path.dirname(os.path.abspath(__file__))
    # This variable is the directory above the current one.
    above_dir = os.path.dirname(current_dir)

    # Call the function to scan for .odt files
    file_list = scan_for_files(above_dir)

    # Call the function to open and count the .odt files
    count_words_in_file(file_list)

def scan_for_files(above_dir):
    # This list will store the path to all files found.
    file_list = []

    # This for-loop will go through all the folders that can be found
    for folder, subfolder, files in os.walk(above_dir):
        for file in files:
            complete_path = os.path.join(folder, file)

            file_list.append(complete_path)

    return file_list

def count_words_in_file(file_list):
    # This function will open all found .odt files, count the words, and then sum the total
    for file in file_list:
        if file.endswith(".odt"):
            textdoc = OpenDocumentText()
            for paragraph in textdoc.body.childNodes:
                print(paragraph)



main()
Asked By: Arvid Eriksson


Answers:

You can try this. It identifies all paragraphs, extracts the text from each paragraph, and adds up the total word count.

from odf import text, teletype
from odf.opendocument import load

file_path = "my_file.odt"

# Read document
document_text = load(file_path)
# Get all paragraphs in document
all_paragraphs = document_text.getElementsByType(text.P)

final_word_count = 0
# For each paragraph, extract the text and count the number of words.
for paragraph in all_paragraphs:
    paragraph_text = teletype.extractText(paragraph)
    # split() breaks the text on whitespace, so len() counts words, not characters
    final_word_count = final_word_count + len(paragraph_text.split())

print(f"Final word count: {final_word_count}")
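
To get back to the original goal of totalling a whole folder, the same idea can be wrapped in a few functions. This is only a sketch: find_odt_files, count_words_in_odt and count_words_in_folder are names made up here, and it assumes every .odt file it finds loads cleanly with odfpy. Like the code above, it only looks at text.P paragraphs.

```python
import os


def find_odt_files(root_dir):
    """Return the paths of all .odt files under root_dir, including sub-folders."""
    odt_files = []
    for folder, _subfolders, files in os.walk(root_dir):
        for name in files:
            if name.lower().endswith(".odt"):
                odt_files.append(os.path.join(folder, name))
    return sorted(odt_files)


def count_words_in_odt(file_path):
    """Count whitespace-separated words in the paragraphs of one .odt file."""
    # odfpy is imported here so that find_odt_files() stays usable without it.
    from odf import teletype, text
    from odf.opendocument import load

    document = load(file_path)
    total = 0
    for paragraph in document.getElementsByType(text.P):
        total += len(teletype.extractText(paragraph).split())
    return total


def count_words_in_folder(root_dir, count_words=count_words_in_odt):
    """Sum the word counts of every .odt file found under root_dir."""
    return sum(count_words(path) for path in find_odt_files(root_dir))
```

Calling count_words_in_folder("..") would then give the total for the folder above the current working directory. The count_words parameter is there only so a different counting function can be swapped in.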

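If installing odfpy is a problem, a rougher alternative (not what the code above does) is to read the file with the standard library only. An .odt file is a ZIP archive whose text lives in content.xml, with paragraphs stored as text:p elements and headings as text:h. The sketch below assumes that layout. The names count_words_in_odt_stdlib and _collect_text are made up for illustration, and it counts any text stored in content.xml, including comments and tracked changes.

```python
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
# Paragraphs and headings: each one starts and ends a run of words.
BLOCK_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}
# Elements that stand for whitespace but carry no character data.
SPACE_TAGS = {f"{{{TEXT_NS}}}s", f"{{{TEXT_NS}}}tab", f"{{{TEXT_NS}}}line-break"}


def _collect_text(element, parts):
    """Append the text found under element to parts, in document order."""
    if element.tag in SPACE_TAGS:
        parts.append(" ")
        return
    is_block = element.tag in BLOCK_TAGS
    if is_block:
        parts.append(" ")
    parts.append(element.text or "")
    for child in element:
        _collect_text(child, parts)
        parts.append(child.tail or "")
    if is_block:
        parts.append(" ")


def count_words_in_odt_stdlib(path):
    """Count whitespace-separated words in content.xml of an .odt file."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    parts = []
    _collect_text(root, parts)
    return len("".join(parts).split())
```

Walking the tree once, instead of looping over every paragraph element, avoids counting text twice when a paragraph contains a text box with paragraphs of its own.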
Answered By: karatekraft