How to speed up a Wordle backtesting function to run 13,000 words in Python?

Question:

I am trying to find the most optimal Wordle starter word. I created a function to determine how good a starter was by seeing how many words it would eliminate from a 13,000 word list.

For example, if my starter word was PLANE, I want to eliminate every single word from the dataset that contains the letters 'P', 'L', 'A', 'N', or 'E'. From that, I would see what % of words it eliminated, and the word with the highest % would be the optimal starter word.

So if you ran backtest(starter="plane"), it would return ('plane', '87.37%').
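In other words, the check boils down to something like this (just illustrating the rule, not my actual code; the names are made up):

starter = "plane"
word = "sword"
# a word is eliminated if it contains any letter from the starter
eliminated = any(letter in word for letter in starter)  # False for "sword", True for e.g. "crane"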

I am trying to run all 13,000 possible Wordle words through the backtester, but it's taking far too long in Python. How can I speed this up?

Creating List

words = []
with open('list.txt') as f:
    for line in f:
        words.append(line.strip())

Function

def backtest(starter):

    # list creation portion
    words = []
    with open('list.txt') as f:
        for line in f:
            words.append(line.strip())
    total = len(words)


    guess = starter
    result = "#####"
    
    tupleX = tuple(words)
    for word in tupleX:
        for i in range(5):
            
            if result[i] == "#" and guess[i] in word:
                words.remove(word)
                break
    
    pct = round(100 - (len(words)/total*100), 2)
    return guess, ("{}%".format(pct))

Backtest and append to dataframe

import pandas as pd

data = []

for i in words:
    hold = backtest(starter=i)
    data.append(hold)

bruteForce = pd.DataFrame(data, columns=['Word','Score'])
bruteForce = bruteForce.sort_values(by=['Score'], ascending=False)
bruteForce

I attempted removing the list creation portion from the backtesting function, but if I do that, it alters the original word list. When researching solutions, I came across multi-threading (which seems slightly too advanced for me, although I'm willing to give it a try), but I was wondering if anyone had any alternatives that I could attempt?
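For reference, that attempt looked roughly like this (passing the list in and calling remove on it directly); the remove call mutates the passed-in list in place, which is why the original list changes:

def backtest(starter, words):
    total = len(words)
    guess = starter
    result = "#####"

    for word in tuple(words):
        for i in range(5):
            if result[i] == "#" and guess[i] in word:
                words.remove(word)  # mutates the caller's list in place
                break

    pct = round(100 - (len(words)/total*100), 2)
    return guess, ("{}%".format(pct))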

Asked By: StayShmacked


Answers:

To calculate elimination rates you can use the built-in set data structure (note that Python 3.9+ is required for the type hints used below):

def read_words(path: str) -> list[str]:
    """Read the word list from a file"""
    with open(path) as file:
        return file.read().splitlines()
        

def calculate_elimination_rates(words: list[str]) -> dict[str, float]:
    """Calculate elimination rate for each word"""
    # "-1" because we don't count the word itself
    words_number = len(words) - 1
    # Pre-compute each word's set of letters
    words_letters = [set(word) for word in words]

    # Elimination rates
    rates = {}

    for word in words:
        current_letters = set(word)
    
        # count the other words that share at least one letter with this word
        rates[word] = (sum(
            1 for letters in words_letters
            if letters & current_letters
        ) - 1) / words_number
    
    return rates


def main() -> None:
    words = read_words("list.txt")
    rates = calculate_elimination_rates(words)
    print(rates)
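If you want a ranking rather than the raw dict, you could sort the rates before printing, for example (a small sketch building on the function above; top_starters is just an illustrative name):

def top_starters(rates: dict[str, float], n: int = 10) -> list[tuple[str, float]]:
    """Return the n words with the highest elimination rate."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:n]


# e.g. inside main():
# for word, rate in top_starters(rates):
#     print(f"{word}: {rate:.2%}")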

P.S. Your example is slow not only because the algorithm itself is slow, but also because each call of backtest re-reads the word list from the hard drive.

Answered By: eightlay

One of your biggest performance issues here is removing elements from words in backtest. Each remove has to search the list and then shift every element that comes after the removed one. You don't need to remove those elements at all; you only use the list to get its length. You can therefore speed up your algorithm very significantly by keeping track of how many words didn't fit, simply incrementing an integer rather than modifying the list. Then, when you return the result, you can use len(words) - removed, where removed is that counter. This alone can probably speed up the whole process by more than 10x!

Not modifying the list also lets you load it from the file only once, which speeds things up as well, although not as significantly as the first change. Also, is there any use for the result variable? You could remove it (along with the always-true check result[i] == "#") for another tiny performance boost; see the sketch after the code below.

I made small changes to your code to address these issues, and the effect is large. It is definitely not the fastest you can get, but it should finish within minutes and doesn't require changing the main idea of your algorithm.

def backtest(starter, words):
    total = len(words)

    guess = starter
    result = "#####"
    
    # track how many words your algorithm would remove
    removed = 0
    
    for word in words:
        for i in range(5):
            if result[i] == "#" and guess[i] in word:
                # instead of actually removing, increment this
                removed += 1
                break
    
    pct = round(100 - ((total - removed)/total*100), 2)
    return guess, ("{}%".format(pct))
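And here is roughly what the function looks like with result and the always-true check stripped out, as mentioned above (a sketch of that simplification, written with any() for brevity and not benchmarked):

def backtest(starter, words):
    total = len(words)
    removed = 0

    for word in words:
        # a word is eliminated if it shares any letter with the starter
        if any(letter in word for letter in starter):
            removed += 1

    pct = round(removed / total * 100, 2)
    return starter, "{}%".format(pct)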

main:

import pandas as pd
import time   # just to measure how long it runs

data = []
words = []
with open('list.txt') as f:
    for line in f:
        words.append(line.strip())

start = time.perf_counter()
j = 1

for i in words:
    hold = backtest(starter=i, words=words)
    data.append(hold)
    # just so you can see progress and timing
    if not j % 100:
        print(j, ": ", time.perf_counter() - start)
    j += 1

bruteForce = pd.DataFrame(data, columns=['Word','Score'])
bruteForce = bruteForce.sort_values(by=['Score'], ascending=False)
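One caveat: Score is stored as a string like '87.37%', so sorting it lexicographically can mis-rank values whose integer parts have different numbers of digits (e.g. '9.5%' would sort above '87.37%'). A small sketch of sorting on the numeric value instead (the ScoreNum column name is just illustrative):

# convert '87.37%' -> 87.37 so the sort is numeric rather than alphabetical
bruteForce['ScoreNum'] = bruteForce['Score'].str.rstrip('%').astype(float)
bruteForce = bruteForce.sort_values(by='ScoreNum', ascending=False)
print(bruteForce.head(10))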
Answered By: Filip Müller