What can I do to improve the performance of a simple string search and replace script?


I have a spreadsheet that contains two columns: the first is a column of strings that I need to search for, and the second is a column of strings that the first column needs to be replaced with. There are close to 4000 rows in this spreadsheet. An example of the data is shown below.

All of the strings in the "Tag Names" column are unique, however, there are some similarities – for example, e1diBC-B29hiTor, e1diBC-B29hiTorq, and e1diBC-B29hiTorqLim. That is, some strings can be strict subsets of others. I want to avoid inadvertently replacing a shorter version when a longer match is present, and I also want to be able to match these strings in a case-insensitive manner.
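To illustrate the subset problem (using a placeholder string `'REPLACEMENT'` rather than a real address), a naive replace of the shorter tag corrupts the longer one:

```python
line = 'e1diBC-B29hiTorqLim'
# Replacing the shorter tag first mangles the longer tag that contains it.
print(line.replace('e1diBC-B29hiTor', 'REPLACEMENT'))  # 'REPLACEMENTqLim'
```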

Tag Name                Address
e1diBC-B29DisSwt      ::[e1]mccE1:I.data[2].28
e1diBC-B29hiTor       ::[e1]Rack5:3:I.Data.3
e1diBC-B29hiTorq      ::[e1]Rack5:3:I.Data.4
e1diBC-B29hiTorqLim   ::[E1]BC_B29HiTorqueLimit
e1diBC-B29PlcRem      ::[e1]Rack5:3:I.Data.2
e1diBC-B29Run         ::[e1]Rack5:3:I.Data.0
e1diBC-B30DisSwt      ::[e1]mccE2:I.data[2].28
e1diBC-B30hiTor       ::[e1]Rack5:6:I.Data.3
e1diBC-B30hiTorq      ::[e1]Rack5:6:I.Data.4
e1diBC-B30PlcRem      ::[e1]Rack5:6:I.Data.2
e1diBC-B30Run         ::[e1]Rack5:6:I.Data.0
e1diBC-B32DisSwt      ::[E1]Rack5:1:I.Data.10
e1diBC-B32hiTor       ::[E1]Rack5:1:I.Data.13

I also have a little over 600 XML files that need to be searched for the above strings, with each occurrence replaced by its corresponding address.

As a first step, I wrote a little script that would search all of the XML files for all of the strings that I am wanting to replace and am logging the locations of those found strings. My logging script works, but it is horrendously slow (5ish hours to process 100 XML files). Implementing a replace routine would only slow things down further, so I clearly need to rethink how I’m handling this. What can I do to speed things up?

Edit: Another requirement of mine is that the replace routine will need to preserve the capitalization of the remainder of the files that are being searched, so converting everything to lowercase ultimately would not work in my case.

# Import required libs
import pandas as pd
import os
import openpyxl
from Trie import Trie
import logging

logging.basicConfig(filename='searchResults.log', level=logging.INFO, format='%(asctime)s %(message)s', datefmt='%m/%d/%Y %I:%M:%S %p')

# Load the hmi tags into a Trie data structure and the addresses into an array.
# The Trie accepts a (key, value) pair, where key is the tag and value is the
# index of the associated array.
df_HMITags = pd.read_excel('Tags.xlsx')
logging.info('Loaded excel file')
HMITags = Trie()
addresses = []
for i in df_HMITags.index:
    HMITags.insert(str(df_HMITags[' Tag Name'][i]).lower(), i)
    addresses.append(str(df_HMITags[' Address'][i]))

# Assign directory
directory = 'Graphics'

# Iterate over the files in the directory
for filename in os.listdir(directory):
    file = os.path.join(directory, filename)
    # Checking if it is a file
    if os.path.isfile(file):
        logging.info('Searching File: ' + str(filename))
        print('Searching File:', filename)
        # Open the file
        with open(file,'r') as fp:
            # Search the file, one line at a time.
            lines = fp.readlines()
            lineNumber = 1
            for line in lines:
                if lineNumber %10 == 0:
                    print('Searching line number:', lineNumber)
                #logging.debug('Searching Line: ' + str(lineNumber))
                #print('Searching Line:', lineNumber)
                # Convert to lower case, as this will simplify searching.
                lineLowered = line.lower()
                # Iterate through the line searching for various tags.
                searchString = ''
                potentialMatchFound = False
                charIndex = 0
                while charIndex < len(lineLowered):
                    #logging.debug('charIndex: ' + str(charIndex))
                    #print('charIndex = ', charIndex, '---------------------------------------')
                    searchString = searchString + lineLowered[charIndex]
                    searchResults = HMITags.query(searchString)
                    #if lineNumber == 2424:
                    ###print('searchString:', searchString)
                    ###print('searchResults length:', len(searchResults))
                    # If the first char being searched does not return any results, move on to the next char.
                    if len(searchResults) > 0:
                        potentialMatchFound = True
                        ###print('Potential Match Found:', potentialMatchFound)
                    elif len(searchResults) == 0 and potentialMatchFound:
                        ###print('Determining if exact match exists')
                        # Remove the last char from the string.
                        searchString = searchString[:-1]
                        searchResults = HMITags.query(searchString)
                        #Determine if an exact match exists in the search results
                        exactMatchFound = False
                        exactMatchIndex = 0
                        while exactMatchIndex < len(searchResults) and not exactMatchFound:
                            if searchString == searchResults[exactMatchIndex][0]:
                                exactMatchFound = True
                            exactMatchIndex = exactMatchIndex + 1
                        if exactMatchFound:
                            logging.info('Match Found! File: ' + str(filename) + ' Line Number: ' + str(lineNumber) + ' Column: ' + str(charIndex - len(searchString) + 1) + ' HMI Tag: ' + searchString)
                            print('Found:', searchString)
                            charIndex = charIndex - 1
                        else:
                            ###print('Not Found:', searchString)
                            charIndex = charIndex - len(searchString)
                        searchString = ''
                        potentialMatchFound = False
                    charIndex = charIndex + 1
                lineNumber = lineNumber + 1

And my Trie implementation:

class TrieNode:
    """A node in the trie structure"""

    def __init__(self, char):
        # the character stored in this node
        self.char = char

        # whether this can be the end of a key
        self.is_end = False
        # The value from the (key, value) pair that is to be stored.
        # (if this node's is_end is True)
        self.value = 0

        # a dictionary of child nodes
        # keys are characters, values are nodes
        self.children = {}
class Trie(object):
    """The trie object"""

    def __init__(self):
        """The trie has at least the root node.
        The root node does not store any character
        """
        self.root = TrieNode("")
    def insert(self, key, value):
        """Insert a key into the trie"""
        node = self.root
        # Loop through each character in the key
        # Check if there is no child containing the character, create a new child for the current node
        for char in key:
            if char in node.children:
                node = node.children[char]
            else:
                # If a character is not found,
                # create a new node in the trie
                new_node = TrieNode(char)
                node.children[char] = new_node
                node = new_node
        # Mark the end of a key
        node.is_end = True
        # Set the value from the (key, value) pair.
        node.value = value
    def dfs(self, node, prefix):
        """Depth-first traversal of the trie
            - node: the node to start with
            - prefix: the current prefix, for tracing a
                key while traversing the trie
        """
        if node.is_end:
            self.output.append((prefix + node.char, node.value))
        for child in node.children.values():
            self.dfs(child, prefix + node.char)
    def query(self, x):
        """Given an input (a prefix), retrieve all keys stored in
        the trie with that prefix, sorted by their stored values
        """
        # Use a variable within the class to keep all possible outputs
        # As there can be more than one key with such prefix
        self.output = []
        node = self.root
        # Check if the prefix is in the trie
        for char in x:
            if char in node.children:
                node = node.children[char]
            else:
                # cannot find the prefix, return empty list
                return []
        # Traverse the trie to get all candidates
        self.dfs(node, x[:-1])

        # Sort the results in reverse order and return
        return sorted(self.output, key = lambda x: x[1], reverse = True)
Asked By: kubiej21



I don’t have your actual data, but I created an (admittedly simple) test environment like so:

from random import choice, randint
from pathlib import Path
from string import ascii_letters

replace_table = [
    ('e1diBC-B29DisSwt', '::[e1]mccE1:I.data[2].28'),
    ('e1diBC-B29hiTor', '::[e1]Rack5:3:I.Data.3'),
    ('e1diBC-B29hiTorq', '::[e1]Rack5:3:I.Data.4'),
    ('e1diBC-B29hiTorqLim', '::[E1]BC_B29HiTorqueLimit'),
    ('e1diBC-B29PlcRem', '::[e1]Rack5:3:I.Data.2'),
    ('e1diBC-B29Run', '::[e1]Rack5:3:I.Data.0'),
    ('e1diBC-B30DisSwt', '::[e1]mccE2:I.data[2].28'),
    ('e1diBC-B30hiTor', '::[e1]Rack5:6:I.Data.3'),
    ('e1diBC-B30hiTorq', '::[e1]Rack5:6:I.Data.4'),
    ('e1diBC-B30PlcRem', '::[e1]Rack5:6:I.Data.2'),
    ('e1diBC-B30Run', '::[e1]Rack5:6:I.Data.0'),
    ('e1diBC-B32DisSwt', '::[E1]Rack5:1:I.Data.10'),
    ('e1diBC-B32hiTor', '::[E1]Rack5:1:I.Data.13'),
]

search_terms = [term for term, replacement in replace_table]
text = '\n'.join([
    choice(search_terms) if randint(0, 1) else ascii_letters
    for _ in range(300)
])

output_dir = Path('Graphics')
# Create the directory if it doesn't already exist
output_dir.mkdir(exist_ok=True)

for i in range(1, 601):
    file = output_dir / f'{i}.txt'
    file.write_text(text)

This gives us 600 files, each of which has the same contents: 300 lines, where each line is either one of your search terms or a string of letters.

Your code (after altering it slightly to read the search-and-replace values from a list of tuples rather than Excel file) runs in 17.93 seconds on my computer, with the simple test data.

The simplest tool for replacing portions of strings is the built-in replace method of strings. However, since you want to preserve capitalization in the rest of the file contents but match your terms case-insensitively, this becomes impractical, and we must resort to regular expressions. (In either event, we’ll sort the search terms by order of decreasing length, to avoid accidentally replacing only part of a longer term.)
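As a miniature illustration of why the length sort matters (this example is not from the original answer; `'REPL'` stands in for a real address): Python's `re` alternation tries branches left to right and takes the first one that matches at a position, not the longest.

```python
import re

# Shorter alternative listed first: re picks the first matching branch,
# so the longer tag is only partially replaced.
unsorted_pattern = re.compile('e1diBC-B29hiTor|e1diBC-B29hiTorqLim', re.IGNORECASE)
print(unsorted_pattern.sub('REPL', 'x e1diBC-B29hiTorqLim y'))  # 'x REPLqLim y'

# Longest alternative first: the full tag wins.
sorted_pattern = re.compile('e1diBC-B29hiTorqLim|e1diBC-B29hiTor', re.IGNORECASE)
print(sorted_pattern.sub('REPL', 'x e1diBC-B29hiTorqLim y'))    # 'x REPL y'
```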

from pathlib import Path
import re

replace_table = [
    ('e1diBC-B29DisSwt', '::[e1]mccE1:I.data[2].28'),
    ('e1diBC-B29hiTor', '::[e1]Rack5:3:I.Data.3'),
    ('e1diBC-B29hiTorq', '::[e1]Rack5:3:I.Data.4'),
    ('e1diBC-B29hiTorqLim', '::[E1]BC_B29HiTorqueLimit'),
    ('e1diBC-B29PlcRem', '::[e1]Rack5:3:I.Data.2'),
    ('e1diBC-B29Run', '::[e1]Rack5:3:I.Data.0'),
    ('e1diBC-B30DisSwt', '::[e1]mccE2:I.data[2].28'),
    ('e1diBC-B30hiTor', '::[e1]Rack5:6:I.Data.3'),
    ('e1diBC-B30hiTorq', '::[e1]Rack5:6:I.Data.4'),
    ('e1diBC-B30PlcRem', '::[e1]Rack5:6:I.Data.2'),
    ('e1diBC-B30Run', '::[e1]Rack5:6:I.Data.0'),
    ('e1diBC-B32DisSwt', '::[E1]Rack5:1:I.Data.10'),
    ('e1diBC-B32hiTor', '::[E1]Rack5:1:I.Data.13'),
]
replace_table.sort(key=lambda x: len(x[0]), reverse=True)

# Create a dictionary where the keys are the lowercase search terms, and
# the values are the replacements.
replace_dict = {
    term.lower(): replacement
    for term, replacement in replace_table
}

# Compile a case-insensitive regex pattern that matches any of the
# search terms.
pattern = re.compile(
    '|'.join([re.escape(term) for term in replace_dict]),
    re.IGNORECASE,
)

# Define a function that returns the proper replacement for a term,
# regardless of case.
def get_replacement(match):
    key = match.group().lower()
    return replace_dict[key]

source_dir = Path('Graphics')
output_dir = Path('output')
# Create the output directory if it doesn't already exist
output_dir.mkdir(exist_ok=True)

for file in source_dir.iterdir():
    text = file.read_text()
    text = pattern.sub(get_replacement, text)
    output_file = output_dir / file.name
    output_file.write_text(text)

This reads, alters, and resaves all 600 files in 0.11 seconds. That is so much faster that it suggests to me you may not need a more complex implementation to shave off additional time. That said, your results may differ if your files and/or list of search terms are sufficiently long.

Answered By: CrazyChucky