Split list of string into groups of "similar string"

Question:

I have a list of string (about 300 strings).
Some of those strings start with the same characters.

For example:

  • "ACTUATOR_alim5V0ResetErrorRatio_1",
  • "ACTUATOR_alim5V0ResetErrorRatio_2",
  • "ACTUATOR_alim5V0ResetErrorSensors_1",
  • "SENSOR_inputPwm2DutyCycle",
  • "SENSOR_inputPwm2DigitalLevel",
  • "SENSOR_inputPwm2RisingEdgeCounter",
  • etc…

I want to group strings starting with the same characters.
For example:

  • "ACTUATOR_alim5V0ResetErrorRatio_1",
  • "ACTUATOR_alim5V0ResetErrorRatio_2",
  • "ACTUATOR_alim5V0ResetErrorSensors_1"

will belong to the "ACTUATOR_alim5V0ResetError" group.

In this example, all the strings which belong to the group start with "ACTUATOR_alim5V0ResetError", thus
the name of the group is "ACTUATOR_alim5V0ResetError".

Each group must not contain more than a predefined number of string (MAX_NUMBER_STRINGS_PER_GROUP).

I am looking for an algorithm that could solve this problem.
I have to translate this algorithm into Python.

Any help would be useful.

Answers:

I think this should do it. Just add your strings to the array and the dictionary should sort itself out. 🙂

Put the strings you want to assign into the AssignString Function and you can find them in StringDescriptorGroups when you need to.

StringDescriptors = [
"ACTUATOR_alim5V0ResetErrorRatio_1",
"ACTUATOR_alim5V0ResetErrorRatio_2",
"ACTUATOR_alim5V0ResetErrorSensors_1"
# . . .         < You can add more but here are the first few (they dont need to be in order)
]

StringDescriptorGroups = {}
# Here we set up our dictionary to assign these strings to their group via 
substring

for i in StringDescriptors:
    StringDescriptorGroups.update( {i : []} )
    #Create dictionary from string descriptors with empty arrays to assign these 
    inputs


def AssignString(s):
# Input an arbitrary string s
    for i in StringDescriptors:
     # Go trough each one in a linear fashion
     if i in s:
        StringDescriptorGroups[i].append(s)
        #Add to group if it contains descriptor
        return True
    # Work is done so return True
return False
#Couldn't find so return False
Answered By: Eden

I testing the trie algorithm, thanks to the georg’s advice:

trie.py

# Script Name   : trie.py
# Author        : The_Average_Engineer
# Created       : 26th April 2021
# Description   : Prefix tree
# useful link   : https://en.wikipedia.org/wiki/Trie
# useful link   : https://towardsdatascience.com/implementing-a-trie-data-structure-in-python-in-less-than-100-lines-of-code-a877ea23c1a1


class TrieNode(object):
    """
    Trie node implementation.
    """
    def __init__(self, char, parent):
        self.char = char
        self.parent = parent
        self.children = []
        # Is it the last character of the string.`
        self.is_string_finished = False
        # How many times this character appeared in the addition process
        self.counter = 1


def add(node, string):
    """
    Adding a string in the trie structure
    """
    for char in string:
        found_in_child = False
        # Search for the character in the children of the present `node`
        for child in node.children:
            if child.char == char:
                # We found it, increase the counter by 1 to keep track that another
                # word has it as well
                child.counter += 1
                # And point the node to the child that contains this char
                node = child
                found_in_child = True
                break
        # We did not find it so add a new child
        if not found_in_child:
            new_node = TrieNode(char, node)
            node.children.append(new_node)
            # And then point node to the new child
            node = new_node
    # Everything finished. Mark it as the end of a word.
    node.is_string_finished = True


def chunk_into_clusters(node, threshold, clusters_nodes):
    """
    Chunk trie into clusters.
    threshold is maximum number of string a cluster can contain.
    clusters_nodes is the output/resulting list of nodes.
    """
    for child in node.children:
        if (child.counter < threshold) and (len(child.children) > 1):
            clusters_nodes.append(child)
        elif child.is_string_finished:
            clusters_nodes.append(child)
        else:
            chunk_into_clusters(child, threshold, clusters_nodes)  # recursive call


def find_end_nodes(node, end_nodes):
    """
    Find nodes which end strings (nodes having the attribute "is_string_finished" set to True).
    threshold is the maximum number of string a cluster can contain.
    clusters_nodes is the output/resulting list of nodes.
    """
    for child in node.children:
        if child.is_string_finished:
            end_nodes.append(child)
        find_end_nodes(child, end_nodes)  # recursive call


def retrieve_string(node):
    """
    retrieve string from node.
    """
    string = ""
    while node is not None:  # while node is not root
        string = node.char + string
        node = node.parent
    return string

main.py

# Script Name   : main.py
# Author        : The_Average_Engineer
# Created       : 25th April 2021
# Description   : Testing string list clustering method

import string_list
import trie


if __name__ == '__main__':
    r"""
        script entry point
    """
    # max number of strings per cluster
    MAX_NB_STRINGS_PER_CLUSTER = 30

    # result dict
    clusters = {}

    # add strings to trie
    root = trie.TrieNode('', None)
    for string in string_list.strings:
        trie.add(root, string)

    # get clusters from trie
    clusters_nodes = []
    trie.chunk_into_clusters(root, MAX_NB_STRINGS_PER_CLUSTER, clusters_nodes)

    # get strings associated with each clusters nodes
    for cluster_node in clusters_nodes:
        cluster_node_string = trie.retrieve_string(cluster_node)

        clusters[cluster_node_string] = []

        # get strings contained in each cluster
        end_nodes = []
        trie.find_end_nodes(cluster_node, end_nodes)

        if cluster_node.is_string_finished:
            clusters[cluster_node_string].append(cluster_node_string)

        for end_node in end_nodes:
            end_node_string = trie.retrieve_string(end_node)
            clusters[cluster_node_string].append(end_node_string)

    # print results
    for cluster_name, cluster_strings in clusters.items():
        print("n{}:".format(cluster_name))
        for string in cluster_strings:
            print("t{}".format(string))

Output

SENSOR_inDigitalPD:
    SENSOR_inDigitalPD1Lvl
    SENSOR_inDigitalPD2Lvl

SENSOR_inputPwm1:
    SENSOR_inputPwm1FrequencyMeasure
    SENSOR_inputPwm1FrequencyMeasureValid
    SENSOR_inputPwm1FallingEdgeCounter
    SENSOR_inputPwm1DutyCycle
    SENSOR_inputPwm1DigitalLevel
    SENSOR_inputPwm1RisingEdgeCounter

SENSOR_inputPwm2:
    SENSOR_inputPwm2FrequencyMeasure
    SENSOR_inputPwm2FrequencyMeasureValid
    SENSOR_inputPwm2FallingEdgeCounter
    SENSOR_inputPwm2DutyCycle
    SENSOR_inputPwm2DigitalLevel
    SENSOR_inputPwm2RisingEdgeCounter

etc...

It seems to work pretty well.
If some of you have ideas to simplify this implementation, I’m more than interested !