Create cartesian product of strings by replacing substrings from a list

Question

I have a dictionary with placeholders and their possible list of values, as shown below:

{
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
    # and so on ...
}

I want to create all possible combinations of strings by replacing the placeholders (i.e. ~GPE~ and ~PERSON~) from the template:

"My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year".

Expected output is:

"My name is John Davies. I travel to UK with Tom Banton every year."
"My name is John Davies. I travel to UK with Joe Morgan every year."
"My name is John Davies. I travel to USA with Tom Banton every year."
"My name is John Davies. I travel to USA with Joe Morgan every year."
"My name is Tom Banton. I travel to UK with John Davies every year."
"My name is Tom Banton. I travel to UK with Joe Morgan every year."
"My name is Tom Banton. I travel to USA with John Davies every year."
"My name is Tom Banton. I travel to USA with Joe Morgan every year."
"My name is Joe Morgan. I travel to UK with Tom Banton every year."
"My name is Joe Morgan. I travel to UK with John Davies every year."
"My name is Joe Morgan. I travel to USA with Tom Banton every year."
"My name is Joe Morgan. I travel to USA with John Davies every year."

Also notice how the values corresponding to a key in the dictionary do not repeat in the same sentence. e.g. I do not want: "My name is Joe Morgan. I travel to USA with Joe Morgan every year." (so not exactly cartesian product, but close enough)

I am new to python and experimenting with the re module, but could not find a solution to this problem.

EDIT

The main problem I am facing is replacing string causes the length to change, which makes subsequent modifications to the string difficult. This is especially due to possibility of multiple instances of same placeholder in the string. Below is a snippet to elaborate more:

label_dict = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}


template = "My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."

for label in label_dict.keys():
    modified_string = template
    offset = 0
    for match in re.finditer(r'{}'.format(label), template):
        for label_text in label_dict.get(label, []):
            start, end = match.start() + offset, match.end() + offset
            offset += (len(label_text) - (end - start))
#             print ("Match was found at {start}-{end}: {match}".format(start = start, end = end, match = match.group()))
            modified_string = modified_string[: start] + label_text + modified_string[end: ]
            print(modified_string)

Gives the incorrect output as:

My name is ~PERSON~. I travel to UK with ~PERSON~ every year.
My name is ~PERSON~. I travel USA with ~PERSON~ every year.
My name is John Davies. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohTom Banton. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with John Davies every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with JohTom Banton every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with JohToJoe Morgan every year.

Asked By: Japun Japun

||

Source

Answer 1

Here are two ways, well three if you include the new code I added a moment ago, you could do it, they all produce the desired output.

Nested Loops

data_in ={
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}

data_out = []
for gpe in data_in['~GPE~']:
    for person1 in data_in['~PERSON~']:
        for person2 in data_in['~PERSON~']:
            if person1 != person2: 
                data_out.append(f'My name is {person1}. I travel to {gpe} with {person2} every year.')

print('n'.join(data_out))

List Comprehension

data_in ={
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}

data_out = [f'My name is {person1}. I travel to {gpe} with {person2} every year.' for gpe in data_in['~GPE~'] for person1 in data_in['~PERSON~'] for person2 in data_in['~PERSON~'] if person1!=person2]

print('n'.join(data_out))

Using merge from Pandas

Note, this code required Pandas version 1.2 or above.

import pandas as pd

data = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
    # and so on ...
}

country = pd.DataFrame({'country':data['~GPE~']})
person = pd.DataFrame({'person':data['~PERSON~']})

cart = country.merge(person, how='cross').merge(person, how='cross')

cart.columns = ['country', 'person1', 'person2']

cart = cart.query('person1 != person2').reset_index()

cart['sentence'] = cart.apply(lambda row: f"My name is {row['person1']}. I travel to {row['country']} with {row['person2']} every year." , axis=1)

sentences = cart['sentence'].to_list()

print('n'.join(sentences))

Answered By: norie

Create cartesian product of strings by replacing substrings from a list

Question:

Answers:

Nested Loops

List Comprehension

Using merge from Pandas