String Compression in Python
Question:
I have the following input:
my_list = ["x d1","y d1","z d2","t d2"]
And I would like to transform it into:
Expected_result = ["d1(x,y)","d2(z,t)"]
I resorted to brute force and pandas, since I didn't find a way to do it in plain Python. Is there another way to solve this?
import pandas as pd
my_list = ["x d1","y d1","z d2","t d2"]
df = pd.DataFrame(my_list,columns=["col1"])
df2 = df["col1"].str.split(" ",expand = True)
df2.columns = ["col1","col2"]
grp = df2.groupby("col2")  # a plain string key keeps grp_name a string (a one-element list yields tuple keys in pandas 2.x)
result = []
for grp_name, data in grp:
    res = grp_name + "(" + ",".join(list(data["col1"])) + ")"
    result.append(res)
print(result)
Answers:
- The code defines an empty dictionary.
- It then iterates over each item in your list and uses the split() method to split the item into a key (the first token, e.g. x) and a value (the second token, e.g. d1).
- It then uses the setdefault() method to group the keys by value: the value (d1, d2 and so on) becomes the dictionary key, and the key is appended to that entry's list. If the value is not yet in the dictionary, setdefault() first creates an entry with an empty list, so the key becomes the first element of the new list.
- Finally, the list comprehension iterates over the items in the dictionary and builds a string for each entry, using the join() method to concatenate the list of keys into a single string.
result = {}
for item in my_list:
    key, value = item.split()
    result.setdefault(value, []).append(key)
output = [f"{k}({', '.join(v)})" for k, v in result.items()]
print(output)
['d1(x, y)', 'd2(z, t)']
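Note that joining with ', ' yields 'd1(x, y)' with a space, whereas the question asks for 'd1(x,y)'. Below is a minimal sketch of the same idea wrapped in a function, assuming collections.defaultdict in place of setdefault() and a plain ',' separator. The function name group_items and the sep parameter are made up for illustration.

```python
from collections import defaultdict


def group_items(items, sep=","):
    """Group "value key" strings by key, keeping first-seen key order.

    group_items and sep are illustrative names, not part of the answer above.
    """
    groups = defaultdict(list)
    for item in items:
        value, key = item.split()  # assumes exactly two whitespace-separated fields
        groups[key].append(value)
    # dicts preserve insertion order (Python 3.7+), so keys come out in first-seen order
    return [f"{key}({sep.join(values)})" for key, values in groups.items()]


print(group_items(["x d1", "y d1", "z d2", "t d2"]))
# ['d1(x,y)', 'd2(z,t)']
print(group_items(["x d1", "z d2", "y d1"]))
# ['d1(x,y)', 'd2(z)']
```

Because the grouping lives in a dictionary, the input does not need to be sorted or contiguous.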
my_list = ["x d1","y d1","z d2","t d2"]
res = []
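for item in my_list:
    a, b, *_ = item.split()
    # Compare against the key prefix; a plain substring test would also match 'd1' inside 'd11(...)'
    if res and res[-1].startswith(f'{b}('):
        res[-1] = res[-1].replace(')', f',{a})')
    else:
        res.append(f'{b}({a})')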
print(res)
['d1(x,y)', 'd2(z,t)']
Let N be the number that follows d. This code works for any number of elements per dN and for any value of N, as long as the input is ordered by N: all d1 entries come first, then all d2 entries, then d3, and so on. The values themselves can be any strings, provided each item has the form "val_in_dN dN", with the value first and dN second.
If you need something that works even when the dN entries are not in sequence, just say the word, but it will cost a little more.
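To make the ordering requirement concrete, here is a small sketch. It wraps the same append-to-last idea in a hypothetical function, merge_adjacent (my name, not the answer's), and keeps (key, values) pairs instead of editing strings. When the entries for a key are not adjacent, that key shows up more than once in the result.

```python
def merge_adjacent(items):
    """Merge consecutive "value key" entries that share a key.

    Hypothetical helper for illustration; only adjacent entries are merged.
    """
    runs = []  # list of (key, [values]) in input order
    for item in items:
        value, key = item.split()  # assumes exactly two fields per item
        if runs and runs[-1][0] == key:
            runs[-1][1].append(value)
        else:
            runs.append((key, [value]))
    return [f"{key}({','.join(values)})" for key, values in runs]


print(merge_adjacent(["x d1", "y d1", "z d2", "t d2"]))
# ['d1(x,y)', 'd2(z,t)']
print(merge_adjacent(["x d1", "z d2", "y d1"]))
# ['d1(x)', 'd2(z)', 'd1(y)']
```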
If your values are already sorted by key (d1, d2), you can use itertools.groupby:
from itertools import groupby
out = [f"{k}({','.join(x[0] for x in g)})"
       for k, g in groupby(map(str.split, my_list), lambda x: x[1])]
Output:
['d1(x,y)', 'd2(z,t)']
Otherwise you should use a dictionary as shown by @Jamiu.
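If the input is not grouped yet, another option besides the dictionary is to sort by key first and then apply groupby. This is a sketch, assuming a plain string sort of the keys is acceptable (it puts 'd10' before 'd2'). The sample input and variable names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

my_list = ["z d2", "x d1", "t d2", "y d1"]  # deliberately unsorted sample input

# Split each item into [value, key], then sort by key.
# sorted() is stable, so values keep their original relative order within a key.
pairs = sorted(map(str.split, my_list), key=itemgetter(1))

out = [f"{k}({','.join(value for value, _ in g)})"
       for k, g in groupby(pairs, key=itemgetter(1))]
print(out)
# ['d1(x,y)', 'd2(z,t)']
```

Sorting costs O(n log n), while the dictionary approach makes a single pass over the input, so the dictionary is usually the simpler choice when the output order of the keys does not matter.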
A variant of your pandas solution:
out = (df['col1'].str.split(n=1, expand=True)
         .groupby(1)[0]
         .apply(lambda g: f"{g.name}({','.join(g)})")
         .tolist()
      )
Another possible solution, which is based on pandas:
import numpy as np
import pandas as pd
(pd.DataFrame(np.array([str.split(x, ' ') for x in my_list]), columns=['b', 'a'])
 .groupby('a')['b'].apply(lambda x: f'({x.values[0]}, {x.values[1]})')
 .reset_index().sum(axis=1).tolist())
Output:
['d1(x, y)', 'd2(z, t)']
EDIT
The OP, @ShckTchamna, asked for a more general version of the solution above. This edit provides one that works with the example the OP gave in a comment, where groups can have any number of elements.
my_list = ["x d1","y d1","z d2","t d2","kk d2","m d3", "n d3", "s d4"]
(pd.DataFrame(np.array([str.split(x, ' ') for x in my_list]), columns=['b', 'a'])
 .groupby('a')['b'].apply(lambda x: f'({",".join(x.values)})')
 .reset_index().sum(axis=1).tolist())
Output:
['d1(x,y)', 'd2(z,t,kk)', 'd3(m,n)', 'd4(s)']
import pandas as pd
df = pd.DataFrame(data=[e.split(' ') for e in ["x d1","y d1","z d2","t d2"]])
r = (df.groupby(1)
       .apply(lambda r:"{0}({1},{2})".format(r.iloc[0,1], r.iloc[0,0], r.iloc[1,0]))
       .reset_index()
       .rename({1:"points", 0:"coordinates"}, axis=1)
    )
print(r.coordinates.tolist())
# ['d1(x,y)', 'd2(z,t)']
print(r)
#   points coordinates
# 0     d1     d1(x,y)
# 1     d2     d2(z,t)
As a replacement for my previous answer (which also works):
import itertools as it
my_list = [e.split(' ') for e in ["x d1","y d1","z d2","t d2"]]
r=[]
for key, group in it.groupby(my_list, lambda x: x[1]):
    l = [e[0] for e in group]
    # join() handles groups of any size, not just pairs
    r.append("{0}({1})".format(key, ",".join(l)))
print(r)
Output :
['d1(x,y)', 'd2(z,t)']