Python: Rename duplicates in list with progressive numbers without sorting list
Question:
Given a list like this:
mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
I would like to rename the duplicates by appending a number to get the following result:
mylist = ["name1", "state", "name2", "city", "name3", "zip1", "zip2"]
I do not want to change the order of the original list. The solutions suggested for this related Stack Overflow question sorts the list, which I do not want to do.
Answers:
You can use hashtable to solve this problem. Define a dictionary d. key is the string and value is (first_time_index_in_the_list, times_of_appearance). Everytime when you see a word, just check the dictionary, and if the value is 2, use first_time_index_in_the_list to append ‘1’ to the first element, and append times_of_appearance to current element. If greater than 2, just append times_of_appearance to current element.
My solution with map
and lambda
:
print map(lambda x: x[1] + str(mylist[:x[0]].count(x[1]) + 1) if mylist.count(x[1]) > 1 else x[1], enumerate(mylist))
More traditional form
newlist = []
for i, v in enumerate(mylist):
totalcount = mylist.count(v)
count = mylist[:i].count(v)
newlist.append(v + str(count + 1) if totalcount > 1 else v)
And last one
[v + str(mylist[:i].count(v) + 1) if mylist.count(v) > 1 else v for i, v in enumerate(mylist)]
Any method where count
is called on each element is going to result in O(n^2)
since count
is O(n)
. You can do something like this:
# not modifying original list
from collections import Counter
mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
counts = {k:v for k,v in Counter(mylist).items() if v > 1}
newlist = mylist[:]
for i in reversed(range(len(mylist))):
item = mylist[i]
if item in counts and counts[item]:
newlist[i] += str(counts[item])
counts[item]-=1
print(newlist)
# ['name1', 'state', 'name2', 'city', 'name3', 'zip1', 'zip2']
# modifying original list
from collections import Counter
mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
counts = {k:v for k,v in Counter(mylist).items() if v > 1}
for i in reversed(range(len(mylist))):
item = mylist[i]
if item in counts and counts[item]:
mylist[i] += str(counts[item])
counts[item]-=1
print(mylist)
# ['name1', 'state', 'name2', 'city', 'name3', 'zip1', 'zip2']
This should be O(n)
.
Other provided answers:
mylist.index(s)
per element causes O(n^2)
mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
from collections import Counter
counts = Counter(mylist)
for s,num in counts.items():
if num > 1:
for suffix in range(1, num + 1):
mylist[mylist.index(s)] = s + str(suffix)
count(x[1])
per element causes O(n^2)
It is also used multiple times per element along with list slicing.
print map(lambda x: x[1] + str(mylist[:x[0]].count(x[1]) + 1) if mylist.count(x[1]) > 1 else x[1], enumerate(mylist))
Benchmarks:
Less fancy stuff.
from collections import defaultdict
mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
finalList = []
dictCount = defaultdict(int)
anotherDict = defaultdict(int)
for t in mylist:
anotherDict[t] += 1
for m in mylist:
dictCount[m] += 1
if anotherDict[m] > 1:
finalList.append(str(m)+str(dictCount[m]))
else:
finalList.append(m)
print finalList
Here’s a very simple O(n)
solution. Simply walk the list storing the index of element in the list. If we’ve seen this element before, use the stored data earlier to append the occurrence value.
This approach solves the problem with just creating one more dictionary for look-back. Avoids doing look-ahead so that we don’t create temporary list slices.
mylist = ["name", "state", "name", "city", "city", "name", "zip", "zip", "name"]
dups = {}
for i, val in enumerate(mylist):
if val not in dups:
# Store index of first occurrence and occurrence value
dups[val] = [i, 1]
else:
# Special case for first occurrence
if dups[val][1] == 1:
mylist[dups[val][0]] += str(dups[val][1])
# Increment occurrence value, index value doesn't matter anymore
dups[val][1] += 1
# Use stored occurrence value
mylist[i] += str(dups[val][1])
print mylist
# ['name1', 'state', 'name2', 'city1', 'city2', 'name3', 'zip1', 'zip2', 'name4']
This is how I would do it. EDIT: I wrote this into a more generalized utility function since people seem to like this answer.
mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
check = ["name1", "state", "name2", "city", "name3", "zip1", "zip2"]
copy = mylist[:] # so we will only mutate the copy in case of failure
from collections import Counter # Counter counts the number of occurrences of each item
from itertools import tee, count
def uniquify(seq, suffs = count(1)):
"""Make all the items unique by adding a suffix (1, 2, etc).
`seq` is mutable sequence of strings.
`suffs` is an optional alternative suffix iterable.
"""
not_unique = [k for k,v in Counter(seq).items() if v>1] # so we have: ['name', 'zip']
# suffix generator dict - e.g., {'name': <my_gen>, 'zip': <my_gen>}
suff_gens = dict(zip(not_unique, tee(suffs, len(not_unique))))
for idx,s in enumerate(seq):
try:
suffix = str(next(suff_gens[s]))
except KeyError:
# s was unique
continue
else:
seq[idx] += suffix
uniquify(copy)
assert copy==check # raise an error if we failed
mylist = copy # success
If you wanted to append an underscore before each count, you could do something like this:
>>> mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
>>> uniquify(mylist, (f'_{x!s}' for x in range(1, 100)))
>>> mylist
['name_1', 'state', 'name_2', 'city', 'name_3', 'zip_1', 'zip_2']
…or if you wanted to use letters instead:
>>> mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
>>> import string
>>> uniquify(mylist, (f'_{x!s}' for x in string.ascii_lowercase))
>>> mylist
['name_a', 'state', 'name_b', 'city', 'name_c', 'zip_a', 'zip_b']
NOTE: this is not the fastest possible algorithm; for that, refer to the answer by ronakg. The advantage of the function above is it is easy to understand and read, and you’re not going to see much of a performance difference unless you have an extremely large list.
EDIT: Here is my original answer in a one-liner, however the order is not preserved and it uses the .index
method, which is extremely suboptimal (as explained in the answer by DTing). See the answer by queezz for a nice ‘two-liner’ that preserves order.
[s + str(suffix) if num>1 else s for s,num in Counter(mylist).items() for suffix in range(1, num+1)]
# Produces: ['zip1', 'zip2', 'city', 'state', 'name1', 'name2', 'name3']
A list comprehension version of the Rick Teachey answer, “two-liner”:
from collections import Counter
m = ["name", "state", "name", "city", "name", "zip", "zip"]
d = {a:list(range(1, b+1)) if b>1 else '' for a,b in Counter(m).items()}
[i+str(d[i].pop(0)) if len(d[i]) else i for i in m]
#['name1', 'state', 'name2', 'city', 'name3', 'zip1', 'zip2']
Beware of updated values that already exist in the original list
If the starting list already includes an item "name2"
…
mylist = ["name", "state", "name", "city", "name", "zip", "zip", "name2"]
…then mylist[2]
shouldn’t be updated to "name2"
when the function runs, otherwise a new duplicate will be created; instead, the function should jump to the next available item name "name3"
.
mylist_updated = ['name1', 'state', 'name3', 'city', 'name4', 'zip1', 'zip2', 'name2']
Here’s an alternate solution (can probably be shortened and optimized) which includes a recursive function that checks for these existing items in the original list.
mylist = ["name", "state", "name", "city", "name", "zip", "zip", "name2"]
def fix_dups(mylist, sep='', start=1, update_first=True):
mylist_dups = {}
#build dictionary containing val: [occurrences, suffix]
for val in mylist:
if val not in mylist_dups:
mylist_dups[val] = [1, start - 1]
else:
mylist_dups[val][0] += 1
#define function to update duplicate values with suffix, check if updated value already exists
def update_val(val, num):
temp_val = sep.join([str(x) for x in [val, num]])
if temp_val not in mylist_dups:
return temp_val, num
else:
num += 1
return update_val(val, num)
#update list
for i, val in enumerate(mylist):
if mylist_dups[val][0] > 1:
mylist_dups[val][1] += 1
if update_first or mylist_dups[val][1] > start:
new_val, mylist_dups[val][1] = update_val(val, mylist_dups[val][1])
mylist[i] = new_val
return mylist
mylist_updated = fix_dups(mylist, sep='', start=1, update_first=True)
print(mylist_updated)
#['name1', 'state', 'name3', 'city', 'name4', 'zip1', 'zip2', 'name2']
In case you don’t want to change the first occurrence.
mylist = ["name", "state", "name", "city", "name", "zip", "zip", "name_2"]
mylist_updated = fix_dups(mylist, sep='_', start=0, update_first=False)
print(mylist_updated)
#['name', 'state', 'name_1', 'city', 'name_3', 'zip', 'zip_1', 'name_2']
Given a list like this:
mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
I would like to rename the duplicates by appending a number to get the following result:
mylist = ["name1", "state", "name2", "city", "name3", "zip1", "zip2"]
I do not want to change the order of the original list. The solutions suggested for this related Stack Overflow question sorts the list, which I do not want to do.
You can use hashtable to solve this problem. Define a dictionary d. key is the string and value is (first_time_index_in_the_list, times_of_appearance). Everytime when you see a word, just check the dictionary, and if the value is 2, use first_time_index_in_the_list to append ‘1’ to the first element, and append times_of_appearance to current element. If greater than 2, just append times_of_appearance to current element.
My solution with map
and lambda
:
print map(lambda x: x[1] + str(mylist[:x[0]].count(x[1]) + 1) if mylist.count(x[1]) > 1 else x[1], enumerate(mylist))
More traditional form
newlist = []
for i, v in enumerate(mylist):
totalcount = mylist.count(v)
count = mylist[:i].count(v)
newlist.append(v + str(count + 1) if totalcount > 1 else v)
And last one
[v + str(mylist[:i].count(v) + 1) if mylist.count(v) > 1 else v for i, v in enumerate(mylist)]
Any method where count
is called on each element is going to result in O(n^2)
since count
is O(n)
. You can do something like this:
# not modifying original list
from collections import Counter
mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
counts = {k:v for k,v in Counter(mylist).items() if v > 1}
newlist = mylist[:]
for i in reversed(range(len(mylist))):
item = mylist[i]
if item in counts and counts[item]:
newlist[i] += str(counts[item])
counts[item]-=1
print(newlist)
# ['name1', 'state', 'name2', 'city', 'name3', 'zip1', 'zip2']
# modifying original list
from collections import Counter
mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
counts = {k:v for k,v in Counter(mylist).items() if v > 1}
for i in reversed(range(len(mylist))):
item = mylist[i]
if item in counts and counts[item]:
mylist[i] += str(counts[item])
counts[item]-=1
print(mylist)
# ['name1', 'state', 'name2', 'city', 'name3', 'zip1', 'zip2']
This should be O(n)
.
Other provided answers:
mylist.index(s)
per element causes O(n^2)
mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
from collections import Counter
counts = Counter(mylist)
for s,num in counts.items():
if num > 1:
for suffix in range(1, num + 1):
mylist[mylist.index(s)] = s + str(suffix)
count(x[1])
per element causes O(n^2)
It is also used multiple times per element along with list slicing.
print map(lambda x: x[1] + str(mylist[:x[0]].count(x[1]) + 1) if mylist.count(x[1]) > 1 else x[1], enumerate(mylist))
Benchmarks:
Less fancy stuff.
from collections import defaultdict
mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
finalList = []
dictCount = defaultdict(int)
anotherDict = defaultdict(int)
for t in mylist:
anotherDict[t] += 1
for m in mylist:
dictCount[m] += 1
if anotherDict[m] > 1:
finalList.append(str(m)+str(dictCount[m]))
else:
finalList.append(m)
print finalList
Here’s a very simple O(n)
solution. Simply walk the list storing the index of element in the list. If we’ve seen this element before, use the stored data earlier to append the occurrence value.
This approach solves the problem with just creating one more dictionary for look-back. Avoids doing look-ahead so that we don’t create temporary list slices.
mylist = ["name", "state", "name", "city", "city", "name", "zip", "zip", "name"]
dups = {}
for i, val in enumerate(mylist):
if val not in dups:
# Store index of first occurrence and occurrence value
dups[val] = [i, 1]
else:
# Special case for first occurrence
if dups[val][1] == 1:
mylist[dups[val][0]] += str(dups[val][1])
# Increment occurrence value, index value doesn't matter anymore
dups[val][1] += 1
# Use stored occurrence value
mylist[i] += str(dups[val][1])
print mylist
# ['name1', 'state', 'name2', 'city1', 'city2', 'name3', 'zip1', 'zip2', 'name4']
This is how I would do it. EDIT: I wrote this into a more generalized utility function since people seem to like this answer.
mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
check = ["name1", "state", "name2", "city", "name3", "zip1", "zip2"]
copy = mylist[:] # so we will only mutate the copy in case of failure
from collections import Counter # Counter counts the number of occurrences of each item
from itertools import tee, count
def uniquify(seq, suffs = count(1)):
"""Make all the items unique by adding a suffix (1, 2, etc).
`seq` is mutable sequence of strings.
`suffs` is an optional alternative suffix iterable.
"""
not_unique = [k for k,v in Counter(seq).items() if v>1] # so we have: ['name', 'zip']
# suffix generator dict - e.g., {'name': <my_gen>, 'zip': <my_gen>}
suff_gens = dict(zip(not_unique, tee(suffs, len(not_unique))))
for idx,s in enumerate(seq):
try:
suffix = str(next(suff_gens[s]))
except KeyError:
# s was unique
continue
else:
seq[idx] += suffix
uniquify(copy)
assert copy==check # raise an error if we failed
mylist = copy # success
If you wanted to append an underscore before each count, you could do something like this:
>>> mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
>>> uniquify(mylist, (f'_{x!s}' for x in range(1, 100)))
>>> mylist
['name_1', 'state', 'name_2', 'city', 'name_3', 'zip_1', 'zip_2']
…or if you wanted to use letters instead:
>>> mylist = ["name", "state", "name", "city", "name", "zip", "zip"]
>>> import string
>>> uniquify(mylist, (f'_{x!s}' for x in string.ascii_lowercase))
>>> mylist
['name_a', 'state', 'name_b', 'city', 'name_c', 'zip_a', 'zip_b']
NOTE: this is not the fastest possible algorithm; for that, refer to the answer by ronakg. The advantage of the function above is it is easy to understand and read, and you’re not going to see much of a performance difference unless you have an extremely large list.
EDIT: Here is my original answer in a one-liner, however the order is not preserved and it uses the .index
method, which is extremely suboptimal (as explained in the answer by DTing). See the answer by queezz for a nice ‘two-liner’ that preserves order.
[s + str(suffix) if num>1 else s for s,num in Counter(mylist).items() for suffix in range(1, num+1)]
# Produces: ['zip1', 'zip2', 'city', 'state', 'name1', 'name2', 'name3']
A list comprehension version of the Rick Teachey answer, “two-liner”:
from collections import Counter
m = ["name", "state", "name", "city", "name", "zip", "zip"]
d = {a:list(range(1, b+1)) if b>1 else '' for a,b in Counter(m).items()}
[i+str(d[i].pop(0)) if len(d[i]) else i for i in m]
#['name1', 'state', 'name2', 'city', 'name3', 'zip1', 'zip2']
Beware of updated values that already exist in the original list
If the starting list already includes an item "name2"
…
mylist = ["name", "state", "name", "city", "name", "zip", "zip", "name2"]
…then mylist[2]
shouldn’t be updated to "name2"
when the function runs, otherwise a new duplicate will be created; instead, the function should jump to the next available item name "name3"
.
mylist_updated = ['name1', 'state', 'name3', 'city', 'name4', 'zip1', 'zip2', 'name2']
Here’s an alternate solution (can probably be shortened and optimized) which includes a recursive function that checks for these existing items in the original list.
mylist = ["name", "state", "name", "city", "name", "zip", "zip", "name2"]
def fix_dups(mylist, sep='', start=1, update_first=True):
mylist_dups = {}
#build dictionary containing val: [occurrences, suffix]
for val in mylist:
if val not in mylist_dups:
mylist_dups[val] = [1, start - 1]
else:
mylist_dups[val][0] += 1
#define function to update duplicate values with suffix, check if updated value already exists
def update_val(val, num):
temp_val = sep.join([str(x) for x in [val, num]])
if temp_val not in mylist_dups:
return temp_val, num
else:
num += 1
return update_val(val, num)
#update list
for i, val in enumerate(mylist):
if mylist_dups[val][0] > 1:
mylist_dups[val][1] += 1
if update_first or mylist_dups[val][1] > start:
new_val, mylist_dups[val][1] = update_val(val, mylist_dups[val][1])
mylist[i] = new_val
return mylist
mylist_updated = fix_dups(mylist, sep='', start=1, update_first=True)
print(mylist_updated)
#['name1', 'state', 'name3', 'city', 'name4', 'zip1', 'zip2', 'name2']
In case you don’t want to change the first occurrence.
mylist = ["name", "state", "name", "city", "name", "zip", "zip", "name_2"]
mylist_updated = fix_dups(mylist, sep='_', start=0, update_first=False)
print(mylist_updated)
#['name', 'state', 'name_1', 'city', 'name_3', 'zip', 'zip_1', 'name_2']