case-sensitive list sorting, but just the duplicate values?

Question:

I have a list of strings like this:
['A', 'b','C','adam','ADam','EVe','eve','Eve','d','Adam']

I need to sort only the duplicate values (strings that differ only in case), in case-sensitive string order, to get this output:
['A', 'b','C','ADam','Adam','adam','EVe','Eve','eve','d']

Here 'ADam', 'Adam' and 'adam' were originally at different places in the list, but by standard (case-sensitive) ordering they should appear in this sequence. So when the sort encounters 'adam', it should find its case-insensitive duplicates, sort them, and place them together at that point in the list, as shown in the output.
Note that all the other values stay as they are, i.e. 'A', 'b', 'C' and 'd' keep their original order relative to each other.

I can do a standard sort, or write complex code to do this, but I am looking for an existing, efficient mechanism, because this list can be huge (billions of records), so efficiency is crucial.

Any ideas, or pointers to an existing library or code snippets, would help.
Thanks in advance.

Asked By: Pradeepta Dinda

||

Answers:

Try:

lst = ["A", "b", "C", "adam", "ADam", "EVe", "eve", "Eve", "d", "Adam"]

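# Record the index at which each lowercased word first appears.
tmp = {}
for i, word in enumerate(map(str.lower, lst)):
    if word not in tmp:
        tmp[word] = i

# Sort by first-appearance index, then by the word itself (case-sensitive).
lst = sorted(lst, key=lambda w: (tmp[w.lower()], w))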
print(lst)

Prints:

['A', 'b', 'C', 'ADam', 'Adam', 'adam', 'EVe', 'Eve', 'eve', 'd']
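
To see why this works, it can help to look at the sort key computed for each element. The sketch below is only an illustration of the snippet above; `first_seen` is a more descriptive stand-in for `tmp`, and `keys` is a name introduced here. Each key is a tuple: the index at which the lowercased word first appeared, then the word itself. Groups therefore stay in first-appearance order, and ties within a group are broken by ordinary case-sensitive string comparison.

```python
lst = ["A", "b", "C", "adam", "ADam", "EVe", "eve", "Eve", "d", "Adam"]

# Map each lowercased word to the index of its first occurrence.
first_seen = {}
for i, word in enumerate(lst):
    first_seen.setdefault(word.lower(), i)

# The key used for sorting: (group position, original spelling).
keys = [(first_seen[w.lower()], w) for w in lst]
for key in keys:
    print(key)

# Sorting on that key yields the list shown in the answer.
result = sorted(lst, key=lambda w: (first_seen[w.lower()], w))
print(result)
```

Because the whole list goes through one comparison sort, the cost is O(n log n) regardless of how many values actually have duplicates.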

A benchmark comparing my answer with the others:

import numpy as np
import pandas as pd
from timeit import timeit
from itertools import chain
from collections import defaultdict

lst = ["A", "b", "C", "adam", "ADam", "EVe", "eve", "Eve", "d", "Adam"]


def sort_1(lst):
    tmp = {}
    for i, word in enumerate(map(str.lower, lst)):
        if word not in tmp:
            tmp[word] = i

    lst.sort(key=lambda w: (tmp[w.lower()], w))
    return lst


def sort_2(s):
    return s.iloc[np.lexsort([s, pd.factorize(s.str.lower())[0]])]


def sort_3(lst):
    d = defaultdict(list)
    for word in lst:
        d[word.casefold()].append(word)
    return list(chain.from_iterable(map(sorted, d.values())))


t1 = timeit("sort_1(l)", setup="l = lst*10_000", number=1, globals=globals())
t2 = timeit(
    "sort_2(s)", setup="s = pd.Series(lst*10_000)", number=1, globals=globals()
)
t3 = timeit("sort_3(l)", setup="l = lst*10_000", number=1, globals=globals())

print(t1)
print(t2)
print(t3)

Prints on my machine (Python 3.9, AMD 3700X):

0.04896668600849807
0.05656355991959572
0.015082631958648562
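
The timings are only comparable if the functions return the same result, so a quick sanity check may be worthwhile. The sketch below re-declares the two pure-Python approaches under the hypothetical names `by_first_index` and `by_grouping` and checks that they agree on repeated benchmark data. It assumes ASCII-only input, where `str.lower` and `str.casefold` behave the same, and it uses `sorted` rather than an in-place sort so the input is left untouched.

```python
from collections import defaultdict
from itertools import chain

lst = ["A", "b", "C", "adam", "ADam", "EVe", "eve", "Eve", "d", "Adam"]


def by_first_index(words):
    # Same idea as sort_1: key on (first index of lowercased word, word).
    first_seen = {}
    for i, word in enumerate(words):
        first_seen.setdefault(word.lower(), i)
    return sorted(words, key=lambda w: (first_seen[w.lower()], w))


def by_grouping(words):
    # Same idea as sort_3: bucket by case-folded word, sort each bucket.
    groups = defaultdict(list)
    for word in words:
        groups[word.casefold()].append(word)
    return list(chain.from_iterable(map(sorted, groups.values())))


data = lst * 1_000
same = by_first_index(data) == by_grouping(data)
print(same)
```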
Answered By: Andrej Kesely

Using pandas+numpy:

import pandas as pd
import numpy as np

l = ['A', 'b','C','adam','ADam','EVe','eve','Eve','d','Adam']

s = pd.Series(l)

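# factorize numbers each lowercased word by order of first appearance.
# lexsort treats the LAST key as primary: group number first, then the string.
s.iloc[np.lexsort([s, pd.factorize(s.str.lower())[0]])]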

Output:

0       A
1       b
2       C
4    ADam
9    Adam
3    adam
5     EVe
7     Eve
6     eve
8       d
dtype: object
Answered By: mozway

Using the fact that Python dictionaries preserve insertion order (an implementation detail of CPython 3.6, guaranteed by the language since 3.7):

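from collections import defaultdict
from itertools import chain

lst = ['A', 'b', 'C', 'adam', 'ADam', 'EVe', 'eve', 'Eve', 'd', 'Adam']

# Group words by their case-folded form; the dict keeps first-seen order.
d = defaultdict(list)
for word in lst:
    d[word.casefold()].append(word)

# Sort each group case-sensitively, then flatten the groups into one list.
result = list(chain.from_iterable(map(sorted, d.values())))
print(result)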
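
One detail worth noting: this answer groups with `str.casefold`, while the first answer uses `str.lower`. For ASCII input the two are equivalent, but they can differ on some non-ASCII text. A minimal sketch, using the German `ß` as an illustrative case; the sample words and the helper `group_by` are made up for this example and are not from the question:

```python
from collections import defaultdict

words = ["Straße", "STRASSE", "strasse"]


def group_by(words, normalize):
    # Bucket words by their normalized form, keeping first-seen group order,
    # and sort each bucket case-sensitively.
    groups = defaultdict(list)
    for word in words:
        groups[normalize(word)].append(word)
    return [sorted(group) for group in groups.values()]


lower_groups = group_by(words, str.lower)        # "ß" is left unchanged
casefold_groups = group_by(words, str.casefold)  # "ß" is folded to "ss"

print(lower_groups)
print(casefold_groups)
```

With `str.lower` the word containing `ß` ends up in its own group, whereas `str.casefold` treats all three spellings as duplicates. Which behaviour is right depends on what should count as a duplicate in the data.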
Answered By: Mad Physicist