Dedupe a list of dicts where the match criteria is multiple key value pairs being identical

Question:

For the given sample input list, I want to dedupe the dicts based on the values of the keys code, tc, signal, and in_force all matching.

sample input:
signals = [
    None,
    None,
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 1, 'target': 0},
    {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 2, 'target': 1},
    {'code': 'sr', 'tc': 1, 'signal': '2U-2D', 'in_force': True, 'trigger': 3, 'target': 2},
    None,
    {'code': 'sr', 'tc': 0, 'signal': '1-2U-2D', 'in_force': True, 'trigger': 4, 'target': 3},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False, 'trigger': 5, 'target': 4},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 6, 'target': 5},
    None,
    {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 7, 'target': 6},
    {'code': 'sr', 'tc': 1, 'signal': '2U-2D', 'in_force': True, 'trigger': 8, 'target': 7},
    {'code': 'sr', 'tc': 0, 'signal': '1-2U-2D', 'in_force': True, 'trigger': 9, 'target': 8},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False, 'trigger': 0, 'target': 9},
]
expected/desired output:
[
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 1, 'target': 0},
    {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 2, 'target': 1},
    {'code': 'sr', 'tc': 1, 'signal': '2U-2D', 'in_force': True, 'trigger': 3, 'target': 2},
    {'code': 'sr', 'tc': 0, 'signal': '1-2U-2D', 'in_force': True, 'trigger': 4, 'target': 3},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False, 'trigger': 5, 'target': 4},
] 

The order of the list does not need to be preserved, and whether it returns the 1st or nth matching dict in the list does not matter.

I could make a very verbose version of this reference code that creates each list of matching key/values, but I feel like there’s got to be a better way.

new_list = []
for position, signal in enumerate(signals):
    if type(signal) == dict:
        if {
            key: value
            for key, value in signal.items()
            if signal["code"] == "sr"
            and signal["tc"] == 0
            and signal["signal"] == "2U-2D"
            and signal["in_force"] == True
        }:
            new_list.append(signal)
Asked By: Jason

Answers:

I’d suggest something like this, with only Python’s standard library:

result = []
seen = set()
for s in signals:
  if not isinstance(s, dict): continue
  signature = (s['code'], s['tc'], s['signal'], s['in_force'])
  if signature in seen: continue
  seen.add(signature)
  result.append(s)
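
The loop above can also be wrapped in a reusable helper. What follows is a minimal sketch and not part of the original answer: the function name `dedupe_by_keys` and the shortened `sample` list are assumptions made for illustration only.

```python
def dedupe_by_keys(items, keys):
    """Return the first dict seen for each distinct combination of `keys`.

    Entries that are not dicts (such as None) are skipped.
    """
    seen = set()
    result = []
    for item in items:
        if not isinstance(item, dict):
            continue
        # Tuples are hashable, so they can be stored in a set.
        signature = tuple(item[key] for key in keys)
        if signature in seen:
            continue
        seen.add(signature)
        result.append(item)
    return result


# Shortened, hypothetical sample in the same shape as the question's input.
sample = [
    None,
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 1, 'target': 0},
    {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 2, 'target': 1},
    None,
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 6, 'target': 5},
]

deduped = dedupe_by_keys(sample, ('code', 'tc', 'signal', 'in_force'))
print([d['trigger'] for d in deduped])  # [1, 2]
```

Passing the key names as an argument keeps the matching criteria in one place if they change later.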
Answered By: nickie
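
import pandas as pd
# Keep only the dict entries (drops the None placeholders).
new_list = pd.Series([s for s in signals if isinstance(s, dict)])
keys = ['code', 'tc', 'signal', 'in_force']
# Build a hashable tuple per dict; duplicated() flags every repeat after the first.
idx = new_list.apply(lambda x: tuple(x[k] for k in keys)).duplicated()
# Invert the mask so that the first occurrence of each combination is kept.
new_list = new_list[~idx].tolist()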
Answered By: Michael Hodel

I don’t know if that is wanted, but pandas could come in quite handy here. Also, if you have other tasks to do with the data, a DataFrame is a convenient way to do them.

import pandas as pd
# filter out None so that only dicts remain, then create a df from them
df = pd.DataFrame(filter(None, signals))

out = df.drop_duplicates(subset=['code', 'tc', 'signal', 'in_force'], keep='first')

out.to_dict('records')

Output:

[{'code': 'sr',
  'tc': 0,
  'signal': '2U-2D',
  'in_force': True,
  'trigger': 1,
  'target': 0},
 {'code': 'lr',
  'tc': 0,
  'signal': '2U-2D',
  'in_force': True,
  'trigger': 2,
  'target': 1},
 {'code': 'sr',
  'tc': 1,
  'signal': '2U-2D',
  'in_force': True,
  'trigger': 3,
  'target': 2},
 {'code': 'sr',
  'tc': 0,
  'signal': '1-2U-2D',
  'in_force': True,
  'trigger': 4,
  'target': 3},
 {'code': 'sr',
  'tc': 0,
  'signal': '2U-2D',
  'in_force': False,
  'trigger': 5,
  'target': 4}]
Answered By: Rabinzel

I found a solution that fits into 1 line of code and does not use any external libraries.

To begin with, let’s filter out all None values:

signals = filter(lambda x: x is not None, signals)

or

signals = [signal for signal in signals if signal is not None]

Now let’s create a dict whose keys are the repr strings of the code, tc, signal, and in_force values of each input dict (this works as long as the values are simple types) and whose values are the complete dicts (with all their keys). Since a dict cannot contain duplicate keys, all the duplicates disappear:

filter_dict = {repr([signal[key] for key in ('code', 'tc', 'signal', 'in_force')]): signal for signal in signals}

Here’s what I’ve got at this point:

{
    "['sr', 0, '2U-2D', True]": {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 6, 'target': 5},
    "['lr', 0, '2U-2D', True]": {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 7, 'target': 6},
    "['sr', 1, '2U-2D', True]": {'code': 'sr', 'tc': 1, 'signal': '2U-2D', 'in_force': True, 'trigger': 8, 'target': 7},
    "['sr', 0, '1-2U-2D', True]": {'code': 'sr', 'tc': 0, 'signal': '1-2U-2D', 'in_force': True, 'trigger': 9, 'target': 8},
    "['sr', 0, '2U-2D', False]": {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False, 'trigger': 0, 'target': 9}
}

Now let’s just take the values of that dict, and it’s all done:

result = list(filter_dict.values())

All these steps may be joined into 1 line of code:

result = list({repr([signal[key] for key in ('code', 'tc', 'signal', 'in_force')]): signal for signal in signals if signal is not None}.values()) # speed: 9.7e-6 seconds per iteration

Final result:

[
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 6, 'target': 5},
    {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 7, 'target': 6},
    {'code': 'sr', 'tc': 1, 'signal': '2U-2D', 'in_force': True, 'trigger': 8, 'target': 7},
    {'code': 'sr', 'tc': 0, 'signal': '1-2U-2D', 'in_force': True, 'trigger': 9, 'target': 8},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': False, 'trigger': 0, 'target': 9}
]
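
Note that the result holds the last dict seen for each key combination (triggers 6, 7, 8, 9, 0), because later entries overwrite earlier ones in a dict comprehension. The question allows this, but if the first occurrence is preferred, one option is `dict.setdefault`, which stores a value only when the key is not yet present. A minimal sketch, using a shortened hypothetical `sample` list rather than the full input:

```python
# Shortened, hypothetical sample in the same shape as the question's input.
sample = [
    None,
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 1, 'target': 0},
    {'code': 'lr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 2, 'target': 1},
    {'code': 'sr', 'tc': 0, 'signal': '2U-2D', 'in_force': True, 'trigger': 6, 'target': 5},
]
keys = ('code', 'tc', 'signal', 'in_force')

# Dict comprehension: a later duplicate overwrites the earlier value.
last_seen = {
    repr([s[k] for k in keys]): s
    for s in sample
    if s is not None
}

# setdefault: the value is stored only the first time a key appears.
first_seen = {}
for s in sample:
    if s is not None:
        first_seen.setdefault(repr([s[k] for k in keys]), s)

print([d['trigger'] for d in last_seen.values()])   # [6, 2]
print([d['trigger'] for d in first_seen.values()])  # [1, 2]
```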

My solution may not be the fastest (because it uses strings), and it may not work with every class that could appear in the original dicts (because some classes may not be converted into strings correctly by the repr function). But at least it’s very simple.

Update:

It turns out that a tuple may be used instead of repr (see comments). This is simpler and was also faster in my timings:

result = list({tuple(signal[key] for key in ('code', 'tc', 'signal', 'in_force')): signal for signal in signals if signal is not None}.values()) # speed: 6.4e-6 seconds per iteration
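
One behavioural difference between the two key styles is worth knowing. Tuple keys compare by value equality, so `1`, `1.0`, and `True` count as the same value, whereas their repr strings all differ. The sketch below uses made-up records (not the question's data) to show the effect:

```python
# Hypothetical records whose 'tc' values are equal but of different types.
records = [
    {'tc': 1, 'trigger': 1},
    {'tc': True, 'trigger': 2},
    {'tc': 1.0, 'trigger': 3},
]

# repr keys: '[1]', '[True]' and '[1.0]' are three distinct strings.
by_repr = {repr([r['tc']]): r for r in records}

# tuple keys: (1,), (True,) and (1.0,) are equal and hash alike.
by_tuple = {(r['tc'],): r for r in records}

print(len(by_repr))   # 3
print(len(by_tuple))  # 1
```

For the sample data in the question, where each key always holds the same type, both variants give the same groups.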

Use filter to skip the None entries and keep tuples of "seen" values in a set for efficient checking.

import operator

seen = set()
clean = []

# Function to get the values for the keys that we are interested in.
getter = operator.itemgetter('code', 'tc', 'signal', 'in_force')

for signal in filter(None, signals):
    if (vals := getter(signal)) in seen:
        # We have already got a dict with these values - skip.
        continue
    seen.add(vals)
    clean.append(signal)

# `expected` is the expected/desired output list from the question.
assert len(clean) == len(expected)
assert all(item in expected for item in clean)
Answered By: snakecharmerb