Deleting repeated values in a nested dictionary

Question:

I know my question is pretty basic, but the answers on the Internet somehow didn’t work. I have a rather long nested dictionary where some values are repeating. Here is a sample slice of my dictionary:

 {'C4QY10_e': {'protein accession': ['C4QY10_e',
   'C4QY10_e',
   'C4QY10_e',
   'C4QY10_e',
   'C4QY10_e'],
  'sequence length': ['1879', '1879', '1879', '1879', '1879'],
  'analysis': ['Pfam', 'Pfam', 'Pfam', 'Pfam', 'Pfam'],
  'signature accession': ['PF18314',
   'PF02801',
   'PF18325',
   'PF00109',
   'PF01648'],
  'signature description': ['Fatty acid synthase type I helical domain',
   'Beta-ketoacyl synthase',
   'Fatty acid synthase subunit alpha Acyl carrier domain',
   'Beta-ketoacyl synthase',
   "4'-phosphopantetheinyl transferase superfamily"],
  'start location': ['328', None, '139', None, '1761'],
  'stop location': ['528', None, '300', None, '1861'],
  'e-value': ['4.7E-73', None, '1.3E-72', None, '1.4E-18'],
  'interpro accession': ['IPR041550', None, 'IPR040899', None, 'IPR008278'],
  'interpro description': ['Fatty acid synthase type I',
   None,
   'Fatty acid synthase subunit alpha',
   None,
   "4'-phosphopantetheinyl transferase domain"],
  'nunique': [1, 1, 1, 1, 1],
  'domain_count': [5, 5, 5, 5, 5]},

As you can see, values are repeated, and some of the lists also contain None values. How can I fix this?

Asked By: Aurinko


Answers:

You can remove duplicates and None values from each list like this (note that going through a set does not keep the original order):

for k,v in d['C4QY10_e'].items():
    d['C4QY10_e'][k] = list(set(filter(None, v)))

Output

{'C4QY10_e': {'protein accession': ['C4QY10_e'],
  'sequence length': ['1879'],
  'analysis': ['Pfam'],
  'signature accession': ['PF18325',
   'PF18314',
   'PF00109',
   'PF01648',
   'PF02801'],
  'signature description': ['Beta-ketoacyl synthase',
   'Fatty acid synthase type I helical domain',
   'Fatty acid synthase subunit alpha Acyl carrier domain',
   "4'-phosphopantetheinyl transferase superfamily"],
  'start location': ['1761', '139', '328'],
  'stop location': ['300', '528', '1861'],
  'e-value': ['1.3E-72', '4.7E-73', '1.4E-18'],
  'interpro accession': ['IPR008278', 'IPR041550', 'IPR040899'],
  'interpro description': ['Fatty acid synthase type I',
   "4'-phosphopantetheinyl transferase domain",
   'Fatty acid synthase subunit alpha'],
  'nunique': [1],
  'domain_count': [5]}}
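
The loop above is tied to the single key 'C4QY10_e'. If the real dictionary holds many such keys, a sketch along the following lines would apply the same clean-up to each of them. The second outer key and the trimmed lists are made up for illustration only.

```python
# Trimmed-down sample; 'EXAMPLE_2' is a hypothetical second outer key.
d = {
    'C4QY10_e': {'analysis': ['Pfam', 'Pfam'], 'e-value': ['4.7E-73', None]},
    'EXAMPLE_2': {'analysis': ['Pfam', None, 'Pfam'], 'domain_count': [5, 5]},
}

# Walk every inner dictionary instead of hard-coding one outer key.
for inner in d.values():
    for k, v in inner.items():
        # Same expression as above: drop None, then de-duplicate via a set.
        inner[k] = list(set(filter(None, v)))

print(d)
```

Only the values of existing keys are reassigned, so changing the inner dictionaries while iterating over them is safe here.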

No key is repeated within a single level of the dictionary; a key can only recur at a different nesting level.

{'C4QY10_e': {'analysis': ['Pfam', 'Pfam', 'Pfam', 'Pfam', 'Pfam'],
              'domain_count': [5, 5, 5, 5, 5],
              'e-value': ['4.7E-73', None, '1.3E-72', None, '1.4E-18'],
              'interpro accession': ['IPR041550',
                                     None,
                                     'IPR040899',
                                     None,
                                     'IPR008278'],
              'interpro description': ['Fatty acid synthase type I',
                                       None,
                                       'Fatty acid synthase subunit alpha',
                                       None,
                                       "4'-phosphopantetheinyl transferase "
                                       'domain'],
              'nunique': [1, 1, 1, 1, 1],
              'protein accession': ['C4QY10_e',
                                    'C4QY10_e',
                                    'C4QY10_e',
                                    'C4QY10_e',
                                    'C4QY10_e'],
              'sequence length': ['1879', '1879', '1879', '1879', '1879'],
              'signature accession': ['PF18314',
                                      'PF02801',
                                      'PF18325',
                                      'PF00109',
                                      'PF01648'],
              'signature description': ['Fatty acid synthase type I helical '
                                        'domain',
                                        'Beta-ketoacyl synthase',
                                        'Fatty acid synthase subunit alpha '
                                        'Acyl carrier domain',
                                        'Beta-ketoacyl synthase',
                                        "4'-phosphopantetheinyl transferase "
                                        'superfamily'],
              'start location': ['328', None, '139', None, '1761'],
              'stop location': ['528', None, '300', None, '1861']}}

You can check the nesting levels in the pretty-printed output above.

To remove the None values from a list, you can do this:

In [1]: list(filter(None, ['328', None, '139', None, '1761']))
Out[1]: ['328', '139', '1761']
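
One caveat worth knowing: filter(None, ...) drops every falsy value, not only None, so a 0 or an empty string would disappear as well. Here is a minimal sketch of the difference, using an invented list (the 0 and '' entries are not from the question):

```python
# Hypothetical sample list; the 0 and '' are invented to show the difference.
values = ['328', None, 0, '', '139']

# filter(None, ...) keeps only truthy items, so 0 and '' are dropped as well.
truthy_only = list(filter(None, values))

# An explicit identity check removes None and nothing else.
not_none = [v for v in values if v is not None]

print(truthy_only)  # ['328', '139']
print(not_none)     # ['328', 0, '', '139']
```

For the data in the question both forms give the same result, since the lists hold only non-empty strings, positive integers and None.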
Answered By: Rahul K P

Since some lists contain duplicates, you can use this to build a de-duplicated list that keeps the original order:

ordered_list = list(dict.fromkeys(duplicated_list))
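
As a rough sketch of how this could be applied to two of the lists from the question, with None filtered out before de-duplicating (the variable names are only illustrative):

```python
# Two of the lists from the question.
descriptions = [
    'Fatty acid synthase type I helical domain',
    'Beta-ketoacyl synthase',
    'Fatty acid synthase subunit alpha Acyl carrier domain',
    'Beta-ketoacyl synthase',
    "4'-phosphopantetheinyl transferase superfamily",
]
starts = ['328', None, '139', None, '1761']

# Dict keys are unique and keep insertion order (guaranteed since Python 3.7).
unique_descriptions = list(dict.fromkeys(descriptions))

# Skip None before building the keys so it does not end up in the result.
unique_starts = list(dict.fromkeys(s for s in starts if s is not None))

print(unique_descriptions)
print(unique_starts)  # ['328', '139', '1761']
```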
Answered By: Kiran S

dictionary = {
    'C4QY10_e':
        {'protein accession':
         ['C4QY10_e',
          'C4QY10_e',
          'C4QY10_e',
          'C4QY10_e',
          'C4QY10_e'],
         'sequence length': ['1879', '1879', '1879', '1879', '1879'],
         'analysis': ['Pfam', 'Pfam', 'Pfam', 'Pfam', 'Pfam'],
         'signature accession': ['PF18314',
                                 'PF02801',
                                 'PF18325',
                                 'PF00109',
                                 'PF01648'],
         'signature description': ['Fatty acid synthase type I helical domain',
                                   'Beta-ketoacyl synthase',
                                   'Fatty acid synthase subunit alpha Acyl carrier domain',
                                   'Beta-ketoacyl synthase',
                                   "4'-phosphopantetheinyl transferase superfamily"],
         'start location': ['328', None, '139', None, '1761'],
         'stop location': ['528', None, '300', None, '1861'],
         'e-value': ['4.7E-73', None, '1.3E-72', None, '1.4E-18'],
         'interpro accession': ['IPR041550', None, 'IPR040899', None, 'IPR008278'],
         'interpro description': ['Fatty acid synthase type I',
                                  None,
                                  'Fatty acid synthase subunit alpha',
                                  None,
                                  "4'-phosphopantetheinyl transferase domain"],
         'nunique': [1, 1, 1, 1, 1],
         'domain_count': [5, 5, 5, 5, 5]},
}


def remove_repeating_value(dictionary):
    for value in dictionary.values():
        for key1, value1 in value.items():
            # Drop the None placeholders first.
            cleaned = [item for item in value1 if item is not None]
            # If every remaining entry is identical, keep a single copy.
            if cleaned and all(item == cleaned[0] for item in cleaned):
                cleaned = cleaned[:1]
            value[key1] = cleaned
    return dictionary


# calling function
print(remove_repeating_value(dictionary))

Output

dictionary = {
    'C4QY10_e':
        {'protein accession': ['C4QY10_e'],
            'sequence length': ['1879'],
            'analysis': ['Pfam'],
            'signature accession': ['PF18314', 'PF02801', 'PF18325', 'PF00109', 'PF01648'],
            'signature description': ['Fatty acid synthase type I helical domain',
                                        'Beta-ketoacyl synthase',
                                        'Fatty acid synthase subunit alpha Acyl carrier domain',
                                        'Beta-ketoacyl synthase',
                                        "4'-phosphopantetheinyl transferase superfamily"],
            'start location': ['328', '139', '1761'],
            'stop location': ['528', '300', '1861'],
            'e-value': ['4.7E-73', '1.3E-72', '1.4E-18'],
            'interpro accession': ['IPR041550', 'IPR040899', 'IPR008278'],
            'interpro description': ['Fatty acid synthase type I',
                                        'Fatty acid synthase subunit alpha', 
                                        "4'-phosphopantetheinyl transferase domain"],
            'nunique': [1],
            'domain_count': [5]}
}
Answered By: amd

I think you want to remove duplicates and None values from the lists. For this, try the following function:

def remove_duplication_from_arr_of_nested_dict(input_dict):
    for key, value in input_dict.items():
        if isinstance(value, dict):
            # Recurse into nested dictionaries.
            input_dict[key] = remove_duplication_from_arr_of_nested_dict(value)
        elif isinstance(value, list):
            # A set drops duplicates (the original order is not preserved).
            value = set(value)
            value.discard(None)
            input_dict[key] = list(value)
    # Return only after every key has been processed.
    return input_dict

Just pass your dictionary to the function; it will remove the duplicates and None values from the lists. If you want to remove keys that are duplicated at different levels, use the same approach to collect all the existing keys, check whether each key is already in that collection, and delete accordingly.

Answered By: Shashi Kant

You can convert a list to a set then back to a list to remove duplicates. However, you need to bear in mind that the original order of the list may not be retained.

Removal of None values can be dealt with in the same list comprehension that’s used to reconstruct the list.

def fix(d):
    for v in d.values():
        if isinstance(v, dict):
            fix(v)
        elif isinstance(v, list):
            v[:] = [e for e in set(v) if e is not None]
    return d

print(fix(data))

If keeping order in the lists is important then:

def fix(d):
    for v in d.values():
        if isinstance(v, dict):
            fix(v)
        elif isinstance(v, list):
            v[:] = list({e: None for e in v if e is not None})
    return d

print(fix(data))
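
Both versions of fix change the lists in place. If the original dictionary needs to stay untouched, a non-mutating variant could look roughly like this. The name cleaned_copy and the trimmed sample data are hypothetical and only there for illustration.

```python
def cleaned_copy(d):
    """Return a new nested dict; the input dictionary is left unchanged."""
    result = {}
    for key, value in d.items():
        if isinstance(value, dict):
            # Recurse so that nested dictionaries are rebuilt as well.
            result[key] = cleaned_copy(value)
        elif isinstance(value, list):
            # dict.fromkeys drops duplicates while keeping first-seen order.
            result[key] = list(dict.fromkeys(e for e in value if e is not None))
        else:
            result[key] = value
    return result


# Trimmed sample based on the question's data.
data = {'C4QY10_e': {'analysis': ['Pfam', 'Pfam'],
                     'start location': ['328', None, '139', None, '1761']}}

cleaned = cleaned_copy(data)
print(cleaned)
```

This costs an extra copy of the structure, which is usually fine unless the dictionary is very large.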
Answered By: OldBill