How to access nested data in a Dask Bag while using dask-mongo

Question:

Below is the sample data:

({'age': 61,
  'name': ['Emiko', 'Oliver'],
  'occupation': 'Medical Student',
  'telephone': '166.814.5565',
  'address': {'address': '645 Drumm Line', 'city': 'Kennewick'},
  'credit-card': {'number': '3792 459318 98518', 'expiration-date': '12/23'}},
 {'age': 54,
  'name': ['Wendolyn', 'Ortega'],
  'occupation': 'Tractor Driver',
  'telephone': '1-975-090-1672',
  'address': {'address': '1274 Harbor Court', 'city': 'Mustang'},
  'credit-card': {'number': '4600 5899 6829 6887',
   'expiration-date': '11/25'}})

We can filter on the root-level elements of the Dask bag as below.
b.filter(lambda record: record['age'] > 30).take(2)  # Select only people over 30

However, I need to access a nested element, i.e. credit-card.expiration-date.
Any help will be appreciated.

Asked By: gauravpks

Answers:

Each record is a plain Python dictionary, so you can index into the nested dictionary inside a map call:

import dask.bag as db

data = ({'age': 61,
         'name': ['Emiko', 'Oliver'],
         'occupation': 'Medical Student',
         'telephone': '166.814.5565',
         'address': {'address': '645 Drumm Line', 'city': 'Kennewick'},
         'credit-card': {'number': '3792 459318 98518', 'expiration-date': '12/23'}},
        {'age': 54,
         'name': ['Wendolyn', 'Ortega'],
         'occupation': 'Tractor Driver',
         'telephone': '1-975-090-1672',
         'address': {'address': '1274 Harbor Court', 'city': 'Mustang'},
         'credit-card': {'number': '4600 5899 6829 6887',
                         'expiration-date': '11/25'}})

bag = db.from_sequence(data)

result = bag.map(lambda record: record['credit-card']['expiration-date']).compute()

print(result)

which returns

['12/23', '11/25']
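
The same nested lookup also works inside filter, which is closer to the example in the question. Below is a minimal sketch, assuming every record carries a single credit-card dictionary; the helper name expires_in_year and the trimmed-down records are made up for illustration, and the synchronous scheduler is used only to keep the small example in one process.

```python
import dask.bag as db

# Trimmed-down records shaped like the sample data in the question.
records = [
    {'age': 61,
     'credit-card': {'number': '3792 459318 98518', 'expiration-date': '12/23'}},
    {'age': 54,
     'credit-card': {'number': '4600 5899 6829 6887', 'expiration-date': '11/25'}},
]


def expires_in_year(record, year):
    """Hypothetical helper: True if the card's 'MM/YY' date ends in the given year."""
    # .get() avoids a KeyError when a record has no credit-card entry.
    card = record.get('credit-card') or {}
    return card.get('expiration-date', '').endswith('/' + year)


bag = db.from_sequence(records)

# Keep only the records whose card expires in 2025.
expiring_2025 = bag.filter(lambda record: expires_in_year(record, '25'))

# The synchronous scheduler runs in-process; drop the argument for real workloads.
matches = expiring_2025.compute(scheduler='synchronous')
ages = [record['age'] for record in matches]
print(ages)
```

Using .get() rather than direct indexing is a design choice here: documents read from a database do not always share the same fields, and a missing key would otherwise fail the whole computation.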

If some individuals have several cards, so that 'credit-card' holds a list of dictionaries instead of a single dictionary, handle both cases like this:

import dask.bag as db

data = ({
            'age': 61,
            'name': ['Emiko', 'Oliver'],
            'occupation': 'Medical Student',
            'telephone': '166.814.5565',
            'address': {'address': '645 Drumm Line', 'city': 'Kennewick'},
            'credit-card': {'number': '3792 459318 98518', 'expiration-date': '12/23'}
        },
        {
            'age': 54,
            'name': ['Wendolyn', 'Ortega'],
            'occupation': 'Tractor Driver',
            'telephone': '1-975-090-1672',
            'address': {'address': '1274 Harbor Court', 'city': 'Mustang'},
            'credit-card': [
                {'number': '4600 5899 6829 6887', 'expiration-date': '11/25'},
                {'number': '4610 5899 6829 6887', 'expiration-date': '11/26'},
            ]
        })

bag = db.from_sequence(data)

result = bag.map(lambda record: record['credit-card']['expiration-date']
                  if isinstance(record['credit-card'], dict)
                  else [card['expiration-date'] for card in record['credit-card']]).compute()

print(result)

which will return

['12/23', ['11/25', '11/26']]
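
If a flat list of dates is easier to work with than the mixed result above, one option is to always return a list from the mapped function and then call flatten. This is a sketch under the same assumptions as the data above; the helper name expiration_dates is hypothetical, and the synchronous scheduler is used only to keep the example in one process.

```python
import dask.bag as db

# Trimmed-down records: one with a single card, one with a list of cards.
records = [
    {'age': 61,
     'credit-card': {'number': '3792 459318 98518', 'expiration-date': '12/23'}},
    {'age': 54,
     'credit-card': [
         {'number': '4600 5899 6829 6887', 'expiration-date': '11/25'},
         {'number': '4610 5899 6829 6887', 'expiration-date': '11/26'},
     ]},
]


def expiration_dates(record):
    """Hypothetical helper: always return a list of expiration dates."""
    cards = record['credit-card']
    # Wrap a single card in a list so both shapes are handled the same way.
    if isinstance(cards, dict):
        cards = [cards]
    return [card['expiration-date'] for card in cards]


bag = db.from_sequence(records)

# map yields one list per record; flatten merges them into a single bag of dates.
flat = bag.map(expiration_dates).flatten().compute(scheduler='synchronous')
print(flat)
```

This loses the grouping by person, so it fits cases where only the dates themselves matter, such as counting how many cards expire in a given year.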