combine multiple dict with different keys but to a dataframe

Question:

I am trying to create a dataframe from a dict with different key names.

Here is a MWE:

# loads price data
from yahoofinancials import YahooFinancials  

yahoo_tickers  = ['SMT.L', 'MSFT', 'NIO']
yahoo_financials = YahooFinancials(yahoo_tickers)
data = yahoo_financials.get_historical_price_data(start_date='1970-01-30',
                                                      end_date='2022-12-30',
                                                      time_interval='daily')

This prints a dictionary with the three keys (here the name of the yahoo tickers) with information about each stock. This is stored as another dictionary within a dictionary.

I tried to clean up this dictvia loops, but end up being quite a slow process, especially with many more stocks.

Can someone suggest something that I can quickly convert the data dictin a dataframe that looks like this:

My expected result:

Out[11]: 
              high     low    open  ...   adjclose  yahoo_ticker  instrumentType
Date                                ...                                         
1968-12-31   4.852   4.852   4.852  ...   1.762543         SMT.L          EQUITY
1969-01-01   4.852   4.852   4.852  ...   1.762543         SMT.L          EQUITY
1969-01-02   4.852   4.852   4.852  ...   1.762543         SMT.L          EQUITY
1969-01-03   4.852   4.852   4.852  ...   1.762543         SMT.L          EQUITY
1969-01-06   4.852   4.852   4.852  ...   1.762543         SMT.L          EQUITY
            ...     ...     ...  ...        ...           ...             ...
2022-12-23  11.220  10.690  11.220  ...  10.970000           NIO          EQUITY
2022-12-27  10.610   9.970  10.530  ...  10.060000           NIO          EQUITY
2022-12-28  10.250   9.610  10.010  ...   9.800000           NIO          EQUITY
2022-12-29  10.270   9.770   9.920  ...   9.990000           NIO          EQUITY
2022-12-30   9.980   9.520   9.830  ...   9.750000           NIO          EQUITY
Asked By: msh855

||

Answers:

Using json_normalize(). Since their isn’t just 1 main key you need to normalize with a list comp and concat the results. The nested column "prices" will be used in the record_path to flatten the data. Using method chaining to clean up and format df.

import pandas as pd
from yahoofinancials import YahooFinancials


yahoo_tickers = ["SMT.L", "MSFT", "NIO"]
yahoo_financials = YahooFinancials(yahoo_tickers)
data = yahoo_financials.get_historical_price_data(
    start_date="1970-01-30",
    end_date="2022-12-30",
    time_interval="daily"
)

df = (
    pd
    .concat([pd.json_normalize(data=data.get(x), record_path=["prices"], meta=["instrumentType"]) for x in data], keys=data.keys())
    .droplevel(level=1)
    .reset_index(names="yahoo_ticker")
    .drop(columns="date")
    .set_index("formatted_date")
    .rename(columns={"formatted_date": "date"})
)

print(df)

Output:

               yahoo_ticker    high     low  ...    volume   adjclose  instrumentType
formatted_date                               ...                                     
1970-01-30            SMT.L   3.852   3.852  ...         0   1.399282          EQUITY
1970-02-02            SMT.L   3.852   3.852  ...         0   1.399282          EQUITY
1970-02-03            SMT.L   3.852   3.852  ...         0   1.399282          EQUITY
1970-02-04            SMT.L   3.852   3.852  ...         0   1.399282          EQUITY
1970-02-05            SMT.L   3.852   3.852  ...         0   1.399282          EQUITY
...                     ...     ...     ...  ...       ...        ...             ...
2022-12-22              NIO  11.580  10.760  ...  32468900  11.290000          EQUITY
2022-12-23              NIO  11.220  10.690  ...  33610000  10.970000          EQUITY
2022-12-27              NIO  10.610   9.970  ...  54165700  10.060000          EQUITY
2022-12-28              NIO  10.250   9.610  ...  42225300   9.800000          EQUITY
2022-12-29              NIO  10.270   9.770  ...  49380200   9.990000          EQUITY
Answered By: Jason Baker

I would suggest using list compression to flatten the nested dictionary also it will be much more efficient than using pandas normalization methods.

pd.DataFrame([
    {
        **p,
        'yahooTicker': k, 
        'instrumentType': v['instrumentType']
    } for k, v in data.items() for p in v['prices']
]).drop(columns=['date']).set_index('formatted_date')

Result

                 high    low   open  close  volume  adjclose yahooTicker instrumentType
formatted_date                                                                         
1970-01-30      3.852  3.852  3.852  3.852       0  1.399282       SMT.L         EQUITY
1970-02-02      3.852  3.852  3.852  3.852       0  1.399282       SMT.L         EQUITY
1970-02-03      3.852  3.852  3.852  3.852       0  1.399282       SMT.L         EQUITY
1970-02-04      3.852  3.852  3.852  3.852       0  1.399282       SMT.L         EQUITY
1970-02-05      3.852  3.852  3.852  3.852       0  1.399282       SMT.L         EQUITY
...
Answered By: Shubham Sharma
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.