How to transform a DataFrame with a complicated Series into a new DataFrame
Question:
I'm having a hard time transforming a DataFrame with 2 columns into another DataFrame. The first column is my index (ints) and the other column is a complicated Series. As far as I can see, the structure of the Series goes like this:
A dictionary with one key and one value. The value is a list of simple dictionaries with key/value pairs.
My DataFrame looks like this:
Series:
    {
        "CashFlowDto": [
            {
                "TicketId": None,
                "Type": "Amrt",
                "Amount": 560.61,
                "PercentualAmount": 0.0494481,
                "MaturityDate": datetime.datetime(2023, 7, 10, 0, 0),
                "PaymentDate": datetime.datetime(2023, 7, 10, 0, 0),
            },
            {
                "TicketId": None,
                "Type": "Amrt",
                "Amount": 552.05,
                "PercentualAmount": 0.048693,
                "MaturityDate": datetime.datetime(2023, 8, 10, 0, 0),
                "PaymentDate": datetime.datetime(2023, 8, 10, 0, 0),
            },
        ]
    }
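For anyone who wants to reproduce this, here is a minimal sketch of how such a DataFrame could be built. The column name `CashFlows`, the index name `TicketId`, and the ticket ids 1 and 2 are assumptions for illustration only; the real data may be laid out differently.

```python
import datetime

import pandas as pd

# One cell of the Series: a dict with a single key whose value is a list of flat dicts.
cell = {
    "CashFlowDto": [
        {
            "TicketId": None,
            "Type": "Amrt",
            "Amount": 560.61,
            "PercentualAmount": 0.0494481,
            "MaturityDate": datetime.datetime(2023, 7, 10),
            "PaymentDate": datetime.datetime(2023, 7, 10),
        },
        {
            "TicketId": None,
            "Type": "Amrt",
            "Amount": 552.05,
            "PercentualAmount": 0.048693,
            "MaturityDate": datetime.datetime(2023, 8, 10),
            "PaymentDate": datetime.datetime(2023, 8, 10),
        },
    ]
}

# Assumed layout: integer ticket ids as the index, one such dict per row.
# Both rows reuse the same cell purely to keep the sketch short.
tickets_df = pd.DataFrame(
    {"CashFlows": [cell, cell]},
    index=pd.Index([1, 2], name="TicketId"),
)
```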
My desired output:
Could you guys help me, please?
Thanks
Answers:
Here’s one approach:
- First, use `Series.tolist` on column `CashFlowDto` and pass the result to `pd.DataFrame`. See this SO answer.
- Next, repeat the result n times (i.e. `n = len(df)`) using `pd.concat`, and make sure to set the parameter `ignore_index` to `True`.
- Now, also get a repeat for `df['TicketId']`, for which we can use `np.repeat`, and keep only the values (using `Series.to_numpy`; alternatively, reset `Series.index`).
- Finally, combine the new `df` and the repeats for `df['TicketId']`, using `df.assign`.
    import numpy as np
    import pandas as pd

    n = len(df)
    res = (
        # one row per dict in the column, stacked n times
        pd.concat([pd.DataFrame(df['CashFlowDto'].tolist())] * n, ignore_index=True)
        # each TicketId repeated n times, index dropped so it aligns by position
        .assign(TicketId=np.repeat(df.TicketId, n).to_numpy())
    )
    res

       TicketId  Type  Amount  PercentualAmount MaturityDate PaymentDate
    0         1  Amrt  560.61          0.049448   2023-07-10  2023-07-10
    1         1  Amrt  552.05          0.048693   2023-08-10  2023-08-10
    2         2  Amrt  560.61          0.049448   2023-07-10  2023-07-10
    3         2  Amrt  552.05          0.048693   2023-08-10  2023-08-10
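To make the snippet above runnable end to end, here is a minimal sketch with sample input. The layout of `df` is an assumption inferred from the printed output: a `TicketId` column and a `CashFlowDto` column holding one flat dict per row. Under that assumption, the approach pairs every ticket with every cash flow, so it may not be what you want if each ticket has its own list of cash flows.

```python
import datetime

import numpy as np
import pandas as pd

# Assumed input: one flat cash-flow dict per row (values taken from the question).
flows = [
    {"TicketId": None, "Type": "Amrt", "Amount": 560.61, "PercentualAmount": 0.0494481,
     "MaturityDate": datetime.datetime(2023, 7, 10), "PaymentDate": datetime.datetime(2023, 7, 10)},
    {"TicketId": None, "Type": "Amrt", "Amount": 552.05, "PercentualAmount": 0.048693,
     "MaturityDate": datetime.datetime(2023, 8, 10), "PaymentDate": datetime.datetime(2023, 8, 10)},
]
df = pd.DataFrame({"TicketId": [1, 2], "CashFlowDto": flows})

n = len(df)
res = (
    # Expand the dicts into columns, then stack that frame n times.
    pd.concat([pd.DataFrame(df["CashFlowDto"].tolist())] * n, ignore_index=True)
    # Overwrite the empty TicketId column with each id repeated n times.
    .assign(TicketId=np.repeat(df.TicketId, n).to_numpy())
)
print(res)
```

Because `assign` overwrites the existing (empty) `TicketId` column rather than appending a new one, the column stays in first position.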
There's probably a more elegant way to do this by applying `pd.json_normalize` to your original data, but I'll suggest a solution using a list comprehension (and `zip`).
If your current DataFrame is named `tickets_df`, then you can try:
    cashflows_df = pd.DataFrame([
        {'Ticket': tId, **{k: v for k, v in cfd.items() if k != 'TicketId'}}
        for tId, cf in zip(
            # tickets_df['TicketId'], tickets_df['CashFlows']
            tickets_df.index, tickets_df['CashFlows']  # if TicketId is the index
        )
        for cfd in cf['CashFlowDto']
    ])
(I edited the `Type` field just to demonstrate that the rows are separate as they should be.)
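For comparison, here is a rough sketch of what the `pd.json_normalize` route might look like. It assumes the same layout as above (ticket ids in the index, a column named `CashFlows` whose cells are dicts with a single `CashFlowDto` key); the sample rows and the `records` helper list are made up for illustration, with the second ticket given only one cash flow to show that lists of different lengths are handled.

```python
import datetime

import pandas as pd

# Hypothetical sample data in the assumed layout.
flow_a = {"TicketId": None, "Type": "Amrt", "Amount": 560.61, "PercentualAmount": 0.0494481,
          "MaturityDate": datetime.datetime(2023, 7, 10), "PaymentDate": datetime.datetime(2023, 7, 10)}
flow_b = {"TicketId": None, "Type": "Amrt", "Amount": 552.05, "PercentualAmount": 0.048693,
          "MaturityDate": datetime.datetime(2023, 8, 10), "PaymentDate": datetime.datetime(2023, 8, 10)}
tickets_df = pd.DataFrame(
    {"CashFlows": [{"CashFlowDto": [flow_a, flow_b]}, {"CashFlowDto": [flow_a]}]},
    index=[1, 2],
)

# Attach the ticket id to each cell so json_normalize can carry it along as metadata.
records = [
    {"Ticket": tid, "CashFlowDto": cell["CashFlowDto"]}
    for tid, cell in tickets_df["CashFlows"].items()
]

cashflows_df = (
    # One output row per entry of each CashFlowDto list, tagged with its ticket id.
    pd.json_normalize(records, record_path="CashFlowDto", meta=["Ticket"])
    # Drop the empty TicketId field that comes from the inner dicts.
    .drop(columns="TicketId")
)
print(cashflows_df)
```

Note that `json_normalize` appends the metadata column (`Ticket`) last, so you may want to reorder the columns afterwards.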