MultiIndex pandas dataframe and writing to Google Sheets using gspread-pandas
Question:
Starting with the following dictionary:
test_dict = {'header1_1': {'header2_1': {'header3_1': {'header4_1': ['322.5', 330.0, -0.28],
'header4_2': ['322.5', 332.5, -0.26]},
'header3_2': {'header4_1': ['285.0', 277.5, -0.09],
'header4_2': ['287.5', 277.5, -0.12]}},
'header2_2': {'header3_1': {'header4_1': ['345.0', 357.5, -0.14],
'header4_2': ['345.0', 362.5, -0.14]},
'header3_2': {'header4_1': ['257.5', 245.0, -0.1],
'header4_2': ['257.5', 240.0, -0.08]}}}}
I want the headers in the index, so I reform the dictionary:
reformed_dict = {}
for outerKey, innerDict in test_dict.items():
for innerKey, innerDict2 in innerDict.items():
for innerKey2, innerDict3 in innerDict2.items():
for innerKey3, values in innerDict3.items():
reformed_dict[(outerKey,
innerKey, innerKey2, innerKey3)] = values
And assign column names to the headers:
keys = reformed_dict.keys()
values = reformed_dict.values()
index = pd.MultiIndex.from_tuples(keys, names=["H1", "H2", "H3", "H4"])
df = pd.DataFrame(data=values, index=index)
That gets to a dataframe that looks like this:
Issue #1 [*** this has been answered by @AzharKhan, so feel free to skip ahead to Issue #2 ***]: To assign names to the data columns, I tried:
df.columns = ['col 1', 'col 2' 'col 3']
and got error: "ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements"
Then per a suggestion, I tried:
df = df.rename(columns={'0': 'Col1', '1': 'Col2', '2': 'Col3'})
This does not generate an error, but the dataframe looks exactly the same as before, with 0, 1, 2 as the data column headers.
How can I assign names to these data columns? I assume 0, 1, 2 are column indices, not column names.
Issue #2: When I write this dataframe to Google Sheets using gspread-pandas:
s.open_sheet('test')
Spread.df_to_sheet(s, df, index=True, headers=True, start='A8', replace=False)
This is how the dataframe appears in Jupyter notebook screenshot earlier, so it seems the process of writing to spreadsheet is filling in the empty row headers, which makes the table harder to read at a glance.
How can I get the output to spreadsheet to omit the row headers until they have changed, and thus get the second spreadsheet output?
Answers:
Issue #1
Your columns are numbers (not strings). You can see it by:
print(df.columns)
[Out]:
RangeIndex(start=0, stop=3, step=1)
Use numbers in df.rename()
as follows:
df = df.rename(columns={0: 'Col1', 1: 'Col2', 2: 'Col3'})
print(df.columns)
print(df)
[Out]:
Index(['Col1', 'Col2', 'Col3'], dtype='object')
Col1 Col2 Col3
H1 H2 H3 H4
header1_1 header2_1 header3_1 header4_1 322.5 330.0 -0.28
header4_2 322.5 332.5 -0.26
header3_2 header4_1 285.0 277.5 -0.09
header4_2 287.5 277.5 -0.12
header2_2 header3_1 header4_1 345.0 357.5 -0.14
header4_2 345.0 362.5 -0.14
header3_2 header4_1 257.5 245.0 -0.10
header4_2 257.5 240.0 -0.08
Or if you want to generalise it rather than hard coding then use:
df = df.rename(columns={i:f"Col{i+1}" for i in df.columns})
I am not sure about your issue #2. You may want to carve it out into a separate question to get attention.
Here is a way to handle issue #1 by using pd.json_normalize()
df = pd.json_normalize(test_dict,max_level=3).stack().droplevel(0)
idx = df.index.map(lambda x: tuple(x.split('.'))).rename(['H1','H2','H3','H4'])
df = pd.DataFrame(df.tolist(),index = idx,columns = ['col1','col2','col3'])
Output:
col1 col2 col3
H1 H2 H3 H4
header1_1 header2_1 header3_1 header4_1 322.5 330.0 -0.28
header4_2 322.5 332.5 -0.26
header3_2 header4_1 285.0 277.5 -0.09
header4_2 287.5 277.5 -0.12
header2_2 header3_1 header4_1 345.0 357.5 -0.14
header4_2 345.0 362.5 -0.14
header3_2 header4_1 257.5 245.0 -0.10
header4_2 257.5 240.0 -0.08
Issue #2 is tricky because Jupyter notebook displays the index with the "blank" values, but if you were to do df.index
, it would show that all the data is actually there. Its just a visual choice used by Jupyter notebooks.
In order to achieve this, you can check for value changes and join newly created df.
idx_df = df.index.to_frame().reset_index(drop=True)
df = idx_df.where(idx_df.ne(idx_df.shift())).join(df.reset_index(drop=True))
The creator of gspread-pandas has added the functionality to merge indexes when writing a dataframe to Google Sheets. It’s not yet in general release version of gspread-pandas, but can be found here: https://github.com/aiguofer/gspread-pandas/pull/92
Starting with the following dictionary:
test_dict = {'header1_1': {'header2_1': {'header3_1': {'header4_1': ['322.5', 330.0, -0.28],
'header4_2': ['322.5', 332.5, -0.26]},
'header3_2': {'header4_1': ['285.0', 277.5, -0.09],
'header4_2': ['287.5', 277.5, -0.12]}},
'header2_2': {'header3_1': {'header4_1': ['345.0', 357.5, -0.14],
'header4_2': ['345.0', 362.5, -0.14]},
'header3_2': {'header4_1': ['257.5', 245.0, -0.1],
'header4_2': ['257.5', 240.0, -0.08]}}}}
I want the headers in the index, so I reform the dictionary:
reformed_dict = {}
for outerKey, innerDict in test_dict.items():
for innerKey, innerDict2 in innerDict.items():
for innerKey2, innerDict3 in innerDict2.items():
for innerKey3, values in innerDict3.items():
reformed_dict[(outerKey,
innerKey, innerKey2, innerKey3)] = values
And assign column names to the headers:
keys = reformed_dict.keys()
values = reformed_dict.values()
index = pd.MultiIndex.from_tuples(keys, names=["H1", "H2", "H3", "H4"])
df = pd.DataFrame(data=values, index=index)
That gets to a dataframe that looks like this:
Issue #1 [*** this has been answered by @AzharKhan, so feel free to skip ahead to Issue #2 ***]: To assign names to the data columns, I tried:
df.columns = ['col 1', 'col 2' 'col 3']
and got error: "ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements"
Then per a suggestion, I tried:
df = df.rename(columns={'0': 'Col1', '1': 'Col2', '2': 'Col3'})
This does not generate an error, but the dataframe looks exactly the same as before, with 0, 1, 2 as the data column headers.
How can I assign names to these data columns? I assume 0, 1, 2 are column indices, not column names.
Issue #2: When I write this dataframe to Google Sheets using gspread-pandas:
s.open_sheet('test')
Spread.df_to_sheet(s, df, index=True, headers=True, start='A8', replace=False)
This is how the dataframe appears in Jupyter notebook screenshot earlier, so it seems the process of writing to spreadsheet is filling in the empty row headers, which makes the table harder to read at a glance.
How can I get the output to spreadsheet to omit the row headers until they have changed, and thus get the second spreadsheet output?
Issue #1
Your columns are numbers (not strings). You can see it by:
print(df.columns)
[Out]:
RangeIndex(start=0, stop=3, step=1)
Use numbers in df.rename()
as follows:
df = df.rename(columns={0: 'Col1', 1: 'Col2', 2: 'Col3'})
print(df.columns)
print(df)
[Out]:
Index(['Col1', 'Col2', 'Col3'], dtype='object')
Col1 Col2 Col3
H1 H2 H3 H4
header1_1 header2_1 header3_1 header4_1 322.5 330.0 -0.28
header4_2 322.5 332.5 -0.26
header3_2 header4_1 285.0 277.5 -0.09
header4_2 287.5 277.5 -0.12
header2_2 header3_1 header4_1 345.0 357.5 -0.14
header4_2 345.0 362.5 -0.14
header3_2 header4_1 257.5 245.0 -0.10
header4_2 257.5 240.0 -0.08
Or if you want to generalise it rather than hard coding then use:
df = df.rename(columns={i:f"Col{i+1}" for i in df.columns})
I am not sure about your issue #2. You may want to carve it out into a separate question to get attention.
Here is a way to handle issue #1 by using pd.json_normalize()
df = pd.json_normalize(test_dict,max_level=3).stack().droplevel(0)
idx = df.index.map(lambda x: tuple(x.split('.'))).rename(['H1','H2','H3','H4'])
df = pd.DataFrame(df.tolist(),index = idx,columns = ['col1','col2','col3'])
Output:
col1 col2 col3
H1 H2 H3 H4
header1_1 header2_1 header3_1 header4_1 322.5 330.0 -0.28
header4_2 322.5 332.5 -0.26
header3_2 header4_1 285.0 277.5 -0.09
header4_2 287.5 277.5 -0.12
header2_2 header3_1 header4_1 345.0 357.5 -0.14
header4_2 345.0 362.5 -0.14
header3_2 header4_1 257.5 245.0 -0.10
header4_2 257.5 240.0 -0.08
Issue #2 is tricky because Jupyter notebook displays the index with the "blank" values, but if you were to do df.index
, it would show that all the data is actually there. Its just a visual choice used by Jupyter notebooks.
In order to achieve this, you can check for value changes and join newly created df.
idx_df = df.index.to_frame().reset_index(drop=True)
df = idx_df.where(idx_df.ne(idx_df.shift())).join(df.reset_index(drop=True))
The creator of gspread-pandas has added the functionality to merge indexes when writing a dataframe to Google Sheets. It’s not yet in general release version of gspread-pandas, but can be found here: https://github.com/aiguofer/gspread-pandas/pull/92