MultiIndex pandas dataframe and writing to Google Sheets using gspread-pandas

Question

Starting with the following dictionary:

test_dict = {'header1_1': {'header2_1': {'header3_1': {'header4_1': ['322.5', 330.0, -0.28],
    'header4_2': ['322.5', 332.5, -0.26]},
   'header3_2': {'header4_1': ['285.0', 277.5, -0.09],
    'header4_2': ['287.5', 277.5, -0.12]}},
  'header2_2': {'header3_1': {'header4_1': ['345.0', 357.5, -0.14],
    'header4_2': ['345.0', 362.5, -0.14]},
   'header3_2': {'header4_1': ['257.5', 245.0, -0.1],
    'header4_2': ['257.5', 240.0, -0.08]}}}}

I want the headers in the index, so I reform the dictionary:

reformed_dict = {}
for outerKey, innerDict in test_dict.items():
    for innerKey, innerDict2 in innerDict.items():
        for innerKey2, innerDict3 in innerDict2.items():
            for innerKey3, values in innerDict3.items():
                reformed_dict[(outerKey,
                        innerKey, innerKey2, innerKey3)] = values
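
The four nested loops above assume exactly four levels of nesting. As a minimal sketch, assuming every leaf is a non-dict value, the same flattening can be written recursively for any depth. The helper name `flatten` and the `sample` data are hypothetical and only illustrate the idea:

```python
def flatten(nested, prefix=()):
    """Yield (key_path_tuple, leaf_value) pairs from a nested dict."""
    for key, value in nested.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            # Descend one level, carrying the keys seen so far
            yield from flatten(value, path)
        else:
            # Reached a leaf: the tuple of keys becomes the index entry
            yield path, value


sample = {"a": {"x": [1, 2], "y": [3, 4]}, "b": {"x": [5, 6]}}
flat = dict(flatten(sample))
```

With the dictionary from the question, `dict(flatten(test_dict))` should give the same tuple-keyed mapping as `reformed_dict`.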

And assign column names to the headers:

import pandas as pd

keys = list(reformed_dict.keys())
values = list(reformed_dict.values())
index = pd.MultiIndex.from_tuples(keys, names=["H1", "H2", "H3", "H4"])
df = pd.DataFrame(data=values, index=index)

That gets to a dataframe that looks like this:

Issue #1 [*** this has been answered by @AzharKhan, so feel free to skip ahead to Issue #2 ***]: To assign names to the data columns, I tried:

df.columns = ['col 1', 'col 2' 'col 3']

and got the error: "ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements"

Then per a suggestion, I tried:

df = df.rename(columns={'0': 'Col1', '1': 'Col2', '2': 'Col3'})

This does not generate an error, but the dataframe looks exactly the same as before, with 0, 1, 2 as the data column headers.

How can I assign names to these data columns? I assume 0, 1, 2 are column indices, not column names.

Issue #2: When I write this dataframe to Google Sheets using gspread-pandas:

s.open_sheet('test')
Spread.df_to_sheet(s, df, index=True, headers=True, start='A8', replace=False)

The result is this:

What I would like is this:

This is how the dataframe appears in the earlier Jupyter notebook screenshot, so it seems that writing to the spreadsheet fills in the repeated row headers, which makes the table harder to read at a glance.

How can I make the spreadsheet output omit the row headers until they change, and so get the second spreadsheet output?

Asked By: jub

Answer 1

Issue #1

Your column labels are integers, not strings. You can confirm this with:

print(df.columns)

[Out]:
RangeIndex(start=0, stop=3, step=1)

Use numbers in df.rename() as follows:

df = df.rename(columns={0: 'Col1', 1: 'Col2', 2: 'Col3'})
print(df.columns)
print(df)

[Out]:
Index(['Col1', 'Col2', 'Col3'], dtype='object')

                                          Col1   Col2  Col3
H1        H2        H3        H4                           
header1_1 header2_1 header3_1 header4_1  322.5  330.0 -0.28
                              header4_2  322.5  332.5 -0.26
                    header3_2 header4_1  285.0  277.5 -0.09
                              header4_2  287.5  277.5 -0.12
          header2_2 header3_1 header4_1  345.0  357.5 -0.14
                              header4_2  345.0  362.5 -0.14
                    header3_2 header4_1  257.5  245.0 -0.10
                              header4_2  257.5  240.0 -0.08

Or, to generalise it rather than hard-coding the names, use:

df = df.rename(columns={i:f"Col{i+1}" for i in df.columns})
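
For reference, here is a minimal sketch of why the two attempts in the question behaved as they did. The two-row dataframe is a cut-down stand-in for the real one, and the names `broken`, `fixed` and `unchanged` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame([["322.5", 330.0, -0.28], ["322.5", 332.5, -0.26]])

# Adjacent string literals are concatenated by Python, so the missing
# comma turns this into a two-item list, hence the length mismatch
broken = ['col 1', 'col 2' 'col 3']
fixed = ['col 1', 'col 2', 'col 3']

# String keys do not match the integer labels 0, 1, 2, and rename()
# silently skips keys it cannot find, so nothing changes
unchanged = df.rename(columns={'0': 'Col1', '1': 'Col2', '2': 'Col3'})

# With the comma in place, direct assignment works as well
df.columns = fixed
```

So the direct assignment from the question is also fine once the list really has three items.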

I am not sure about your issue #2. You may want to carve it out into a separate question to get attention.

Answered By: Azhar Khan

Answer 2

Here is a way to handle issue #1 using pd.json_normalize():

df = pd.json_normalize(test_dict,max_level=3).stack().droplevel(0)
idx = df.index.map(lambda x: tuple(x.split('.'))).rename(['H1','H2','H3','H4'])
df = pd.DataFrame(df.tolist(),index = idx,columns = ['col1','col2','col3'])

Output:

                                          col1   col2  col3
H1        H2        H3        H4                           
header1_1 header2_1 header3_1 header4_1  322.5  330.0 -0.28
                              header4_2  322.5  332.5 -0.26
                    header3_2 header4_1  285.0  277.5 -0.09
                              header4_2  287.5  277.5 -0.12
          header2_2 header3_1 header4_1  345.0  357.5 -0.14
                              header4_2  345.0  362.5 -0.14
                    header3_2 header4_1  257.5  245.0 -0.10
                              header4_2  257.5  240.0 -0.08

Issue #2 is tricky because Jupyter notebook displays the index with "blank" values, but df.index shows that every row still carries its full set of labels. It's just a display choice, so the values that look blank are written to the sheet along with everything else.
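
A small sketch of that point, using a made-up two-row dataframe. It relies on the `sparsify` argument of `DataFrame.to_string()`, which defaults to the `display.multi_sparse` option:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("header1_1", "header2_1"), ("header1_1", "header2_2")],
    names=["H1", "H2"],
)
df = pd.DataFrame({"col1": [322.5, 345.0]}, index=index)

# Every row holds its complete label tuple, whatever the display shows
labels = df.index.tolist()

# Default rendering hides a repeated outer label ...
sparse_text = df.to_string()
# ... and sparsify=False prints it on every row
full_text = df.to_string(sparsify=False)
```

Printing `full_text` shows what a writer that iterates over the index actually sees.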

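To get the blanks into the written output, you can turn the index into ordinary columns, keep each label only where it differs from the one directly above it, and join the result back onto the data: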

idx_df = df.index.to_frame().reset_index(drop=True)

df = idx_df.where(idx_df.ne(idx_df.shift())).join(df.reset_index(drop=True))
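
One caveat: the snippet compares each index column on its own, so a label that matches the one directly above it but sits under a different parent would also be blanked. The given data does not hit that case. As a hedged variation (not part of the original answer), the sketch below treats a level as changed whenever any level to its left changed, and fills the hidden cells with empty strings instead of NaN. The function name `blank_repeated_index` and the sample data are hypothetical:

```python
import numpy as np
import pandas as pd


def blank_repeated_index(df):
    """Return a flat copy of df with repeated index labels replaced by ''."""
    idx_df = df.index.to_frame().reset_index(drop=True)
    # True where a label differs from the one in the row above
    changed = idx_df.ne(idx_df.shift()).to_numpy()
    # A level also counts as changed when any level to its left changed
    changed = np.logical_or.accumulate(changed, axis=1)
    blanked = idx_df.where(changed, "")
    return blanked.join(df.reset_index(drop=True))


index = pd.MultiIndex.from_tuples(
    [("a", "x"), ("a", "y"), ("b", "y")], names=["H1", "H2"]
)
df = pd.DataFrame({"v": [1, 2, 3]}, index=index)
flat = blank_repeated_index(df)
```

Because the index levels are now ordinary columns, the result would presumably be written with index=False so that only the blanked columns appear in the sheet.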

Answered By: rhug123

Answer 3

The creator of gspread-pandas has added functionality to merge index cells when writing a dataframe to Google Sheets. It is not yet in the general release version of gspread-pandas, but can be found here: https://github.com/aiguofer/gspread-pandas/pull/92

Answered By: jub
