How to use converter in Pandas when using a MultiIndex

Question:

The issue

I have an excel table where the first row is a header and the second row is the unit of measurement for the rest of that column (i.e. nanometers, microns). Pandas provides an excellent read_excel function where I can pass a dictionary of converters. The key of the dictionary is the column name and the value is a lambda function that converts the excel value to some other value I want. In this case, the base value of whatever metric I’m using (nanometers to meters).

I cannot seem to figure out how to get my converter dictionary to use the second header row (the unit of measurement row). If I only specify my headers to take the unit row it works but I want the actual labels to be included in my header.

Here is my code

import numpy as np
import pandas as pd
import re
import os
from typing import Dict
from pandas.core.frame import DataFrame

Converters = {
  "GPa": lambda gpa: gpa * 1_000_000_000,
  "nm": lambda nm: nm / 1_000_000_000,
  "microns": lambda microns: microns / 1_000_000 
}

# Read and load metadata
directory = data_directory + "/" + metadata_directory
filenames = sorted(os.listdir(directory))
for filename in filenames:
  readData = pd.read_excel("./" + directory + "/" + filename, header=[0,1], converters=Converters)
  print(filename, "n", readData.head(2))

OS Specs

Device name DESKTOP-AE4IMFH
Processor Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz 1.50 GHz
Installed RAM 12.0 GB (11.8 GB usable)
Device ID 2B55F49B-6877-455D-88C5-D369A23FB40C
Product ID 00325-96685-10579-AAOEM
System type 64-bit operating system, x64-based processor
Pen and touch Pen and touch support with 10 touch points

Edition Windows 10 Home
Version 20H2
Installed on ‎7/‎23/‎2020
OS build 19042.1052
Experience Windows Feature Experience Pack 120.2212.2020.0

Python Version 3.9.5

What I’ve tried

Getting rid of the MultiIndex and just specifying the header as row 1 works great. However, I really want to have the column names as part of the header.

One thought was maybe to convert the DataFrame as a numpy array and then find the column index that matched each Converter function name. Then we could apply the conversion manually to each row at that column index. However, this feels hacky and would love to find a cleaner solution

Asked By: Michael Black

||

Answers:

I’m not sure I completely understand what you’re trying to do. Nevertheless,
here’s a suggestion:

In the following I’m using as an example an Excel-file test.xlsx with the content

col_1  col_2  col_3
    1      2      3
    1      1      1
    2      2      2
    3      3      3

This

from operator import mul
from functools import partial

units = pd.read_excel('test.xlsx', nrows=1)
converters = {
    col: partial(mul, 1 / units.at[0, col])
    for col in units.columns
}
df = pd.read_excel('test.xlsx', skiprows=[1], converters=converters)

produces the following dataframe df:

   col_1  col_2     col_3
0    1.0    0.5  0.333333
1    2.0    1.0  0.666667
2    3.0    1.5  1.000000

Here the row which contains the units isn’t included. If you want to include it then replace the last line with:

df = pd.concat([
         units,
         pd.read_excel('test.xlsx', skiprows=[1], converters=converters)
     ]).reset_index(drop=True)

Result:

   col_1  col_2     col_3
0    1.0    2.0  3.000000
1    1.0    0.5  0.333333
2    2.0    1.0  0.666667
3    3.0    1.5  1.000000

(If you’re wondering why I haven’t used lambdas for the definition of the converters: This usually fails if you’re defining them via variables.)

So, if you want to integrate that into your code it would look like:

from operator import mul
from functools import partial

...

for filename in filenames:
    filepath = "./" + directory + "/" + filename
    units = pd.read_excel(filepath, nrows=1)
    converters = {
       col: partial(mul, 1 / units.at[0, col])
       for col in units.columns
    }
   readData = pd.read_excel(filepath, skiprows=[1], converters=converters)

EDIT: After rethinking the question today I realized that the use of converters is probably not the best approach here. The converter functions are so basic (simple division) that there’s a better solution available:

for filename in filenames:
   readData = pd.read_excel("./" + directory + "/" + filename)

   # Version 1: Discarding row with units
   readData = (readData.iloc[1:, :] / readData.iloc[0, :]).reset_index(drop=True)
   # Version 2: Keeping row with units
   readData.iloc[1:, :] /= readData.iloc[0, :]
Answered By: Timus

I just came across this question, because I had pretty much the same problem. While @Timus’ answer actually solves the problem at hand, I thought I would still share the solution I came up with, because it actually uses the converters arguement of the read_excel for a MultiIndex dataframe.

Suppose we have the following table (in excel):

width | height |
   nm |     mm |
----------------
    1 |      4 |
    2 |      5 |
    3 |      6 |

The first row is something that was measured, the second row states the unit. All the following rows are the measured data.

Now, to read in the excel file into a Pandas DataFrame and convert the measurement data into meters, you can do the following:

import pandas as pd

converters = {
    ("width", "nm"): lambda nm: nm / 1_000_000_000,
    ("height", "mm"): lambda mm: mm / 1_000,
}

data = pd.read_excel("PATH/TO/EXCEL/FILE", header=[0, 1], converters=converters)
print(data)

The key point here is that a tuple is used to address the columns to which the converters are applied (e.g. ("width", "nm")).

The result looks like this:

          width height
             nm     mm
0  1.000000e-09  0.004
1  2.000000e-09  0.005
2  3.000000e-09  0.006

Of course the units in the DataFrame are not correct anymore. To remove them, you can add the following line to the script:

data.columns = data.columns.droplevel(1)

Afterwards print outputs:

          width  height
0  1.000000e-09   0.004
1  2.000000e-09   0.005
2  3.000000e-09   0.006
Answered By: schuam