How to use converter in Pandas when using a MultiIndex
Question:
The issue
I have an Excel table where the first row is a header and the second row is the unit of measurement for the rest of that column (e.g. nanometers, microns). Pandas provides an excellent read_excel function to which I can pass a dictionary of converters. The key of the dictionary is the column name and the value is a lambda function that converts the Excel value to some other value I want; in this case, the base unit of whatever metric I’m using (nanometers to meters).
I cannot seem to figure out how to get my converter dictionary to use the second header row (the unit of measurement row). If I only specify my headers to take the unit row it works but I want the actual labels to be included in my header.
Here is my code
import numpy as np
import pandas as pd
import re
import os
from typing import Dict
from pandas.core.frame import DataFrame
# Converters keyed by unit of measurement (the second header row)
Converters = {
    "GPa": lambda gpa: gpa * 1_000_000_000,
    "nm": lambda nm: nm / 1_000_000_000,
    "microns": lambda microns: microns / 1_000_000
}
# Read and load metadata
directory = data_directory + "/" + metadata_directory
filenames = sorted(os.listdir(directory))
for filename in filenames:
    readData = pd.read_excel("./" + directory + "/" + filename, header=[0,1], converters=Converters)
    print(filename, "\n", readData.head(2))
OS Specs
Device name DESKTOP-AE4IMFH
Processor Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz 1.50 GHz
Installed RAM 12.0 GB (11.8 GB usable)
Device ID 2B55F49B-6877-455D-88C5-D369A23FB40C
Product ID 00325-96685-10579-AAOEM
System type 64-bit operating system, x64-based processor
Pen and touch Pen and touch support with 10 touch points
Edition Windows 10 Home
Version 20H2
Installed on 7/23/2020
OS build 19042.1052
Experience Windows Feature Experience Pack 120.2212.2020.0
Python Version 3.9.5
What I’ve tried
Getting rid of the MultiIndex and just specifying the header as row 1 works great. However, I really want to have the column names as part of the header.
One thought was to convert the DataFrame to a numpy array and then find the column index that matches each Converter function name. Then we could apply the conversion manually to each row at that column index. However, this feels hacky and I would love to find a cleaner solution.
Answers:
I’m not sure I completely understand what you’re trying to do. Nevertheless, here’s a suggestion:
In the following I’m using as an example an Excel file test.xlsx with the content
col_1 col_2 col_3
1 2 3
1 1 1
2 2 2
3 3 3
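This
import pandas as pd
from operator import mul
from functools import partial
# The first data row holds the units (here: numeric divisors)
units = pd.read_excel('test.xlsx', nrows=1)
# partial(mul, 1 / unit) multiplies a cell by 1 / unit, i.e. divides it by the unit
converters = {
    col: partial(mul, 1 / units.at[0, col])
    for col in units.columns
}
# Skip the units row and convert the remaining cells
df = pd.read_excel('test.xlsx', skiprows=[1], converters=converters)
produces the following dataframe df: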
col_1 col_2 col_3
0 1.0 0.5 0.333333
1 2.0 1.0 0.666667
2 3.0 1.5 1.000000
Here the row which contains the units isn’t included. If you want to include it then replace the last line with:
df = pd.concat([
    units,
    pd.read_excel('test.xlsx', skiprows=[1], converters=converters)
]).reset_index(drop=True)
Result:
col_1 col_2 col_3
0 1.0 2.0 3.000000
1 1.0 0.5 0.333333
2 2.0 1.0 0.666667
3 3.0 1.5 1.000000
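(If you’re wondering why I haven’t used lambdas to define the converters: a lambda created in a loop or comprehension looks up the loop variable only when it is called, so every converter would end up using the value from the last column. partial binds the value at the moment the converter is created.)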
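The note about lambdas can be illustrated with a small sketch that needs no Excel file. The `factors` dictionary is a made-up stand-in for the units row, and the default-argument variant is an additional common workaround that the answer itself does not use.

```python
from functools import partial
from operator import mul

# Hypothetical per-column divisors, standing in for the units row
factors = {"col_1": 1, "col_2": 2, "col_3": 3}

# Lambdas look up `factor` only when they are called, i.e. after the loop
# has finished, so every converter divides by the last value (3).
late_bound = {}
for col, factor in factors.items():
    late_bound[col] = lambda value: value / factor

# partial evaluates 1 / factor immediately and stores the result.
bound_now = {col: partial(mul, 1 / factor) for col, factor in factors.items()}

# A default argument is also evaluated at definition time.
bound_default = {
    col: (lambda value, factor=factor: value / factor)
    for col, factor in factors.items()
}

print(late_bound["col_1"](6))     # divides by 3, not by 1
print(bound_now["col_1"](6))      # divides by 1
print(bound_default["col_2"](6))  # divides by 2
```

Either `partial` or the default-argument form should work as a converter; which one reads better is mostly a matter of taste.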
So, if you want to integrate that into your code it would look like:
from operator import mul
from functools import partial
...
for filename in filenames:
    filepath = "./" + directory + "/" + filename
    units = pd.read_excel(filepath, nrows=1)
    converters = {
        col: partial(mul, 1 / units.at[0, col])
        for col in units.columns
    }
    readData = pd.read_excel(filepath, skiprows=[1], converters=converters)
EDIT: After rethinking the question today I realized that the use of converters is probably not the best approach here. The converter functions are so basic (simple division) that there’s a better solution available:
for filename in filenames:
    readData = pd.read_excel("./" + directory + "/" + filename)
    # Use one of the two versions below, not both.
    # Version 1: Discarding row with units
    readData = (readData.iloc[1:, :] / readData.iloc[0, :]).reset_index(drop=True)
    # Version 2: Keeping row with units
    readData.iloc[1:, :] /= readData.iloc[0, :]
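To see what Version 1 does without an Excel file, here is a minimal sketch on an in-memory DataFrame that mirrors the test.xlsx example above. The name `raw` is a stand-in for the result of read_excel. Note the assumption that the units row holds numeric divisors; with text units such as "nm" the division would not work.

```python
import pandas as pd

# Stand-in for the sheet: row 0 holds the per-column divisor,
# the remaining rows hold the measurements.
raw = pd.DataFrame({
    "col_1": [1, 1, 2, 3],
    "col_2": [2, 1, 2, 3],
    "col_3": [3, 1, 2, 3],
})

divisors = raw.iloc[0, :]

# Dividing a DataFrame by a Series aligns the Series index with the column
# labels, so each column is divided by its own divisor.
scaled = (raw.iloc[1:, :] / divisors).reset_index(drop=True)

print(scaled)
```

The printed frame should match the converter-based result shown earlier in this answer.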
I just came across this question because I had pretty much the same problem. While @Timus’ answer solves the problem at hand, I thought I would still share the solution I came up with, because it actually uses the converters argument of read_excel for a MultiIndex dataframe.
Suppose we have the following table (in excel):
width | height |
nm | mm |
----------------
1 | 4 |
2 | 5 |
3 | 6 |
The first row is something that was measured, the second row states the unit. All the following rows are the measured data.
Now, to read in the excel file into a Pandas DataFrame and convert the measurement data into meters, you can do the following:
import pandas as pd
converters = {
    ("width", "nm"): lambda nm: nm / 1_000_000_000,
    ("height", "mm"): lambda mm: mm / 1_000,
}
data = pd.read_excel("PATH/TO/EXCEL/FILE", header=[0, 1], converters=converters)
print(data)
The key point here is that a tuple is used to address the columns to which the converters are applied (e.g. ("width", "nm")).
The result looks like this:
width height
nm mm
0 1.000000e-09 0.004
1 2.000000e-09 0.005
2 3.000000e-09 0.006
Of course the units in the DataFrame are not correct anymore. To remove them, you can add the following line to the script:
data.columns = data.columns.droplevel(1)
Afterwards, print outputs:
width height
0 1.000000e-09 0.004
1 2.000000e-09 0.005
2 3.000000e-09 0.006
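If the converters are defined per unit, as in the question, the tuple keys do not have to be typed by hand. The following is only a sketch: `converters_for` is a hypothetical helper, and the column tuples are made up for illustration. With a real file they could be taken from a first read without converters, for example list(pd.read_excel(path, header=[0, 1]).columns).

```python
# Unit-keyed converters, as defined in the question
UNIT_CONVERTERS = {
    "GPa": lambda gpa: gpa * 1_000_000_000,
    "nm": lambda nm: nm / 1_000_000_000,
    "microns": lambda microns: microns / 1_000_000,
}


def converters_for(columns, unit_converters):
    """Map each (label, unit) column tuple to the converter for its unit.

    Columns whose unit has no registered converter are left out, so
    read_excel would leave their values unchanged.
    """
    return {
        (label, unit): unit_converters[unit]
        for label, unit in columns
        if unit in unit_converters
    }


# Hypothetical two-level column labels (label, unit)
columns = [("width", "nm"), ("depth", "microns"), ("sample", "id")]
converters = converters_for(columns, UNIT_CONVERTERS)

print(sorted(converters))
```

The resulting dictionary has the same shape as the hand-written one above and could be passed as converters=converters together with header=[0, 1].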