Python looping error counter
Question:
I’m trying to drop columns in my DataFrame, and I would like to ask why I can’t iterate over a Series inside my function. Here is my code:
def checkDropVariance(df, column):
    percentage = df.groupby(column).size().sort_values(ascending=False)/len(df) * 100
    mean = percentage.mean()
    N = len(percentage)
    variance = 0
    for i in range(N):
        variance = variance + ((percentage[i]) - mean) ** 2
    variance = variance/N
    if variance > 10:
        df = dropCol(df, column)
    return df
However outside the function, if I do something like:
percentage = df.groupby('grade').size().sort_values(ascending=False)/len(df) * 100
percentage
percentage[2]
I get
grade
B 28.822392
C 27.705086
A 16.809648
D 15.621800
E 8.012288
F 2.412106
G 0.616680
dtype: float64
16.809648424166571
Inside the function, however, I get KeyError: 0.
I found that if I change the i in percentage[i] to 5, I get KeyError: 5.
Here is the traceback:
KeyError Traceback (most recent call last)
<ipython-input-33-2e9f3e36e2d6> in <module>()
1 for i in df.columns.values:
----> 2 df = checkDropVariance(df, i)
<ipython-input-32-126f83f240cc> in checkDropVariance(df, column)
5 variance = 0
6 for i in range(N):
----> 7 variance = variance + ((percentage[i]) - mean) ** 2
8 variance = variance/N
9 if variance > 10:
/home/atmaja/anaconda3/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
599 key = com._apply_if_callable(key, self)
600 try:
--> 601 result = self.index.get_value(self, key)
602
603 if not is_scalar(result):
/home/atmaja/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
2426 try:
2427 return self._engine.get_value(s, k,
-> 2428 tz=getattr(series.dtype, 'tz', None))
2429 except KeyError as e1:
2430 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4363)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4046)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13913)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13857)()
KeyError: 0
Thank you for your time
Answers:
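The KeyError comes from pandas because percentage[i] looks up the label i (an entry in the index of a Series, or a column name of a DataFrame), not the item at position i. If you want the i-th item, use .iloc, as detailed in the docs. Plain integer indexing only matches when the labels themselves are those integers. Your percentage[2] test worked because the grade index holds strings, so your pandas version fell back to positional lookup; the Int64HashTable frames in the traceback suggest the failing column has integer values, so 0 is treated as a label that does not exist.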
For example, with a DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(3,3))
print(df)
print(df[2])
gives:
          0         1         2
0  0.727617  0.920699  0.916352
1  0.985916  0.405609  0.123758
2  0.230229  0.981319  0.182571
0    0.916352
1    0.123758
2    0.182571
But running that code with df = pd.DataFrame(np.random.rand(3,3),columns=['A','B','C']) will raise a KeyError, because no column is named 2.
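The same label-versus-position distinction applies to a Series, which is what percentage is in the question. The following is a minimal sketch on made-up data (the values and index labels are assumptions, not taken from the question):

```python
import pandas as pd

# Hypothetical data: a Series whose index labels are integers that do not include 0.
counts = pd.Series([50.0, 30.0, 20.0], index=[36, 60, 48])

# Label-based lookup: 36 is one of the index labels.
by_label = counts[36]

# Position-based lookup: .iloc ignores the labels and takes the first item.
by_position = counts.iloc[0]

# 0 is not a label in the index, so plain [] indexing raises KeyError.
try:
    counts[0]
    raised = False
except KeyError:
    raised = True

print(by_label, by_position, raised)
```

Both lookups return the first value here, but only .iloc keeps working when the labels change.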
As you can see from the stack trace, the error occurs on the line:
variance = variance + ((percentage[i]) - mean) ** 2
This is because percentage[i] is pandas’ way of saying: give me the entry of percentage whose index label is i. Note that percentage is a Series, not a DataFrame, so the lookup goes through its index. But i is 0 in this case, and the index of percentage contains no label 0, so you get a KeyError.
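A minimal sketch of the smallest possible fix keeps the loop from the question and swaps the label lookup for a positional one with .iloc. The helper name percentage_variance and the sample data are assumptions made for illustration:

```python
import pandas as pd

def percentage_variance(df, column):
    """Population variance of the percentage share of each distinct value."""
    percentage = df.groupby(column).size().sort_values(ascending=False) / len(df) * 100
    mean = percentage.mean()
    n = len(percentage)
    variance = 0.0
    for i in range(n):
        # .iloc selects by position, so it works whatever the index labels are.
        variance += (percentage.iloc[i] - mean) ** 2
    return variance / n

# Hypothetical data: an integer-valued column, the case that triggers KeyError: 0
# when percentage[i] is used instead of percentage.iloc[i].
df = pd.DataFrame({"term": [36, 36, 36, 60]})
result = percentage_variance(df, "term")
print(result)
```

Here the shares are 75% and 25%, so the mean is 50 and the variance is 625.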
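It seems you are not yet comfortable with pandas. Pandas has a built-in variance method, Series.var(). You could write a function that drops a column whose variance is higher than 10 like this. Note that this measures the sample variance of the column’s own values, which is not the same quantity as the variance of the category percentages in your code, and it only works for numeric columns: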
def checkDropVariance(df, column):
    # get the variance of column data
    v = df[column].var()
    # drop the column if the variance is higher than 10
    if v > 10:
        df = df.drop(column, axis=1)
    return df
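If the goal is the variance of the percentage shares, as in the loop from the question, a vectorized sketch might look like the following. The function name drop_if_share_variance_high, the threshold parameter, and the sample data are assumptions; ddof=0 is chosen so the result divides by N as the original loop does:

```python
import pandas as pd

def drop_if_share_variance_high(df, column, threshold=10):
    # Percentage share of each distinct value, relative to the number of rows.
    percentage = df[column].value_counts() / len(df) * 100
    # ddof=0 gives the population variance (divide by N), matching the question's loop.
    if percentage.var(ddof=0) > threshold:
        df = df.drop(columns=column)
    return df

# Hypothetical data: "grade" is unevenly distributed, "id" is perfectly even.
df = pd.DataFrame({"grade": list("AAAB"), "id": [1, 2, 3, 4]})
out = df
for col in df.columns:
    out = drop_if_share_variance_high(out, col)

print(list(out.columns))
```

No explicit indexing is needed, so the label-versus-position problem never arises.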
The pandas documentation is great; I would recommend reading through it.