Python looping error counter

Question:

I’m trying to drop columns in my DataFrame, and I would like to ask why I can’t iterate over a Series inside my function. Here is my code:

def checkDropVariance(df, column):
    percentage = df.groupby(column).size().sort_values(ascending=False)/len(df) * 100
    mean = percentage.mean()
    N = len(percentage)
    variance = 0
    for i in range(N):
        variance = variance + ((percentage[i]) - mean) ** 2
    variance = variance/N
    if variance > 10:
        df = dropCol(df, column)
    return df

However outside the function, if I do something like:

percentage = df.groupby('grade').size().sort_values(ascending=False)/len(df) * 100
percentage
percentage[2]

I get

grade
B    28.822392
C    27.705086
A    16.809648
D    15.621800
E     8.012288
F     2.412106
G     0.616680
dtype: float64

16.809648424166571

Inside the function, however, I get KeyError: 0.
If I change the i in percentage[i] to 5, I get KeyError: 5.
Here is the full traceback:

KeyError                                  Traceback (most recent call last)
<ipython-input-33-2e9f3e36e2d6> in <module>()
      1 for i in df.columns.values:
----> 2     df = checkDropVariance(df, i)

<ipython-input-32-126f83f240cc> in checkDropVariance(df, column)
      5     variance = 0
      6     for i in range(N):
----> 7         variance = variance + ((percentage[i]) - mean) ** 2
      8     variance = variance/N
      9     if variance > 10:

/home/atmaja/anaconda3/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
    599         key = com._apply_if_callable(key, self)
    600         try:
--> 601             result = self.index.get_value(self, key)
    602 
    603             if not is_scalar(result):

/home/atmaja/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   2426         try:
   2427             return self._engine.get_value(s, k,
-> 2428                                           tz=getattr(series.dtype, 'tz', None))
   2429         except KeyError as e1:
   2430             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4363)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4046)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13913)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13857)()

KeyError: 0

Thank you for your time

Asked By: Prajogo Atmaja

||

Answers:

The KeyError comes from pandas because percentage[i] attempts to access the entry labelled “i” (not the entry at position i). If you want to access the i’th entry by position, use .iloc, as detailed in the docs. That is, unless the labels are themselves integers that include i, in which case it should work. DataFrame columns behave the same way.

I.e.,

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3,3))
print(df)

print(df[2])

Gives

          0         1         2
0  0.727617  0.920699  0.916352
1  0.985916  0.405609  0.123758
2  0.230229  0.981319  0.182571

0    0.916352
1    0.123758
2    0.182571

But running that code with df = pd.DataFrame(np.random.rand(3,3),columns=['A','B','C']) will yield a KeyError.
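The same distinction applies to a Series such as percentage in the question. The following is a minimal sketch with made-up data (the column name term and its values are assumptions for illustration): when the grouped column holds integers, the resulting index holds integer labels, so percentage[0] is a label lookup and fails if no label 0 exists, while .iloc always goes by position. With a string index such as grade, the pandas version shown in the traceback falls back to positional lookup for an integer key, which is presumably why percentage[2] worked outside the function.

```python
import pandas as pd

# Hypothetical data: an integer-valued column, so groupby yields an integer index.
df = pd.DataFrame({"term": [36, 36, 60, 36, 60, 12]})
percentage = df.groupby("term").size().sort_values(ascending=False) / len(df) * 100

# The index holds the labels 36, 60 and 12; there is no label 0,
# so a plain [] lookup raises KeyError.
try:
    percentage[0]
    found = True
except KeyError:
    found = False

# .iloc ignores the labels and selects by position.
first = percentage.iloc[0]
print(found, first)
```

Using .iloc (or iterating over the values directly) behaves the same regardless of what the index labels happen to be.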

Answered By: Antoine Zambelli

As you can see from the stack trace, the error occurs on this line:

variance = variance + ((percentage[i]) - mean) ** 2

This is because percentage[i] is pandas’ way of saying: give me the entry of the Series percentage whose label is i. But i is 0 in this case, and percentage has no label 0. So you’re getting a KeyError.

It seems you are not quite grasping how to use pandas. Pandas has a built-in variance function. You could make a function that drops columns with variance higher than 10 like this:

def checkDropVariance(df, column):
    # get the variance of column data
    v = df[column].var()
    # drop the column if the variance is higher than 10
    if v > 10:
        df = df.drop(column, axis=1)
    return df
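Note that df[column].var() measures the sample variance of the raw column values, which is not quite what the loop in the question computes (the population variance of the per-category percentages). If that original quantity is what is wanted, a sketch along these lines may be closer; the function name check_drop_variance, the threshold parameter, and the sample data are assumptions made for illustration.

```python
import pandas as pd

def check_drop_variance(df, column, threshold=10):
    # Share of rows per distinct value, in percent (same expression as the question).
    percentage = df.groupby(column).size() / len(df) * 100
    # ddof=0 divides by N, matching the manual loop in the question.
    variance = percentage.var(ddof=0)
    if variance > threshold:
        df = df.drop(column, axis=1)
    return df

# Hypothetical data: "grade" is unevenly distributed, "flag" is evenly split.
df = pd.DataFrame({"grade": list("AAAB"), "flag": list("xyxy")})
result = df
for col in df.columns:
    result = check_drop_variance(result, col)
print(list(result.columns))
```

Because the variance is computed on the whole Series at once, no positional or label-based indexing is needed, which sidesteps the KeyError entirely.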

The pandas documentation is great; I would recommend reading through it.

Answered By: jeffery_the_wind