How to get the first value from pandas value_counts()

Question:

I am writing a program to discretize a set of attributes via entropy discretization. The goal is to parse the dataset

A,Class
5,1
12.5,1
11.5,2
8.6,2
7,1
6,1
5.9,2
1.5,2
9,2
7.8,1
2.1,1
13.5,2
12.45,2

into

A,Class
1,1
3,1
3,2
2,2
2,1
2,1
1,2
1,2
3,2
2,1
1,1
3,2
3,2

The specific problem I am facing is determining the number of classes in my dataset. This happens at numberOfClasses = s['Class'].value_counts(). I would like to use a pandas method that returns the number of distinct classes; in this example there are only two. However, I get back

Number of classes: 2    5
1    4

From the print statement.

import pandas as pd
import numpy as np
import entropy_based_binning as ebb
from math import log2

def main():
    df = pd.read_csv('S1.csv')
    s = df
    s = entropy_discretization(s)

# This method discretizes attribute A of s.
# A partition stops splitting if the information gain is 0
# (i.e. the number of distinct classes is 1), or if
# min f / max f < 0.5 and the number of distinct values is floor(n/2).
def entropy_discretization(s):

    informationGain = {}
    # while(uniqueValue(s)):
    # Step 1: pick a threshold
    threshold = 6

    # Step 2: Partition the data set into two partitions
    s1 = s[s['A'] < threshold]
    print("s1 after splitting")
    print(s1)
    print("******************")
    s2 = s[s['A'] >= threshold]
    print("s2 after splitting")
    print(s2)
    print("******************")
        
    # Step 3: calculate the information gain.
    informationGain = information_gain(s1,s2,s)

    print(informationGain)

    # # Step 5: calculate the max information gain
    # minInformationGain = min(informationGain)

    # # Step 6: keep the partitions of S based on the value of threshold_i
    # s = bestPartition(minInformationGain, s)

def uniqueValue(s):
    # Return False if every record in s has the same value of A (nothing left to split)
    if s.nunique()['A'] == 1:
        return False
    # Otherwise return True
    else:
        return True

def bestPartition(maxInformationGain):
    # determine the best threshold_i
    threshold_i = 6

    return


def information_gain(s1, s2, s):
    # calculate cardinality for s1
    cardinalityS1 = len(pd.Index(s1['A']).value_counts())
    print(f'The Cardinality of s1 is: {cardinalityS1}')
    # calculate cardinality for s2
    cardinalityS2 = len(pd.Index(s2['A']).value_counts())
    print(f'The Cardinality of s2 is: {cardinalityS2}')
    # calculate cardinality of s
    cardinalityS = len(pd.Index(s['A']).value_counts())
    print(f'The Cardinality of s is: {cardinalityS}')
    # calculate informationGain
    informationGain = (cardinalityS1/cardinalityS) * entropy(s1) + (cardinalityS2/cardinalityS) * entropy(s2)
    print(f'The total informationGain is: {informationGain}')
    return informationGain



def entropy(s):
    # calculate the number of classes in s
    numberOfClasses = s['Class'].value_counts()
    print(f'Number of classes: {numberOfClasses}')
    # TODO calculate pi for each class.
    # calculate the frequency of class_i in S1
    p1 = 2/4
    p2 = 3/4
    ent = -(p1*log2(p1)) - (p2*log2(p2))

    return ent 

main()

Ideally, I’d like to print Number of classes: 2. This way I can loop over the classes and calculate the frequencies for the attribute A from the dataset. I’ve reviewed the pandas documentation, but I got stuck at value_counts(). What can I try next?

Asked By: Evan Gertis

||

Answers:

Maybe try:

number_of_classes = len(s['Class'].unique())

which will return the number of unique classes in the column Class.
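
To see why the print statement in the question shows a table rather than a single number, here is a minimal sketch. It assumes the Class column of the sample data, inlined as a Series so it runs without S1.csv; the names classes, most_common_label and most_common_count are only illustrative. value_counts() returns a Series with one row per class, so the number of classes is its length, and the "first value" is the count of the most frequent class.

```python
import pandas as pd

# Class column from the question's sample data, inlined so the sketch is self-contained.
classes = pd.Series([1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2], name='Class')

# value_counts() returns a Series: index = class label, value = number of rows.
counts = classes.value_counts()
print(counts)

# unique() returns an array of the distinct labels, so its length is the class count.
number_of_classes = len(classes.unique())
print(f'Number of classes: {number_of_classes}')

# value_counts() sorts by count in descending order by default,
# so position 0 holds the most frequent class.
most_common_label = counts.index[0]
most_common_count = counts.iloc[0]
print(f'Most frequent class: {most_common_label} ({most_common_count} rows)')
```

Note that len(counts) gives the same number of classes here, since value_counts() lists each distinct label once.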

Or even shorter:

s['Class'].nunique()
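
If the next step is to loop over the classes and compute their frequencies, value_counts(normalize=True) may already do most of the work. The following is a rough sketch rather than a drop-in replacement for the entropy function in the question: class_entropy is a hypothetical helper name, and s1 is assumed to be the partition with A < 6 from the sample data.

```python
from math import log2

import pandas as pd


def class_entropy(class_column):
    """Shannon entropy (in bits) of the class labels in a pandas Series."""
    # normalize=True returns relative frequencies instead of raw counts.
    probabilities = class_column.value_counts(normalize=True)
    # For a non-categorical column only labels that occur are listed,
    # so every p is greater than 0 and log2(p) is defined.
    return -sum(p * log2(p) for p in probabilities)


# Assumed partition: the rows with A < 6 from the question's data.
s1 = pd.DataFrame({'A': [5, 5.9, 1.5, 2.1], 'Class': [1, 2, 2, 1]})

print(f"Number of classes: {s1['Class'].nunique()}")
print(f"Entropy: {class_entropy(s1['Class'])}")
```

With two rows of each class in this partition, the frequencies are 0.5 and 0.5, which is the maximum-entropy case for two classes.
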
Answered By: AloneTogether