Pandas: extended 'describe' that also counts null values
Question:
I have a large DataFrame with 450 columns and 550,000 rows.
The columns are:
- 73 float columns
- 30 date columns
- the remaining columns are of object dtype
I would like to describe my variables, not only with the usual describe output but with additional statistics in the same matrix. The final result would be one description matrix covering all 450 variables, with the following details for each:
– dtype
– count
– count of null values
– % of null values
– max
– min
– 50%
– 75%
– 25%
– ……
For now, I only have a basic call that describes my data like this:
Dataframe.describe(include = 'all')
Is there a function or method that produces this more extensive description?
Thanks.
Answers:
You need to compute the extra statistics yourself for each Series and then add them as rows to the final describe DataFrame:
Note: the first row of the final df is count, computed with the count function, which counts non-NaN values.
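import numpy as np
import pandas as pd

# Sample frame: two object columns, one float column with NaNs, three int columns
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, np.nan, np.nan, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
print(df)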
   A    B  C  D  E  F
0  a  4.0  7  1  5  a
1  b  NaN  8  3  3  a
2  c  NaN  9  5  6  a
3  d  5.0  4  7  9  b
4  e  5.0  2  1  2  b
5  f  4.0  3  0  4  b
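# Start from the standard summary of all columns
df1 = df.describe(include='all')
# Append extra rows: dtype and total number of rows
df1.loc['dtype'] = df.dtypes
df1.loc['size'] = len(df)
# Despite its label, '% count' holds the fraction (0 to 1) of null values per column
df1.loc['% count'] = df.isnull().mean()
print(df1)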
              A         B        C        D        E       F
count         6         4        6        6        6       6
unique        6       NaN      NaN      NaN      NaN       2
top           e       NaN      NaN      NaN      NaN       b
freq          1       NaN      NaN      NaN      NaN       3
mean        NaN       4.5      5.5  2.83333  4.83333     NaN
std         NaN   0.57735  2.88097  2.71416  2.48328     NaN
min         NaN         4        2        0        2     NaN
25%         NaN         4     3.25        1     3.25     NaN
50%         NaN       4.5      5.5        2      4.5     NaN
75%         NaN         5     7.75      4.5     5.75     NaN
max         NaN         5        9        7        9     NaN
dtype    object   float64    int64    int64    int64  object
size          6         6        6        6        6       6
% count       0  0.333333        0        0        0       0
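The question also asks for the number of nulls and their percentage as separate rows. A minimal sketch of how that could be wrapped up, assuming a helper named extended_describe (a hypothetical name, not a pandas function) and row labels chosen only for illustration:

```python
import numpy as np
import pandas as pd


def extended_describe(df):
    """Return describe(include='all') plus dtype, size and null rows.

    extended_describe is a hypothetical helper, not part of pandas.
    """
    # Cast to object so rows of mixed types (strings, ints, floats) can be added
    summary = df.describe(include='all').astype(object)
    null_count = df.isnull().sum()
    summary.loc['dtype'] = df.dtypes.astype(str)
    summary.loc['size'] = len(df)
    summary.loc['null count'] = null_count
    summary.loc['null %'] = null_count / len(df) * 100
    return summary


df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, np.nan, np.nan, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
})
result = extended_describe(df)
print(result.loc[['count', 'null count', 'null %']])
```

With 450 columns the result is very wide; result.T gives one row per variable, which is usually easier to read.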
pandas has no alternative function to describe(), and on its own it does not display all the values you need. You can, however, adjust its output with the parameters of describe().
By default, describe() on a DataFrame only summarizes numeric columns. If you think you have a numeric variable and it doesn't show up in describe(), change the type with:
df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)
You could also create new columns for handling the numeric part of a mixed-type column, or convert strings to numbers using a dictionary and the map() function.
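As an illustration of these conversion options, here is a small sketch. The column names and the mapping dictionary are made up for the example, and pd.to_numeric with errors='coerce' is shown as one possible alternative to astype(float) when some strings are not valid numbers:

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, plus an ordinal text column
df = pd.DataFrame({
    'price': ['1.5', '2.0', 'n/a', '4.25'],
    'grade': ['low', 'high', 'medium', 'low'],
})

# astype(float) would raise on 'n/a'; to_numeric with errors='coerce'
# turns values that cannot be parsed into NaN instead
df['price_num'] = pd.to_numeric(df['price'], errors='coerce')

# Convert strings to numbers using a dictionary and map()
grade_codes = {'low': 0, 'medium': 1, 'high': 2}
df['grade_num'] = df['grade'].map(grade_codes)

# The default describe() now picks up the two new numeric columns
stats = df.describe()
print(stats)
```

Values coerced to NaN are excluded from count, so comparing count with len(df) shows how many entries failed to convert.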
describe() on a non-numeric Series gives you some statistics (such as count, unique, and the most frequently occurring value).
To call describe() on just the object (string) columns, use describe(include=['O']).
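Since the frame in the question mixes float, date and object columns, the include parameter can be used to describe each group separately. A short sketch with made-up column names; the string column is created with dtype=object explicitly so that include=['O'] selects it, and the exact rows returned for datetime columns depend on the pandas version:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one column of each kind mentioned in the question
df = pd.DataFrame({
    'name': pd.Series(['a', 'b', 'b', None], dtype=object),
    'value': [1.0, 2.5, np.nan, 4.0],
    'when': pd.to_datetime(['2020-01-01', '2020-06-01', None, '2021-01-01']),
})

# One summary per dtype group
numeric_stats = df.describe(include=[np.number])
object_stats = df.describe(include=['O'])
date_stats = df.describe(include=['datetime'])

print(numeric_stats)
print(object_stats)
print(date_stats)
```

In each summary, count only includes non-null values, so missing entries in a column show up as a count lower than len(df).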