Pandas: extended 'describe' that also counts null values

Question:

I have a large DataFrame with 450 columns and 550,000 rows.
The columns are:

  • 73 float columns
  • 30 date columns
  • the remaining columns are of object dtype

I would like to produce a description of my variables that goes beyond the usual describe output and includes additional statistics in the same table. The final result would be one description matrix covering all 450 variables, with the following details for each:
– dtype
– count
– count of null values
– % of null values
– max
– min
– 50%
– 75%
– 25%
– ……

For now, I just have a basic call that describes my data like this:

Dataframe.describe(include = 'all')

Is there a function or method that produces this more extensive description?

Thanks.

Asked By: Ib D


Answers:

You need to compute the extra statistics yourself and add them as rows to the DataFrame returned by describe:

Note:

The first row of the final df is count, which uses the count function and therefore counts only non-NaN values.

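import numpy as np
import pandas as pd

# Sample data: column B contains two missing values
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, np.nan, np.nan, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})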

print (df)
   A    B  C  D  E  F
0  a  4.0  7  1  5  a
1  b  NaN  8  3  3  a
2  c  NaN  9  5  6  a
3  d  5.0  4  7  9  b
4  e  5.0  2  1  2  b
5  f  4.0  3  0  4  b

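# Start from the standard summary of every column
df1 = df.describe(include='all')

# Append extra rows: dtype, total number of rows, and share of null values
df1.loc['dtype'] = df.dtypes
df1.loc['size'] = len(df)
df1.loc['% count'] = df.isnull().mean()  # fraction (0 to 1) of nulls per column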

print (df1)
              A         B        C        D        E       F
count         6         4        6        6        6       6
unique        6       NaN      NaN      NaN      NaN       2
top           e       NaN      NaN      NaN      NaN       b
freq          1       NaN      NaN      NaN      NaN       3
mean        NaN       4.5      5.5  2.83333  4.83333     NaN
std         NaN   0.57735  2.88097  2.71416  2.48328     NaN
min         NaN         4        2        0        2     NaN
25%         NaN         4     3.25        1     3.25     NaN
50%         NaN       4.5      5.5        2      4.5     NaN
75%         NaN         5     7.75      4.5     5.75     NaN
max         NaN         5        9        7        9     NaN
dtype    object   float64    int64    int64    int64  object
size          6         6        6        6        6       6
% count       0  0.333333        0        0        0       0
Answered By: jezrael
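
Building on the answer above, here is a minimal sketch of a reusable helper that also adds the null count the question asks for. The function name extended_describe and the row labels 'null count' and '% null' are hypothetical names chosen for illustration, not part of pandas.

```python
import numpy as np
import pandas as pd


def extended_describe(df):
    """describe(include='all') plus dtype, size, null count and null percentage rows."""
    # One column per extra statistic, then transpose so each statistic becomes a row
    extra = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "size": len(df),
        "null count": df.isnull().sum(),
        "% null": df.isnull().mean() * 100,
    }).T
    # Stack the extra rows under the standard describe() rows
    return pd.concat([df.describe(include="all"), extra])


df = pd.DataFrame({
    "A": list("abcdef"),
    "B": [4, np.nan, np.nan, 5, 5, 4],
    "C": [7, 8, 9, 4, 2, 3],
})

summary = extended_describe(df)
# Transposing gives one row per variable, which is easier to read with many columns
print(summary.T)
```

With 450 columns, the transposed form (one row per variable) is probably the more practical layout to inspect or export.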

pandas has no built-in alternative to describe(), and on its own it does not display all the values that you need. You can, however, adjust its output with parameters such as include, exclude and percentiles.

By default, describe() on a DataFrame summarizes only the numeric columns. If you think a column is numeric and it doesn’t show up in describe(), change its type with:

df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)

You could also create new columns for handling the numeric part of a mixed-type column, or convert strings to numbers using a dictionary and the map() function.
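
As a rough illustration of both ideas, the sketch below maps string labels to numbers with a dictionary and, as one possible way to isolate the numeric part of a mixed column, uses pd.to_numeric with errors="coerce". The column names, the sample values and the level_codes mapping are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "level": ["small", "large", "medium", "small"],  # hypothetical labels
    "price": ["10", "12.5", "unknown", "8"],         # mixed numeric / text values
})

# Convert strings to numbers using a dictionary and map()
level_codes = {"small": 1, "medium": 2, "large": 3}
df["level_num"] = df["level"].map(level_codes)

# New column holding only the numeric part; unparseable entries become NaN
df["price_num"] = pd.to_numeric(df["price"], errors="coerce")

# Both new columns are numeric, so the default describe() now includes them
print(df.describe())
```

Unlike astype(float), which raises an error on a value such as "unknown", the coercing conversion keeps going and leaves a NaN that then shows up in the null counts.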

describe() on a non-numeric Series will give you some statistics (like count, unique and the most frequently-occurring value).

To call describe() on just the object (string) columns, use describe(include = ['O']).
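
A small sketch of selecting columns by dtype in describe() follows. The sample frame is made up, and the string columns are created with dtype=object explicitly, on the assumption that some pandas versions may otherwise infer a dedicated string dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": pd.Series(list("abcdef"), dtype=object),
    "C": [7, 8, 9, 4, 2, 3],
    "F": pd.Series(list("aaabbb"), dtype=object),
})

# Only object columns: rows are count, unique, top, freq
object_summary = df.describe(include=["O"])

# Only numeric columns: rows are count, mean, std, min, quartiles, max
numeric_summary = df.describe(include=[np.number])

print(object_summary)
print(numeric_summary)
```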

Answered By: K. Aslam