Applying `pd.qcut` on multiple columns

Question

I have a DataFrame containing 2 columns x and y that represent coordinates in a Cartesian system. I want to obtain groups with an even(or almost even) number of points. I was thinking about using pd.qcut() but as far as I can tell it can be applied only to 1 column.

For example, I would like to divide the whole set of points with 4 intervals in x and 4 intervals in y (numbers might not be equal) so that I would have roughly even number of points. I expect to see 16 intervals in total (4×4).

I tried a very direct approach which obviously didn’t produce the right result (look at 51 and 99 for example). Here is the code:

df['x_bin']=pd.qcut(df.x,4)                                                
df['y_bin']=pd.qcut(df.y,4)                                                
grouped=df.groupby([df.x_bin,df.y_bin]).count()                                       
print(grouped)

The output:

x_bin                      y_bin                                 
(7.976999999999999, 7.984] (-219.17600000000002, -219.17]  51  51
                           (-219.17, -219.167]             60  60
                           (-219.167, -219.16]             64  64
                           (-219.16, -219.154]             99  99
(7.984, 7.986]             (-219.17600000000002, -219.17]  76  76
                           (-219.17, -219.167]             81  81
                           (-219.167, -219.16]             63  63
                           (-219.16, -219.154]             53  53
(7.986, 7.989]             (-219.17600000000002, -219.17]  78  78
                           (-219.17, -219.167]             77  77
                           (-219.167, -219.16]             68  68
                           (-219.16, -219.154]             51  51
(7.989, 7.993]             (-219.17600000000002, -219.17]  70  70
                           (-219.17, -219.167]             55  55
                           (-219.167, -219.16]             77  77
                           (-219.16, -219.154]             71  71

Am I making a mistake in thinking it is possible to do with pandas only or am I missing something else?

Asked By: Alex.Kh

||

Source

Answer 1

The problem is that the distribution of the rows might not be the same according to x than according to y.

You are empirically mimicking a correlation analysis and finding out that there is slight negative relation… the y values are higher in the lower end of x scale and rather flat on the higher end of x.

So, if you want even number of datapoints on each bin I would suggest splitting the df into x bins and then applying qcut on y for each x bin ( so y bins have different cut points but even sample size)

Edit

Something like:

split_df = [(xbin,  xdf) for xbin, xdf in df.groupby(pd.qcut(df.x, 4))] # no aggregation so far, just splitting the df evenly on x

split_df = [(xbin, xdf.groupby(pd.qcut(xdf.y)).x.size())
    for xbin, xdf in split_df] # now each xdf is evenly cut on y

Now you need to work on each xdf separately. Attempting to concatenate all xdfs will result in an error. Index for xdfs is a CategoricalIndex, and the first xdf needs to have all categories for concat to work (i.e. split_df[0][1].index must include the bins of all other xdfs). Or you could change the Index to the center of the interval as a float64 on both xbins and ybins.

Answered By: RichieV

Applying `pd.qcut` on multiple columns

Question:

Answers: