Applying `pd.qcut` on multiple columns
Question:
I have a DataFrame containing 2 columns x
and y
that represent coordinates in a Cartesian system. I want to obtain groups with an even(or almost even) number of points. I was thinking about using pd.qcut()
but as far as I can tell it can be applied only to 1 column.
For example, I would like to divide the whole set of points with 4 intervals in x and 4 intervals in y (numbers might not be equal) so that I would have roughly even number of points. I expect to see 16 intervals in total (4×4).
I tried a very direct approach which obviously didn’t produce the right result (look at 51 and 99 for example). Here is the code:
df['x_bin']=pd.qcut(df.x,4)
df['y_bin']=pd.qcut(df.y,4)
grouped=df.groupby([df.x_bin,df.y_bin]).count()
print(grouped)
The output:
x_bin y_bin
(7.976999999999999, 7.984] (-219.17600000000002, -219.17] 51 51
(-219.17, -219.167] 60 60
(-219.167, -219.16] 64 64
(-219.16, -219.154] 99 99
(7.984, 7.986] (-219.17600000000002, -219.17] 76 76
(-219.17, -219.167] 81 81
(-219.167, -219.16] 63 63
(-219.16, -219.154] 53 53
(7.986, 7.989] (-219.17600000000002, -219.17] 78 78
(-219.17, -219.167] 77 77
(-219.167, -219.16] 68 68
(-219.16, -219.154] 51 51
(7.989, 7.993] (-219.17600000000002, -219.17] 70 70
(-219.17, -219.167] 55 55
(-219.167, -219.16] 77 77
(-219.16, -219.154] 71 71
Am I making a mistake in thinking it is possible to do with pandas
only or am I missing something else?
Answers:
The problem is that the distribution of the rows might not be the same according to x than according to y.
You are empirically mimicking a correlation analysis and finding out that there is slight negative relation… the y values are higher in the lower end of x scale and rather flat on the higher end of x.
So, if you want even number of datapoints on each bin I would suggest splitting the df into x bins and then applying qcut on y for each x bin ( so y bins have different cut points but even sample size)
Edit
Something like:
split_df = [(xbin, xdf) for xbin, xdf in df.groupby(pd.qcut(df.x, 4))] # no aggregation so far, just splitting the df evenly on x
split_df = [(xbin, xdf.groupby(pd.qcut(xdf.y)).x.size())
for xbin, xdf in split_df] # now each xdf is evenly cut on y
Now you need to work on each xdf separately. Attempting to concatenate all xdfs will result in an error. Index for xdfs is a CategoricalIndex, and the first xdf needs to have all categories for concat to work (i.e. split_df[0][1].index must include the bins of all other xdfs). Or you could change the Index to the center of the interval as a float64
on both xbins and ybins.
I have a DataFrame containing 2 columns x
and y
that represent coordinates in a Cartesian system. I want to obtain groups with an even(or almost even) number of points. I was thinking about using pd.qcut()
but as far as I can tell it can be applied only to 1 column.
For example, I would like to divide the whole set of points with 4 intervals in x and 4 intervals in y (numbers might not be equal) so that I would have roughly even number of points. I expect to see 16 intervals in total (4×4).
I tried a very direct approach which obviously didn’t produce the right result (look at 51 and 99 for example). Here is the code:
df['x_bin']=pd.qcut(df.x,4)
df['y_bin']=pd.qcut(df.y,4)
grouped=df.groupby([df.x_bin,df.y_bin]).count()
print(grouped)
The output:
x_bin y_bin
(7.976999999999999, 7.984] (-219.17600000000002, -219.17] 51 51
(-219.17, -219.167] 60 60
(-219.167, -219.16] 64 64
(-219.16, -219.154] 99 99
(7.984, 7.986] (-219.17600000000002, -219.17] 76 76
(-219.17, -219.167] 81 81
(-219.167, -219.16] 63 63
(-219.16, -219.154] 53 53
(7.986, 7.989] (-219.17600000000002, -219.17] 78 78
(-219.17, -219.167] 77 77
(-219.167, -219.16] 68 68
(-219.16, -219.154] 51 51
(7.989, 7.993] (-219.17600000000002, -219.17] 70 70
(-219.17, -219.167] 55 55
(-219.167, -219.16] 77 77
(-219.16, -219.154] 71 71
Am I making a mistake in thinking it is possible to do with pandas
only or am I missing something else?
The problem is that the distribution of the rows might not be the same according to x than according to y.
You are empirically mimicking a correlation analysis and finding out that there is slight negative relation… the y values are higher in the lower end of x scale and rather flat on the higher end of x.
So, if you want even number of datapoints on each bin I would suggest splitting the df into x bins and then applying qcut on y for each x bin ( so y bins have different cut points but even sample size)
Edit
Something like:
split_df = [(xbin, xdf) for xbin, xdf in df.groupby(pd.qcut(df.x, 4))] # no aggregation so far, just splitting the df evenly on x
split_df = [(xbin, xdf.groupby(pd.qcut(xdf.y)).x.size())
for xbin, xdf in split_df] # now each xdf is evenly cut on y
Now you need to work on each xdf separately. Attempting to concatenate all xdfs will result in an error. Index for xdfs is a CategoricalIndex, and the first xdf needs to have all categories for concat to work (i.e. split_df[0][1].index must include the bins of all other xdfs). Or you could change the Index to the center of the interval as a float64
on both xbins and ybins.