How is it that pandas is faster than pure C in the groupby operation?
Question:
I have a NumPy array of x,y pairs with shape (n,2). Knowing for certain that there are multiple values of y for each x, I wanted to calculate the average value of y for each unique x. It occurred to me that this needed a groupby operation followed by a mean, both of which are available in the pandas library. However, thinking that pandas would be slow at the scale of my data (over a million points), I wrote a simple program in C and used ctypes to call the C function and perform the operation. I used -fPIC and compiled a shared object file with GCC MinGW.
int average( int* array , int size_array , int* unique , int size_unique , float* avg ){
    if (size_array % 2 != 0){
        return 1;
    }
    for (int i = 0 ; i < size_unique ; i++){
        int curX = unique[i];
        int sum = 0;
        int count = 0;
        for (int j = 0 ; j < size_array ; j += 2){
            if ( array[j] == curX ){
                sum += (array[j+1]);
                count += 1;
            }
        }
        float average = ((float)sum / (float)count);
        avg[i] = average;
    }
    return 0;
}
Later, because the program was still slow (it took about 1.5 seconds), I gave pandas a shot and was stunned at how much faster it was. It was almost twice as fast as the program I wrote in C. That didn’t make any sense to me. How did they achieve this level of performance? Is pandas using a hashtable?
import numpy as np
import pandas as pd

ar = np.random.randint(0, 2000, size=(40000, 2))
df = pd.DataFrame({'x': ar[:, 0], 'y': ar[:, 1]})
# Group by x and average y, as described above
df = df.groupby('x', as_index=False)['y'].mean()
x = df[['x']].to_numpy()
y = df[['y']].to_numpy()
I calculated that for an array of size (40000,2) and 2000 unique elements, I had about 80,000,000 operations, which were done in less than 0.2 s. So each operation takes about 2.5 nanoseconds, which is close to my processor’s limit (I have a 3.5 GHz quad-core CPU, an Intel i7 4720HQ). So I’m already pushing the CPU with the C code. How is it that pandas pushes it even further?
As I mentioned above, I compiled the C code with GCC MinGW using the following command:
gcc -fPIC -shared -o c_out.so c_in.c
I used Python’s time library to estimate the runtime of the code. I took two timestamps, one before the code (t1) and one after it (t2), and printed the difference between t1 and t2 as the runtime.
Samples of the benchmark (times in seconds) for 40000 data points with values ranging from 0 to 2448 are as follows:
Run   Pandas   C
#1    0.0623   0.2250
#2    0.0660   0.1880
#3    0.609    0.2261
#4    0.0629   0.2488
#5    0.0619   0.2159
Answers:
I don’t know pandas, but I’m pretty sure that the developers have spent quite some time optimizing the implementation. There are several ways to optimize code. Selecting a good algorithm for the job is one.
Your C implementation isn’t optimized. On the contrary, it’s a very simple brute-force implementation that iterates over the main array once for every unique value.
So you are comparing things that can’t really be compared, i.e. a simple C implementation and a most likely optimized pandas implementation. It’s no wonder that the pandas implementation wins.
Also it’s important to compile your C code with optimization (typically -O2). It’s not clear whether you did that.
I tested your implementation of average on my server and compiled it using gcc -Wall -Wextra -Werror prog.c. The result was:
Using average, time: 0.2207540000000000
Then I turned on optimization, i.e. gcc -Wall -Wextra -Werror -O2 prog.c. The result was:
Using average, time: 0.0833040000000000
Nearly 3 times faster, just by using -O2 optimization. The lesson is: when measuring performance, make sure to turn on a proper optimization level.
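For reference, a measurement of this kind can be sketched roughly as follows. This is only a minimal sketch and not the actual prog.c used above: the names avg_fn and time_call are made up for illustration, and using clock() (which reports CPU time) is an assumption about how such timings might be taken. The brute-force average from the question is repeated so that the sketch is self-contained.

```c
#include <assert.h>
#include <time.h>

/* Brute-force version from the question, repeated here for completeness. */
int average(int *array, int size_array, int *unique, int size_unique, float *avg)
{
    if (size_array % 2 != 0)
    {
        return 1;
    }
    for (int i = 0; i < size_unique; i++)
    {
        int sum = 0;
        int count = 0;
        for (int j = 0; j < size_array; j += 2)
        {
            if (array[j] == unique[i])
            {
                sum += array[j + 1];
                count += 1;
            }
        }
        avg[i] = (float)sum / (float)count;
    }
    return 0;
}

/* Hypothetical function-pointer type matching the signature of average(). */
typedef int (*avg_fn)(int *array, int size_array,
                      int *unique, int size_unique, float *avg);

/* Hypothetical helper: call fn once and return the elapsed CPU time in
   seconds. The return code of fn is stored in *rc for the caller to check. */
double time_call(avg_fn fn, int *array, int size_array,
                 int *unique, int size_unique, float *avg, int *rc)
{
    clock_t start = clock();
    *rc = fn(array, size_array, unique, size_unique, avg);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Building the same file once without and once with -O2, and calling time_call(average, ...) on identical input, gives the kind of before/after comparison described above.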
Next step… can your implementation be improved?
Here is a first change:
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Per-key accumulator: running sum and number of samples. */
struct accu
{
    int sum;
    int count;
};

int average_v1( int* array , int size_array , int* unique , int size_unique , float* avg )
{
    if (size_array % 2 != 0)
    {
        return 1;
    }
    struct accu * accu = calloc(size_unique, sizeof *accu);
    assert(accu != NULL);
    /* Outer loop over the data, inner loop over the unique keys. */
    for (int j = 0 ; j < size_array ; j += 2)
    {
        int flag = 0;
        for (int i = 0 ; i < size_unique ; i++)
        {
            if ( array[j] == unique[i])
            {
                accu[i].sum += array[j+1];
                ++accu[i].count;
                flag = 1;
                break; /* key found, no need to look further */
            }
        }
        if (flag == 0)
        {
            puts("Invalid input data");
            exit(1);
        }
    }
    for (int i = 0 ; i < size_unique ; i++)
    {
        avg[i] = ((float)accu[i].sum / (float)accu[i].count);
    }
    free(accu);
    return 0;
}
The idea is pretty simple:
- Swap the inner and outer loops, and break out of the inner loop as soon as the value is matched. On average this cuts the number of inner-loop iterations roughly in half.
- Use dynamically allocated memory to hold intermediate results.
Again using gcc -Wall -Wextra -Werror -O2 prog.c and running both your average and my average_v1 on the same data set gives me:
Using average, time: 0.0833580000000000
Using average_v1, time: 0.0280810000000000
So average_v1 is nearly 3 times faster!
But can we do even better?
Yes, here is a slightly more complex implementation that uses binary search to find elements in unique. Binary search requires a sorted array, so a copy of unique is made and sorted with qsort. The original array index is also saved in order to produce exactly the same output.
/* Needs the same headers as average_v1 (<assert.h>, <stdio.h>, <stdlib.h>)
   and reuses struct accu defined above. */
struct sorted_unique
{
    int value;      /* key value */
    int org_index;  /* position of this key in the caller's unique array */
};

int comp(const void * a, const void *b)
{
    const struct sorted_unique * pa = a;
    const struct sorted_unique * pb = b;
    if (pa->value > pb->value) return 1;
    if (pa->value < pb->value) return -1;
    return 0;
}

int average_v2( int* array , int size_array , int* unique , int size_unique , float* avg )
{
    if (size_array % 2 != 0)
    {
        return 1;
    }
    struct sorted_unique * psu = malloc(size_unique * sizeof *psu);
    assert(psu != NULL);
    for (int i = 0; i < size_unique; ++i)
    {
        psu[i].value = unique[i];
        psu[i].org_index = i;
    }
    qsort(psu, size_unique, sizeof *psu, comp);
    struct accu * accu = calloc(size_unique, sizeof *accu);
    assert(accu != NULL);
    for (int j = 0 ; j < size_array ; j += 2)
    {
        int low = 0;
        int high = size_unique-1;
        int mid;
        int v = array[j];
        // Binary search for v in the sorted keys
        while(1)
        {
            if (low <= high)
            {
                mid = low + (high - low)/2;
                if (psu[mid].value == v) break;
                if (v < psu[mid].value)
                {
                    high = mid-1;
                }
                else
                {
                    low = mid + 1;
                }
            }
            else
            {
                puts("Invalid input data");
                exit(1);
            }
        }
        accu[mid].sum += array[j+1];
        ++accu[mid].count;
    }
    // Write each result back at the key's original position
    for (int i = 0 ; i < size_unique ; i++)
    {
        avg[psu[i].org_index] = ((float)accu[i].sum / (float)accu[i].count);
    }
    free(accu);
    free(psu);
    return 0;
}
Now I get:
Using average, time: 0.0833030000000000
Using average_v1, time: 0.0279690000000000
Using average_v2, time: 0.0032790000000000
So average_v2 is about 25 times faster than average.
But can we do even better?
Probably… But I’ll stop here.
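As one possible next step, and purely as a sketch: if the x values are known to be non-negative and below some bound (as with the randint(0, 2000) data in the question), the key itself can be used as an index into a table of accumulators, so no search is needed at all. The function name average_v3 and the extra max_key parameter are made up for this illustration and are not part of the original interface, and I have not timed this version. Unlike the versions above, it reports invalid input through its return value instead of calling exit. For arbitrary keys, a hash table would be the usual replacement for the direct index.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch only: assumes every key x satisfies 0 <= x < max_key.
   max_key is a hypothetical extra parameter added for this illustration. */
int average_v3(const int *array, int size_array,
               const int *unique, int size_unique,
               int max_key, float *avg)
{
    if (size_array % 2 != 0 || max_key <= 0)
    {
        return 1;
    }

    long long *sum = calloc((size_t)max_key, sizeof *sum);
    int *count = calloc((size_t)max_key, sizeof *count);
    if (sum == NULL || count == NULL)
    {
        free(sum);
        free(count);
        return 1;
    }

    int status = 0;

    /* Single pass over the data: the key is the table index. */
    for (int j = 0; j < size_array && status == 0; j += 2)
    {
        int key = array[j];
        if (key < 0 || key >= max_key)
        {
            status = 1; /* key outside the assumed range */
        }
        else
        {
            sum[key] += array[j + 1];
            ++count[key];
        }
    }

    /* Emit the averages in the order given by unique. */
    for (int i = 0; i < size_unique && status == 0; ++i)
    {
        int key = unique[i];
        if (key < 0 || key >= max_key || count[key] == 0)
        {
            status = 1; /* unknown key or key without samples */
        }
        else
        {
            avg[i] = (float)sum[key] / (float)count[key];
        }
    }

    free(sum);
    free(count);
    return status;
}
```

The work is now proportional to the number of pairs plus the number of unique keys, at the price of two tables of max_key entries.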
The lesson to learn here is that C isn’t just C. The algorithm you select for implementing the function will of course impact the performance.
Your implementation is way too simple. That’s why pandas was faster. An optimized C implementation is likely to be faster than pandas, or at least as fast.
BTW:
"So I’m pushing the CPU using the C code. How is it that pandas pushes it even further?"
Well, tricks can be played to optimize e.g. cache usage, benefit from branch prediction, use special CPU instructions, etc. But I don’t think that is the main thing here. Instead, the trick is to get the same result with fewer operations by selecting a better algorithm. At least that is step 1.
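To put rough numbers on "fewer operations", the key comparisons of the three versions can be estimated for the 40000 pairs and 2000 unique values from the question. The sketch below is only a back-of-the-envelope model: the function names are made up for illustration, the estimate for average_v1 assumes every key is equally likely, and the cost of qsort in average_v2 is ignored. The counts are not predictions of run time.

```c
#include <assert.h>

/* Worst-case number of elements a binary search examines in a sorted
   array of n elements: floor(log2(n)) + 1 for n >= 1. */
long long max_probes(long long n)
{
    long long probes = 0;
    while (n > 0)
    {
        n /= 2;
        ++probes;
    }
    return probes;
}

/* average: every unique value is compared against every pair. */
long long cost_average(long long pairs, long long uniques)
{
    return pairs * uniques;
}

/* average_v1: on average the inner loop stops halfway through unique
   (assumes uniformly distributed keys). */
long long cost_average_v1(long long pairs, long long uniques)
{
    return pairs * uniques / 2;
}

/* average_v2: at most max_probes(uniques) comparisons per pair. */
long long cost_average_v2(long long pairs, long long uniques)
{
    return pairs * max_probes(uniques);
}
```

For 40000 pairs and 2000 unique values this gives 80,000,000 comparisons for average, about 40,000,000 for average_v1, and at most 440,000 for average_v2.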