How is it that pandas is faster than pure C in a groupby operation?

Question:

I have a NumPy array of x,y pairs with shape (n,2). Knowing for certain that each x has multiple values of y, I wanted to calculate the average value of y for each unique x. It occurred to me that this needed a groupby operation followed by a mean, which is available in the pandas library. However, assuming pandas would be slow at the scale of my data (over a million points), I wrote a simple program in C and used ctypes to call the C function and perform the operation. I compiled it into a shared object file with -fPIC using GCC (MinGW).

int average( int* array , int size_array , int* unique , int size_unique , float* avg ){

    if (size_array % 2 != 0){
        return 1;
    }

    for (int i = 0 ; i < size_unique ; i++){

        int curX = unique[i];
        int sum = 0;
        int count = 0;

        for (int j = 0 ; j < size_array ; j += 2){
            if ( array[j] == curX ){
                sum += (array[j+1]);
                count += 1;
            }
        }

        float average = ((float)sum / (float)count);

        avg[i] = average;

    }

    return 0;

}

Later, because the program was still slow (it took about 1.5 seconds), I gave pandas a shot and was stunned at how much faster it was: almost twice as fast as the program I wrote in C. That didn’t make any sense to me. How did they achieve this level of performance? Is pandas using a hash table?

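import numpy as np
import pandas as pd

# 40,000 random (x, y) pairs with values in [0, 2000)
ar = np.random.randint(0, 2000, size=(40000, 2))
df = pd.DataFrame({'x': ar[:, 0], 'y': ar[:, 1]})
# mean of y for each unique x
df = df.groupby('x', as_index=False)['y'].mean()
x = df[['x']].to_numpy()
y = df[['y']].to_numpy()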

I calculated that for an array of shape (40000,2) with 2000 unique elements, my C code performs about 80,000,000 comparisons (40,000 × 2,000), which it completed in about 0.2 s. So each comparison takes about 2.5 nanoseconds, which is close to my processor’s limit (I have a 3.5 GHz quad-core CPU, an Intel i7 4720HQ). So I’m pushing the CPU using the C code. How is it that pandas pushes it even further?
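
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in this post (40,000 pairs, 2,000 unique values, roughly 0.2 s); the variable names are mine.

```python
# Figures taken from the post; variable names are illustrative only.
n_points = 40_000      # number of (x, y) pairs
n_unique = 2_000       # number of distinct x values
elapsed_s = 0.2        # approximate runtime of the C version

# The brute-force C loop scans every pair once per unique value.
comparisons = n_points * n_unique
ns_per_comparison = elapsed_s / comparisons * 1e9

print(f"{comparisons:,} comparisons, {ns_per_comparison:.2f} ns each")
```

The count grows as the product of the two sizes, so the cost per comparison can be near the hardware limit while the total amount of work is still much larger than necessary.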

As I mentioned above, I compiled the C code with GCC (MinGW) using the following command:
gcc -fPIC -shared -o c_out.so c_in.c

I used Python’s time library to estimate the runtime of the code. I recorded two timestamps, one before the code (t1) and one after it (t2), and printed the difference between t2 and t1 as the runtime.
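
A minimal sketch of that timing approach is below. It assumes time.perf_counter as the clock (the post only says "time library"); time_call and workload are hypothetical stand-ins, not the code that was actually measured.

```python
import time

def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        t1 = time.perf_counter()   # timestamp before the code
        func(*args)
        t2 = time.perf_counter()   # timestamp after the code
        best = min(best, t2 - t1)
    return best

def workload(n):
    # Placeholder for the pandas or ctypes call being measured.
    return sum(range(n))

elapsed = time_call(workload, 100_000)
print(f"{elapsed:.4f}")
```

Taking the best of several runs reduces noise from other processes, which matters when the measured times are only tens of milliseconds.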

Sample benchmark results (in seconds) for 40,000 data points with values ranging from 0 to 2448 are as follows:

Run #1
    Pandas: 0.0623
    C: 0.2250

Run #2
    Pandas: 0.0660
    C: 0.1880

Run #3
    Pandas: 0.609
    C: 0.2261

Run #4
    Pandas: 0.0629
    C: 0.2488

Run #5
    Pandas: 0.0619
    C: 0.2159
Asked By: ARK1375

Answers:

I don’t know pandas, but I’m pretty sure that the developers have spent quite some time optimizing the implementation. There are several ways to optimize code; selecting a good algorithm for the job is one of them.

Your C implementation isn’t optimized. On the contrary, it’s a very simple brute-force implementation that iterates over the main array again and again, once for every unique value.

So you are comparing things that can’t really be compared, i.e. a simple C implementation and a most likely optimized pandas implementation. It’s no wonder that the pandas implementation wins.

Also it’s important to compile your C code with optimization (typically -O2). It’s not clear whether you did that.

I tested your implementation of average on my server and compiled it using gcc -Wall -Wextra -Werror prog.c. The result was:

Using average,    time: 0.2207540000000000

Then I turned on optimization, i.e. gcc -Wall -Wextra -Werror -O2 prog.c. The result was:

Using average,    time: 0.0833040000000000

Nearly 3 times faster – just by using -O2 optimization. The lesson is: When measuring performance make sure to turn on a proper optimization level.

Next step… can your implementation be improved?

Here is a first change:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct accu
{
    int sum;
    int count;
};


int average_v1( int* array , int size_array , int* unique , int size_unique , float* avg )
{

    if (size_array % 2 != 0)
    {
        return 1;
    }

    struct accu * accu = calloc(size_unique, sizeof *accu);
    assert(accu != NULL);

    for (int j = 0 ; j < size_array ; j += 2)
    {
      int flag = 0;
      for (int i = 0 ; i < size_unique ; i++)
      {
        if ( array[j] ==  unique[i])
        {
          accu[i].sum += array[j+1];
          ++accu[i].count;
          flag = 1;
          break;
        }
      }
      if (flag == 0)
      {
        puts("Invalid input data");
        exit(1);
      }
    }

    for (int i = 0 ; i < size_unique ; i++)
    {
      avg[i]  = ((float)accu[i].sum / (float)accu[i].count);
    }

    free(accu);

    return 0;
}

The idea is pretty simple:

  1. Swap the inner and outer loops, and break out of the inner loop as soon as a value is matched. On average this roughly halves the number of inner-loop iterations.

  2. Use dynamically allocated memory to hold the intermediate sums and counts.

Again using gcc -Wall -Wextra -Werror -O2 prog.c and running both your average and my average_v1 on the same data set gives me:

Using average,    time: 0.0833580000000000
Using average_v1, time: 0.0280810000000000

So average_v1 is nearly 3 times faster!

But can we do even better?

Yes, here is a slightly more complex implementation that uses binary search to find elements in unique. Binary search requires a sorted array, so a copy of unique is made and sorted with qsort. The original array index is also saved so that the output order is exactly the same.

struct sorted_unique
{
    int value;
    int org_index;
};

int comp(const void * a, const void *b)
{
  const struct sorted_unique * pa = a;
  const struct sorted_unique * pb = b;
  if (pa->value > pb->value) return 1;
  if (pa->value < pb->value) return -1;
  return 0;
}

int average_v2( int* array , int size_array , int* unique , int size_unique , float* avg )
{
  if (size_array % 2 != 0)
  {
    return 1;
  }

  struct sorted_unique * psu = malloc(size_unique * sizeof *psu);
  assert(psu != NULL);
  for (int i = 0; i < size_unique; ++i)
  {
    psu[i].value = unique[i];
    psu[i].org_index = i;
  }
  qsort(psu, size_unique, sizeof *psu, comp);

  struct accu * accu = calloc(size_unique, sizeof *accu);
  assert(accu != NULL);

  for (int j = 0 ; j < size_array ; j += 2)
  {
    int low = 0;
    int high = size_unique-1;
    int mid;
    int v = array[j];

    // Binary search
    while(1)
    {
      if (low <= high)
      {
        mid = low + (high - low)/2;
        if (psu[mid].value == v) break;

        if (v < psu[mid].value)
        {
          high = mid-1;
        }
        else
        {
          low = mid + 1;
        }
      }
      else
      {
        puts("Invalid input data");
        exit(1);
      }
    }

    accu[mid].sum += array[j+1];
    ++accu[mid].count;
  }

  for (int i = 0 ; i < size_unique ; i++)
  {
    avg[psu[i].org_index] = ((float)accu[i].sum / (float)accu[i].count);
  }

  free(accu);
  free(psu);

  return 0;
}

Now I get:

Using average,    time: 0.0833030000000000
Using average_v1, time: 0.0279690000000000
Using average_v2, time: 0.0032790000000000

So average_v2 is 25 times faster than average.
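
To see why the ranking comes out this way, here is a rough count of key comparisons for each variant. It is a back-of-the-envelope sketch that assumes the sizes from the question (40,000 pairs, 2,000 unique values), which may differ from the data set timed above, and it ignores constant factors such as the qsort call.

```python
import math

# Sizes assumed from the question, not from the timed data set.
n_points = 40_000
n_unique = 2_000

# average: every pair is compared against every unique value.
brute_force = n_points * n_unique

# average_v1: the inner loop breaks on a match, so about half on average.
early_break = n_points * n_unique // 2

# average_v2: binary search needs about log2(n_unique) steps per pair.
binary_search = n_points * math.ceil(math.log2(n_unique))

print(brute_force, early_break, binary_search)
```

The measured speedups are smaller than these ratios suggest, which is expected since each binary-search step costs more than a linear-scan step and the sort adds overhead.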

But can we do even better?

Probably… But I’ll stop here.

The lesson to learn here is that C isn’t just C. The algorithm you select for implementing the function will of course impact the performance.

Your implementation is way too simple. That’s why pandas was faster. An optimized C implementation is likely to be faster than pandas, or at least as fast.

BTW, you wrote:

"So I’m pushing the CPU using the C code. How is it that pandas even push it further?"

Well, tricks can be played to optimize cache usage, benefit from branch prediction, use special CPU instructions, and so on. But I don’t think that is the main thing here. Instead, the trick is to get the same result with fewer operations by selecting a better algorithm. At least that is step 1.
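
As an illustration of "fewer operations", here is a sketch of a single-pass group mean that uses a hash table (a Python dict). I am not claiming this is how pandas is implemented internally; group_means is a hypothetical helper, and it only shows that each pair needs to be touched once instead of once per unique value.

```python
from collections import defaultdict

def group_means(pairs):
    """Return {x: mean of y} using one pass over the (x, y) pairs."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for x, y in pairs:
        sums[x] += y        # hash lookup instead of scanning `unique`
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}

pairs = [(1, 10), (2, 4), (1, 20), (3, 7), (2, 6)]
means = group_means(pairs)
print(means)
```

With n pairs this does about n hash lookups, compared with n times the number of unique values for the brute-force loop, and the same idea carries over directly to a C implementation.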

Answered By: Support Ukraine