python runtime 3x deviation for 32 vs 34 char IDs
Question:
I am running an aggregation script that relies heavily on aggregating / grouping on an identifier column. Each identifier in this column is 32 characters long, the result of a hashing function.
So my ID column, which is used in a pandas groupby, has something like
e667sad2345...1238a
as an entry.
I added an "ID" prefix to some of the samples, for easier separation afterwards. Thus, some identifiers now have 34 characters while others still have 32:
e667sad2345...1238a
IDf7901ase323...1344b
Now the aggregation script takes three times as long (6000 vs. 2000 seconds), and the change in the ID column (adding the prefix) is the only thing that happened. Also note that I generate the data separately and save it as a pickle file, which my aggregation script reads as input, so the prefix addition itself is not part of the runtime I am talking about.
So now I am stunned why this particular change made such a huge impact. Can someone elaborate?
EDIT: I replaced the prefix with a suffix, so now it is
e667sad2345...1238a
f7901ase323...1344bID
and now it runs in 2000 seconds again. Does groupby use a binary search or something, so that it matters that the IDs are overrepresented on the starting character "I"?
Answers:
Ok, I had a revelation about what is going on.
My entries are sorted using quicksort, which has an expected runtime of O(n log n). In the worst case, however, quicksort runs in O(n²). By making my entries imbalanced (20% of the data starts with "I", while the other 80% is randomly distributed over alphanumeric characters), I shifted the data toward a bad case for quicksort.
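To see the imbalance concretely, here is a minimal sketch (the helper name, the 10,000-row size, and the exact 20% split are assumptions for illustration) showing how the prefix clusters a fifth of all entries under a single first character, while the suffix leaves the first characters uniformly distributed:

```python
import random
import string

def make_ids(n, seed=0):
    """Generate n random 32-character lowercase-alphanumeric identifiers."""
    rng = random.Random(seed)
    chars = string.ascii_lowercase + string.digits
    return ["".join(rng.choices(chars, k=32)) for _ in range(n)]

ids = make_ids(10_000)

# Prefixing every 5th ID clusters 20% of all entries under the single
# first character "I"; comparisons between those entries must also scan
# past the shared "ID" prefix before they can differ.
prefixed = ["ID" + s if i % 5 == 0 else s for i, s in enumerate(ids)]

# Suffixing the same IDs leaves the first characters uniformly
# distributed, so the sort sees data shaped like the original.
suffixed = [s + "ID" if i % 5 == 0 else s for i, s in enumerate(ids)]

share = sum(s.startswith("I") for s in prefixed) / len(prefixed)
print(f"{share:.0%} of prefixed IDs share the first character 'I'")
```

Timing a sort of `prefixed` vs. `suffixed` on a large enough list should show the gap; the suffixed version has the same string lengths as the prefixed one, which is what isolates the distribution of leading characters as the cause.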