python runtime 3x deviation for 32 vs 34 char IDs
Question:
I am running an aggregation script that relies heavily on aggregating / grouping on an identifier column. Each identifier in this column is 32 characters long, the result of a hashing function.
So my ID column, which is used in a pandas groupby, has something like
e667sad2345...1238a
as an entry.
I added an "ID" prefix to some of the samples, for easier separation afterwards. Thus, some identifiers now have 34 characters while others still have 32:
e667sad2345...1238a
IDf7901ase323...1344b
Now the aggregation script takes three times as long (6000 vs. 2000 seconds), and the change in the ID column (adding the prefix) is the only thing that happened. Also note that I generate the data separately and save it as a pickle file, which my aggregation script reads as input, so the prefix addition itself is not part of the runtime I am talking about.
So now I am stunned why this particular change made such a huge impact. Can someone elaborate?
EDIT: I replaced the prefix with a suffix, so now it is
e667sad2345...1238a
f7901ase323...1344bID
and now it runs in 2000 seconds again. Does groupby use a binary search or something, so that it matters that the IDs are overrepresented on the starting character "I"?
Answers:
Ok, I had a revelation about what is going on.
My entries are sorted using quicksort, which has an expected runtime of O(n log n). In the worst case, however, quicksort runs in O(n²). By making my entries imbalanced (20% of the data starts with "I", while the other 80% is randomly distributed over alphanumeric characters), I shifted the data toward a bad case for quicksort.
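To see the imbalance concretely, here is a minimal sketch (the helper name, the 10,000-row size, and the exact 20% split are assumptions for illustration) showing how the prefix clusters a fifth of all entries under a single first character, while the suffix leaves the first characters uniformly distributed:

```python
import random
import string

def make_ids(n, seed=0):
    """Generate n random 32-character lowercase-alphanumeric identifiers."""
    rng = random.Random(seed)
    chars = string.ascii_lowercase + string.digits
    return ["".join(rng.choices(chars, k=32)) for _ in range(n)]

ids = make_ids(10_000)

# Prefixing every 5th ID clusters 20% of all entries under the single
# first character "I"; comparisons between those entries must also scan
# past the shared "ID" prefix before they can differ.
prefixed = ["ID" + s if i % 5 == 0 else s for i, s in enumerate(ids)]

# Suffixing the same IDs leaves the first characters uniformly
# distributed, so the sort sees data shaped like the original.
suffixed = [s + "ID" if i % 5 == 0 else s for i, s in enumerate(ids)]

share = sum(s.startswith("I") for s in prefixed) / len(prefixed)
print(f"{share:.0%} of prefixed IDs share the first character 'I'")
```

Timing a sort of `prefixed` vs. `suffixed` on a large enough list should show the gap; the suffixed version has the same string lengths as the prefixed one, which is what isolates the distribution of leading characters as the cause.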