Problem with apply(int) to convert string to int in pandas

Question:

This question follows the question: Problem in Pandas : impossible to do sum of int with arbitrary precision and I used the accepted answer from there: df["my_int"].apply(int).sum()

But it does not work in all cases.

For example, with this file

my_int
9220426963983292163
5657924282683240

The output is -9220659185443576213.

After looking at the apply(int) output, I understand the problem. In this case, apply(int) returns dtype:int64.

0    9220426963983292163
1       5657924282683240
Name: my_int, dtype: int64

But with large numbers, it returns dtype:object:

0    1111111111111111111111111111111111111111111111...
1    2222222222222222222222222222222222222222222222...
Name: my_int, dtype: object

Is it possible to solve this with pandas?
Or should I follow Tim Roberts's answer from the previous question?
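The failure described above can be reproduced in a few lines (a minimal sketch with the two values from the question, built as an in-memory Series rather than read from the file):

```python
import pandas as pd

# The column arrives as strings (pandas cannot parse integers this
# large into int64 reliably, so they are kept as text).
s = pd.Series(["9220426963983292163", "5657924282683240"])

# apply(int) yields Python ints, but since both values fit in 64 bits,
# pandas downcasts the result to dtype int64.
converted = s.apply(int)
print(converted.dtype)  # int64

# The int64 sum exceeds 2**63 - 1 and wraps around:
print(converted.sum())  # -9220659185443576213
```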

Edit 1:

An awful workaround: a line containing a very large integer is appended to the end of the file

my_int
9220426963983292163
5657924282683240
11111111111111111111111111111111111111111111111111111111111111111111111111

and then the sum is computed over all lines except the last one:

data['my_int'].apply(int).iloc[:-1].sum()
Asked By: Stef1611


Answers:

Solution:

df["my_int"].apply(int).astype(object).sum()

apply(int): converts each string to a Python int, so sum() adds numbers instead of concatenating strings.

astype(object): keeps the values as Python objects (arbitrary-precision ints) instead of letting pandas downcast them to int64, which overflows.
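Putting the whole pipeline together (a sketch that rebuilds the question's two-row file in memory; the io.StringIO stand-in is an assumption, not part of the original):

```python
import io
import pandas as pd

# Stand-in for the file from the question.
csv = io.StringIO("my_int\n9220426963983292163\n5657924282683240\n")

# Read the column as strings so nothing is lost on parsing.
df = pd.read_csv(csv, dtype=str)

# apply(int) gives Python ints; astype(object) prevents the int64
# downcast, so sum() uses Python's arbitrary-precision addition.
total = df["my_int"].apply(int).astype(object).sum()
print(total)  # 9226084888265975403
```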

Answered By: Stef1611

Solution using Pandas:

sum(data['my_int'].apply(int).to_list())

Why do I say so?

df1:

my_int
9220426963983292163
5657924282683240

df2:

my_int
9220426963983292163
5657924282683240
11111111111111111111111111111111111111111111111111111111111111111111111111

Let S1 and S2 denote the sum of elements in the column my_int in df1 and df2, respectively:

S1 = 9226084888265975403
S2 = 11111111111111111111111111111111111111111111111111111120337195999377086514

If we check the documentation of NumPy on Overflow errors, we see that NumPy offers limited precision:

>>> np.iinfo(int)
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)

The maximum representable int64 value is smaller than both S1 and S2.

Summing the raw string column, with or without astype(object), merely concatenates the strings; only the proposed solution gives the correct sum:

>>> df2['my_int'].sum()
9220426963983292163565792428268324011111111111111111111111111111111111111111111111111111111111111111
>>> 
>>> df2['my_int'].astype(object).sum()
9220426963983292163565792428268324011111111111111111111111111111111111111111111111111111111111111111
>>>
>>> sum(df2['my_int'].apply(int).to_list())
11111111111111111111111111111111111111111111120337195999377086514

EDIT:
Prefer sum over np.sum:

>>> np.sum(df1['my_int'].apply(int).to_list())
-9220659185443576213
>>> sum(df1['my_int'].apply(int).to_list())
9226084888265975403
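The difference is that np.sum first coerces the Python list into an int64 array, so the addition wraps around, while the built-in sum adds arbitrary-precision Python ints. A quick illustration with the values from df1:

```python
import numpy as np

values = [9220426963983292163, 5657924282683240]

# np.sum converts the list to an int64 ndarray; the result exceeds
# 2**63 - 1 and silently wraps around (two's-complement overflow).
print(np.sum(values))  # -9220659185443576213

# The built-in sum keeps Python ints, which never overflow.
print(sum(values))     # 9226084888265975403
```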

The sums of the my_int column for df1 and df2 were verified with WolframAlpha.

Answered By: medium-dimensional