Format decimals as percentages in a column

Question:

Let’s say I have the following pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'name': ['Johnny', 'Brad'], 'rating': [1.0, 0.9]})

I want to convert the rating column from a decimal to a percentage as a string (e.g. 1.0 to '100%'). The following works okay:

def decimal_to_percent_string(row):
    return '{}%'.format(row['rating'] * 100)

df['rating'] = df.apply(func=decimal_to_percent_string, axis=1)

This seems very inefficient to me: apply with axis=1 calls the function once per row across the entire DataFrame, and my DataFrame is very large. Is there a better way to do this?

Asked By: Johnny Metz

Answers:

Use pandas’ vectorized operations:

df.rating = (df.rating * 100).astype(str) + '%'
df 
     name  rating
0  Johnny  100.0%
1    Brad   90.0%

Alternatively, using Series.mul and Series.add:

df.rating = df.rating.mul(100).astype(str).add('%')
df
     name  rating
0  Johnny  100.0%
1    Brad   90.0%
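
One caveat with astype(str): it writes out the full float representation, so a product that is not exactly representable leaks into the string (in Python, 0.07 * 100 evaluates to 7.000000000000001). A minimal sketch, assuming one decimal place is the precision wanted; both that precision and the sample data are assumptions, not part of the question:

```python
import pandas as pd

# Hypothetical sample data chosen to expose floating-point noise.
df = pd.DataFrame({'name': ['Johnny', 'Brad'], 'rating': [1.0, 0.07]})

# Without rounding, the noisy float value ends up in the string.
raw = df['rating'].mul(100).astype(str).add('%')

# Rounding to one decimal (an assumed precision) before the cast avoids that.
rounded = df['rating'].mul(100).round(1).astype(str).add('%')

print(raw.tolist())
print(rounded.tolist())  # ['100.0%', '7.0%']
```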
Answered By: cs95

# round() first: astype(int) alone truncates toward zero
df['rating'] = df['rating'].mul(100).round().astype(int).astype(str).add('%')
print(df)

Output:

     name rating
0  Johnny   100%
1    Brad    90%
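
Note that astype(int) truncates toward zero rather than rounding, which matters when the product lands just below a whole number (in Python, 0.29 * 100 evaluates to 28.999999999999996). A small sketch with hypothetical sample data, comparing a bare integer cast with rounding first:

```python
import pandas as pd

# Hypothetical sample data; 0.29 * 100 falls just below 29 in floating point.
df = pd.DataFrame({'name': ['Johnny', 'Brad'], 'rating': [1.0, 0.29]})

# A bare integer cast drops the fractional part.
truncated = df['rating'].mul(100).astype(int).astype(str).add('%')

# Rounding to the nearest whole number before the cast gives the expected value.
rounded = df['rating'].mul(100).round().astype(int).astype(str).add('%')

print(truncated.tolist())  # ['100%', '28%']
print(rounded.tolist())    # ['100%', '29%']
```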
Answered By: Scott Boston

Try this:

df['rating'] = pd.Series(["{0:.2f}%".format(val*100) for val in df['rating']], index = df.index)
print(df)

The output is:

     name   rating
0  Johnny  100.00%
1    Brad   90.00%
Answered By: whateveros

1. Solution for display only

If you just want the DataFrame to display that column as a percentage, it’s better to use a formatter, because the rating column itself isn’t changed and you can still perform further numeric operations on it.

df.style.format({'rating': '{:.2%}'.format})

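This returns a Styler and leaves df unmodified, so print(df) still shows the raw floats. Displaying the Styler (for example, as the last expression in a Jupyter notebook cell) shows the column formatted like this: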

     name  rating
0  Johnny 100.00%
1    Brad  90.00%
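
For plain-text output outside a notebook, a comparable display-only effect is available through the formatters argument of DataFrame.to_string, which also leaves the underlying floats untouched. A minimal sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Johnny', 'Brad'], 'rating': [1.0, 0.9]})

# Format only the rendered text; df itself is not modified.
text = df.to_string(formatters={'rating': '{:.2%}'.format})
print(text)

# The column is still numeric, so arithmetic keeps working.
print(df['rating'].mean())
```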

2. Solution with conversion

If you actually need to convert the field to a string (e.g. for ETL purposes), this command is both more idiomatic and, in the speed test below, the fastest on small and large DataFrames alike:

df['rating'] = df['rating'].apply('{:.2%}'.format)

The rating column now holds strings, and it displays identically to the result above.
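
If the column may contain missing values, note that '{:.2%}'.format(float('nan')) produces the string 'nan%'. One option, sketched here with hypothetical data, is Series.map with na_action='ignore', which skips missing entries and leaves them as NaN:

```python
import pandas as pd

# Hypothetical data with one missing rating.
df = pd.DataFrame({'name': ['Johnny', 'Brad', 'Ann'],
                   'rating': [1.0, 0.9, None]})

# Plain formatting turns NaN into the literal string 'nan%'.
with_nan_text = df['rating'].apply('{:.2%}'.format)

# na_action='ignore' passes missing values through without calling the formatter.
formatted = df['rating'].map('{:.2%}'.format, na_action='ignore')

print(with_nan_text.tolist())
print(formatted.tolist())
```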

Speed test

import sys
import timeit
import pandas as pd

print(f"Pandas: {pd.__version__} Python: {sys.version[:5]}\n")

for cur_size in [1, 10, 100, 1000, 10000, 100000, 1000000]:
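    # Build the setup code for timeit. Note that ff uses '{:.2f}%', which
    # appends '%' without multiplying by 100, unlike the '{:.2%}' spec.
    mysetup = (f"import pandas as pd; df = pd.DataFrame({{"
        f"'name': ['Johnny', 'Brad']*{cur_size}, "
        f"'rating': [1.0, 0.9]*{cur_size}}}); "
        f"ff = '{{:.2f}}%'.format")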

    cs95    = "df.rating.mul(100).astype(str).add('%')"
    michael = "df['rating'].apply(ff)"

    speeds = []
    for stmt in [cs95, michael]:
        speeds.append(timeit.timeit(setup=mysetup, stmt=stmt, number=100))

    print(f"Length: {cur_size*2}.  {speeds[0]:.2f}s vs {speeds[1]:.2f}s")

Results:

Pandas: 1.4.3 Python: 3.9.7

Length: 2.         0.02s vs  0.01s
Length: 20.        0.02s vs  0.02s
Length: 200.       0.03s vs  0.03s
Length: 2000.      0.09s vs  0.08s
Length: 20000.     0.79s vs  0.65s
Length: 200000.    8.44s vs  6.94s
Length: 2000000.  90.44s vs 73.57s

Conclusion: the apply method is more idiomatic to pandas and Python, and in this test it took roughly 18% less time than the vectorized string approach once the DataFrame reached 20,000 rows or more.

Answered By: Michael Currie