Extract int from objects in column that has multiple metrics that need scaling

Question:

My column has over 9000 rows of Mb and Kb objects appearing as numbers, that need to be converted all into Kb values (by multiplying Mb values to 1000). This is a snippet of the column values:

array(['5.3M', '47M', '556k', '526k', '76M', '7.6M', '59M', '9.7M', '78M',
       '72M', '43M', '7.7M', '6.3M', '334k', '93M', '65M', '23k',
       '7.3M', '6.5M', '1.5M', '7.5M'])

I used str.extract below, and it works in converting the objects into int, however I lose the M character and I cannot track which numbers to multiply to 1000

df.Size = df['Size'].str.extract('(d+)').astype(int)

I’ve also tried creating a list of the Mb values to then use a custom function with .apply() to multiply, but I believe there is something wrong with my code:

mblist = df['Size'].str.extract('(d+[M$])').astype(int)

I need the column to all become int and all in Kb metric, so I can conduct analysis on it. Is there a better pathway to get there?

Asked By: Priscilla J

||

Answers:

We can use np.where() here along with str.extract:

df["Size"] = np.where(df["Size"].str.contains(r'M$', regex=True),
                      1000.0*df["Size"].str.extract('(d+(?:.d+)?)').astype(float),
                      df["Size"].str.extract('(d+(?:.d+)?)').astype(float))

The above logic extracts the float (or integer) number from the size field, and multiplies by 1000, in the case that the unit be M for megabytes. Data will then be reported everywhere as KB.

Answered By: Tim Biegeleisen