Extract int from objects in column that has multiple metrics that need scaling
Question:
My column has over 9000 rows of Mb and Kb objects appearing as numbers, that need to be converted all into Kb values (by multiplying Mb values to 1000). This is a snippet of the column values:
array(['5.3M', '47M', '556k', '526k', '76M', '7.6M', '59M', '9.7M', '78M',
'72M', '43M', '7.7M', '6.3M', '334k', '93M', '65M', '23k',
'7.3M', '6.5M', '1.5M', '7.5M'])
I used str.extract below, and it works in converting the objects into int, however I lose the M character and I cannot track which numbers to multiply to 1000
df.Size = df['Size'].str.extract('(d+)').astype(int)
I’ve also tried creating a list of the Mb values to then use a custom function with .apply() to multiply, but I believe there is something wrong with my code:
mblist = df['Size'].str.extract('(d+[M$])').astype(int)
I need the column to all become int and all in Kb metric, so I can conduct analysis on it. Is there a better pathway to get there?
Answers:
We can use np.where()
here along with str.extract
:
df["Size"] = np.where(df["Size"].str.contains(r'M$', regex=True),
1000.0*df["Size"].str.extract('(d+(?:.d+)?)').astype(float),
df["Size"].str.extract('(d+(?:.d+)?)').astype(float))
The above logic extracts the float (or integer) number from the size field, and multiplies by 1000, in the case that the unit be M
for megabytes. Data will then be reported everywhere as KB.
My column has over 9000 rows of Mb and Kb objects appearing as numbers, that need to be converted all into Kb values (by multiplying Mb values to 1000). This is a snippet of the column values:
array(['5.3M', '47M', '556k', '526k', '76M', '7.6M', '59M', '9.7M', '78M',
'72M', '43M', '7.7M', '6.3M', '334k', '93M', '65M', '23k',
'7.3M', '6.5M', '1.5M', '7.5M'])
I used str.extract below, and it works in converting the objects into int, however I lose the M character and I cannot track which numbers to multiply to 1000
df.Size = df['Size'].str.extract('(d+)').astype(int)
I’ve also tried creating a list of the Mb values to then use a custom function with .apply() to multiply, but I believe there is something wrong with my code:
mblist = df['Size'].str.extract('(d+[M$])').astype(int)
I need the column to all become int and all in Kb metric, so I can conduct analysis on it. Is there a better pathway to get there?
We can use np.where()
here along with str.extract
:
df["Size"] = np.where(df["Size"].str.contains(r'M$', regex=True),
1000.0*df["Size"].str.extract('(d+(?:.d+)?)').astype(float),
df["Size"].str.extract('(d+(?:.d+)?)').astype(float))
The above logic extracts the float (or integer) number from the size field, and multiplies by 1000, in the case that the unit be M
for megabytes. Data will then be reported everywhere as KB.