Pandas – How to convert a string column into Integer… then convert into String with 10 characters
Question:
I’m performing a data analysis where one of the steps is to create a key by combining several fields.
Unfortunately, the number of digits in a given field is not always the same.
Some information
- Datatype of my_field is object; nan values have been replaced by the '-' character.
- But, basically, my_field holds numbers (INTEGER) stored as text.
Code
import pandas as pd
import numpy as np
data ={'product': ['PA1', 'PA2', 'PA3', 'PA4', 'PA5', 'PA6', 'PA7', 'PA8'],
'my_field': ['001', '0000000000002', '3', '04', '-', '5', '-', '6']}
df = pd.DataFrame(data)
df
Raw Data
  product       my_field
0     PA1            001
1     PA2  0000000000002
2     PA3              3
3     PA4             04
4     PA5              -
5     PA6              5
6     PA7              -
7     PA8              6
My Approach:
df['my_field'] = np.where(df['my_field'] == '-', '-' , df['my_field'].str.zfill(10) )
df
My Output:
  product       my_field
0     PA1     0000000001
1     PA2  0000000000002
2     PA3     0000000003
3     PA4     0000000004
4     PA5              -
5     PA6     0000000005
6     PA7              -
7     PA8     0000000006
Desired Output:
  product    my_field
0     PA1  0000000001
1     PA2  0000000002
2     PA3  0000000003
3     PA4  0000000004
4     PA5           -
5     PA6  0000000005
6     PA7           -
7     PA8  0000000006
The problem: some outputs end up with more than 10 characters.
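The reason is that zfill only pads: a string that is already longer than the requested width comes back unchanged. A minimal sketch with plain Python strings, using two values from the data above (Series.str.zfill treats each element of digits the same way):

```python
# str.zfill pads on the left with zeros but never truncates.
padded_short = '3'.zfill(10)
padded_long = '0000000000002'.zfill(10)  # 13 characters in, 13 characters out

print(padded_short)  # 0000000003
print(padded_long)   # 0000000000002
```

So any value that arrives with more than 10 characters needs a separate trimming step.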
Answers:
An alternative solution using len():
def myfield_format(x):
    x = str(x)
    if x == '-':         # placeholder for missing values stays as-is
        return x
    if len(x) > 10:      # too long: keep only the last 10 characters
        return x[-10:]
    return (10 - len(x)) * '0' + x   # too short: left-pad with zeros

df['my_field'] = df['my_field'].map(myfield_format)
product    my_field
PA1      0000000001
PA2      0000000002
PA3      0000000003
PA4      0000000004
PA5               -
PA6      0000000005
PA7               -
PA8      0000000006
What about slicing after zfill? This way you keep only the last 10 characters:
df['my_field'] = np.where(df['my_field'] == '-', '-', df['my_field'].str.zfill(10).str[-10:])
Alternative with boolean indexing:
df.loc[df['my_field'] != '-',
'my_field'] = df['my_field'].str.zfill(10).str[-10:]
Output:
  product    my_field
0     PA1  0000000001
1     PA2  0000000002
2     PA3  0000000003
3     PA4  0000000004
4     PA5           -
5     PA6  0000000005
6     PA7           -
7     PA8  0000000006
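One caveat with slicing: .str[-10:] keeps the last 10 characters whatever they are, so a value with more than 10 significant digits would silently lose its leading digits. A small sketch of how that might be checked before slicing; the 11-digit value is made up for illustration and is not part of the question's data:

```python
import pandas as pd

# '12345678901' is a hypothetical 11-digit value, not from the original data.
s = pd.Series(['0000000000002', '12345678901'])

# Slicing drops surplus leading zeros, but also real digits if there are too many.
padded = s.str.zfill(10).str[-10:]
print(padded.tolist())    # ['0000000002', '2345678901']

# Flag values that are still longer than 10 characters without their leading zeros.
too_long = s.str.lstrip('0').str.len() > 10
print(too_long.tolist())  # [False, True]
```

If too_long is ever True, the truncated key no longer identifies the original value, so those rows probably deserve a closer look.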
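Another option is to convert the digit strings to int and back, leaving non-numeric values untouched:
df.assign(my_field=df.my_field.map(lambda x: str(int(x)).zfill(10) if x.isdigit() else x))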
  product    my_field
0     PA1  0000000001
1     PA2  0000000002
2     PA3  0000000003
3     PA4  0000000004
4     PA5           -
5     PA6  0000000005
6     PA7           -
7     PA8  0000000006
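The same "string to integer, then back to a 10-character string" idea could also be written without a lambda. The sketch below is one possible variant, not taken from the answers above: it assumes every non-missing value is a non-negative integer small enough for int64, and the names as_number and is_number are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['PA1', 'PA2', 'PA3', 'PA4', 'PA5', 'PA6', 'PA7', 'PA8'],
    'my_field': ['001', '0000000000002', '3', '04', '-', '5', '-', '6'],
})

# Entries that cannot be parsed as numbers (here '-') become NaN.
as_number = pd.to_numeric(df['my_field'], errors='coerce')
is_number = as_number.notna()

# Reformat only the numeric rows; the '-' rows keep their original value.
df.loc[is_number, 'my_field'] = (
    as_number[is_number].astype('int64').astype(str).str.zfill(10)
)

print(df['my_field'].tolist())
# ['0000000001', '0000000002', '0000000003', '0000000004', '-',
#  '0000000005', '-', '0000000006']
```

Because the values pass through a real integer, surplus leading zeros disappear on their own. Note that the NaN placeholders make the intermediate Series float64, so very large integers (beyond 2**53) could lose precision with this route.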