Clean column in data frame
Question:
I’m trying to clean one column which contains the ID number which is starting from S and 7 numbers, e.g.: ‘S1234567’ and save only this number into new column. I started with this column named Remarks, this is an example of the data inside:
Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138 Shippment UK
4 S0260425 Shippment US
I’ve menaged to separate those rows which has the format S1234567 + text using the code below:
merged_out['Remarks'] = merged_out['Remarks'].replace("t", "r")
merged_out['Remarks'] = merged_out['Remarks'].replace("n", "r")
s = merged_out['Remarks'].str.split("r").apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'Remarks'
del merged_out['Remarks']
merged_out = merged_out.join(s)
merged_out[['Number','Remarks']] = merged_out.Remarks.str.split(" ", 1, expand=True)
After creating a data frame I found that there are a lot of mistakes inside of that column because the data are written there manually, so there are some examples of those wrong records:
Number
0. Pallets:
1. S0246734/S0246735/S0246736
3. delivery
4. S0258780 31 cok
5. S0246732-
6. 2
7. ok
8. nan
And this is only the wrong data which are in the Number column, I will need to clear this and save only those which has the correct number, if there is sth. like that: S0246732/S0246736/S0246738, then I need to have separated row for each number with the same data as it was for this record. For the other one I need to save those which contains the number, the other should have the null value.
Answers:
Here is a regex approach that will do what I think your question asks:
import pandas as pd
merged_out = pd.DataFrame({
'Remarks':[
'S0252508 Shippment UK',
'S0255111 Shippment UK',
'S0256352 Shippment UK',
'S0259138/S0259139 Shippment UK',
'S12345678 Shippment UK',
'S0260425 Shippment US']
})
pat = r'(?:(bSd{7})/)*(bSd{7}b)'
df = merged_out.Remarks.str.extractall(pat)
df = ( pd.concat([
pd.DataFrame(df.unstack().apply(lambda row: row.dropna().tolist(), axis=1), columns=['Number']),
merged_out],
axis=1).explode('Number') )
df.Remarks = df.Remarks.str.replace(pat + r's*', '', regex=True)
Input:
Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138/S0259139 Shippment UK
4 S12345678 Shippment UK
5 S0260425 Shippment US
Output:
Number Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138 Shippment UK
3 S0259139 Shippment UK
5 S0260425 Shippment US
4 NaN S12345678 Shippment UK
Explanation:
- with
Series.str.extractall()
, use a pattern to obtain 0 or more occurrences of word boundary b
followed by S
followed by 7 digits and a 1 occurrence of S
followed by 7 digits (flanked by word boundaries b
)
- use
unstack()
to eliminate multiple index levels
- use
apply()
with dropna()
and tolist()
to create a new dataframe with a Number
column containing a list of numbers for each row
- use
explode()
to add new rows for lists with more than one Number
item
- with
Series.str.replace()
, filter out the number matches using the previous pattern, plus r's*'
to match trailing whitespace characters, to obtain the residual Remarks
Notes:
- all rows in the sample input contain one valid
Number
except that one row contains multiple Number
values separated by /
delimiters, and another row contains no valid Number
(it has S
followed by 8 digits, more than the 7 that make a valid Number
)
I think the easiest solution is to use regular expressions and a list comprehension:
import re
import pandas as pd
merged_out['Remarks'] = [re.split('s', i)[0] for i in merged_out['Remarks']]
Explanation:
This regular expression allows you to split the data when there is a space and makes a list from the i row in the column Remarks. With the 0, I selected the 0 element in this list. In this case, it is the number.
In this case, the list comprehension iterates through all the column in the dataset. In consequence, you will obtain the corresponding number of each row in the new column Remarks.
I’m trying to clean one column which contains the ID number which is starting from S and 7 numbers, e.g.: ‘S1234567’ and save only this number into new column. I started with this column named Remarks, this is an example of the data inside:
Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138 Shippment UK
4 S0260425 Shippment US
I’ve menaged to separate those rows which has the format S1234567 + text using the code below:
merged_out['Remarks'] = merged_out['Remarks'].replace("t", "r")
merged_out['Remarks'] = merged_out['Remarks'].replace("n", "r")
s = merged_out['Remarks'].str.split("r").apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'Remarks'
del merged_out['Remarks']
merged_out = merged_out.join(s)
merged_out[['Number','Remarks']] = merged_out.Remarks.str.split(" ", 1, expand=True)
After creating a data frame I found that there are a lot of mistakes inside of that column because the data are written there manually, so there are some examples of those wrong records:
Number
0. Pallets:
1. S0246734/S0246735/S0246736
3. delivery
4. S0258780 31 cok
5. S0246732-
6. 2
7. ok
8. nan
And this is only the wrong data which are in the Number column, I will need to clear this and save only those which has the correct number, if there is sth. like that: S0246732/S0246736/S0246738, then I need to have separated row for each number with the same data as it was for this record. For the other one I need to save those which contains the number, the other should have the null value.
Here is a regex approach that will do what I think your question asks:
import pandas as pd
merged_out = pd.DataFrame({
'Remarks':[
'S0252508 Shippment UK',
'S0255111 Shippment UK',
'S0256352 Shippment UK',
'S0259138/S0259139 Shippment UK',
'S12345678 Shippment UK',
'S0260425 Shippment US']
})
pat = r'(?:(bSd{7})/)*(bSd{7}b)'
df = merged_out.Remarks.str.extractall(pat)
df = ( pd.concat([
pd.DataFrame(df.unstack().apply(lambda row: row.dropna().tolist(), axis=1), columns=['Number']),
merged_out],
axis=1).explode('Number') )
df.Remarks = df.Remarks.str.replace(pat + r's*', '', regex=True)
Input:
Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138/S0259139 Shippment UK
4 S12345678 Shippment UK
5 S0260425 Shippment US
Output:
Number Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138 Shippment UK
3 S0259139 Shippment UK
5 S0260425 Shippment US
4 NaN S12345678 Shippment UK
Explanation:
- with
Series.str.extractall()
, use a pattern to obtain 0 or more occurrences of word boundaryb
followed byS
followed by 7 digits and a 1 occurrence ofS
followed by 7 digits (flanked by word boundariesb
) - use
unstack()
to eliminate multiple index levels - use
apply()
withdropna()
andtolist()
to create a new dataframe with aNumber
column containing a list of numbers for each row - use
explode()
to add new rows for lists with more than oneNumber
item - with
Series.str.replace()
, filter out the number matches using the previous pattern, plusr's*'
to match trailing whitespace characters, to obtain the residualRemarks
Notes:
- all rows in the sample input contain one valid
Number
except that one row contains multipleNumber
values separated by/
delimiters, and another row contains no validNumber
(it hasS
followed by 8 digits, more than the 7 that make a validNumber
)
I think the easiest solution is to use regular expressions and a list comprehension:
import re
import pandas as pd
merged_out['Remarks'] = [re.split('s', i)[0] for i in merged_out['Remarks']]
Explanation:
This regular expression allows you to split the data when there is a space and makes a list from the i row in the column Remarks. With the 0, I selected the 0 element in this list. In this case, it is the number.
In this case, the list comprehension iterates through all the column in the dataset. In consequence, you will obtain the corresponding number of each row in the new column Remarks.