Clean column in data frame

Question:

I’m trying to clean one column which contains the ID number which is starting from S and 7 numbers, e.g.: ‘S1234567’ and save only this number into new column. I started with this column named Remarks, this is an example of the data inside:

         Remarks
0   S0252508 Shippment UK
1   S0255111 Shippment UK
2   S0256352 Shippment UK
3   S0259138 Shippment UK
4   S0260425 Shippment US

I’ve menaged to separate those rows which has the format S1234567 + text using the code below:

merged_out['Remarks'] = merged_out['Remarks'].replace("t", "r")
merged_out['Remarks'] = merged_out['Remarks'].replace("n", "r")
s = merged_out['Remarks'].str.split("r").apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'Remarks'
del merged_out['Remarks']
merged_out = merged_out.join(s)
merged_out[['Number','Remarks']] = merged_out.Remarks.str.split(" ", 1, expand=True)

After creating a data frame I found that there are a lot of mistakes inside of that column because the data are written there manually, so there are some examples of those wrong records:

    Number
0. Pallets:
1. S0246734/S0246735/S0246736
3. delivery
4. S0258780 31  cok
5. S0246732-
6. 2
7. ok
8. nan

And this is only the wrong data which are in the Number column, I will need to clear this and save only those which has the correct number, if there is sth. like that: S0246732/S0246736/S0246738, then I need to have separated row for each number with the same data as it was for this record. For the other one I need to save those which contains the number, the other should have the null value.

Asked By: Ulewsky

||

Answers:

Here is a regex approach that will do what I think your question asks:

import pandas as pd
merged_out = pd.DataFrame({
'Remarks':[
'S0252508 Shippment UK',
'S0255111 Shippment UK',
'S0256352 Shippment UK',
'S0259138/S0259139 Shippment UK',
'S12345678 Shippment UK',
'S0260425 Shippment US']
})

pat = r'(?:(bSd{7})/)*(bSd{7}b)'
df = merged_out.Remarks.str.extractall(pat)
df = ( pd.concat([
    pd.DataFrame(df.unstack().apply(lambda row: row.dropna().tolist(), axis=1), columns=['Number']), 
    merged_out], 
    axis=1).explode('Number') )
df.Remarks = df.Remarks.str.replace(pat + r's*', '', regex=True)

Input:

                          Remarks
0           S0252508 Shippment UK
1           S0255111 Shippment UK
2           S0256352 Shippment UK
3  S0259138/S0259139 Shippment UK
4          S12345678 Shippment UK
5           S0260425 Shippment US

Output:

     Number                 Remarks
0  S0252508            Shippment UK
1  S0255111            Shippment UK
2  S0256352            Shippment UK
3  S0259138            Shippment UK
3  S0259139            Shippment UK
5  S0260425            Shippment US
4       NaN  S12345678 Shippment UK

Explanation:

  • with Series.str.extractall(), use a pattern to obtain 0 or more occurrences of word boundary b followed by S followed by 7 digits and a 1 occurrence of S followed by 7 digits (flanked by word boundaries b)
  • use unstack() to eliminate multiple index levels
  • use apply() with dropna() and tolist() to create a new dataframe with a Number column containing a list of numbers for each row
  • use explode() to add new rows for lists with more than one Number item
  • with Series.str.replace(), filter out the number matches using the previous pattern, plus r's*' to match trailing whitespace characters, to obtain the residual Remarks

Notes:

  • all rows in the sample input contain one valid Number except that one row contains multiple Number values separated by / delimiters, and another row contains no valid Number (it has S followed by 8 digits, more than the 7 that make a valid Number)
Answered By: constantstranger

I think the easiest solution is to use regular expressions and a list comprehension:

import re
import pandas as pd
merged_out['Remarks'] = [re.split('s', i)[0] for i in merged_out['Remarks']]

Explanation:

This regular expression allows you to split the data when there is a space and makes a list from the i row in the column Remarks. With the 0, I selected the 0 element in this list. In this case, it is the number.

In this case, the list comprehension iterates through all the column in the dataset. In consequence, you will obtain the corresponding number of each row in the new column Remarks.

Answered By: Angel Bujalance