How to separate strings from a column in pandas?
Question:
I have 2 columns:
A | B |
---|---|
1 | ABCSD |
2 | SSNFs |
3 CVY KIP | |
4 MSSSQ | |
5 | ABCSD |
6 MMS LLS | |
7 | QQLL |
This is an example; the actual files contain these kinds of cases in 1000+ rows.
I want to separate all the letters from column A and get them as output in column B:
Expected Output:
A | B |
---|---|
1 | ABCSD |
2 | SSNFs |
3 | CVY KIP |
4 | MSSSQ |
5 | ABCSD |
6 | MMS LLS |
7 | QQLL |
So far I have tried this, which works, but I am looking for a better way:
df['B2'] = df['A'].str.split(' ').str[1:]

def try_join(l):
    try:
        return ' '.join(map(str, l))
    except TypeError:
        return np.nan

df['B2'] = [try_join(l) for l in df['B2']]
df = df.replace('', np.nan)
append = df['B2']
df['B'] = df['B'].combine_first(append)
df['A'] = [str(x).split(' ')[0] for x in df['A']]
df.drop(['B2'], axis=1, inplace=True)
df
Answers:
You could try as follows.
- Either use str.extractall with two named capture groups (generic: (?P<name>...)) as A and B. The first one matches the digit(s) at the start, the second one the rest of the string. (You can easily adjust these patterns if your actual strings are less straightforward.) Finally, drop the added index level (1) by using df.droplevel.
- Or use str.split with n=1 and expand=True and rename the columns (0 and 1 to A and B).
- Either option can be placed inside df.update with overwrite=True to get the desired outcome.
import pandas as pd
import numpy as np
data = {'A': {0: '1', 1: '2', 2: '3 CVY KIP', 3: '4 MSSSQ',
              4: '5', 5: '6 MMS LLS', 6: '7'},
        'B': {0: 'ABCSD', 1: 'SSNFs', 2: np.nan, 3: np.nan,
              4: 'ABCSD', 5: np.nan, 6: 'QQLL'}
        }
df = pd.DataFrame(data)
df.update(df.A.str.extractall(r'(?P<A>^\d+)\s(?P<B>.*)').droplevel(1),
          overwrite=True)
# or in this case probably easier:
# df.update(df.A.str.split(pat=' ', n=1, expand=True)
#           .rename(columns={0:'A',1:'B'}),overwrite=True)
df['A'] = df.A.astype(int)
print(df)
   A        B
0  1    ABCSD
1  2    SSNFs
2  3  CVY KIP
3  4    MSSSQ
4  5    ABCSD
5  6  MMS LLS
6  7     QQLL
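To make the role of droplevel(1) more concrete, here is a minimal sketch of the intermediate steps, using the same sample data as above. The variable names matches and parts are only illustrative and are not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['1', '2', '3 CVY KIP', '4 MSSSQ', '5', '6 MMS LLS', '7'],
    'B': ['ABCSD', 'SSNFs', np.nan, np.nan, 'ABCSD', np.nan, 'QQLL'],
})

# extractall returns one row per match, indexed by
# (original row label, match number), i.e. a two-level index.
matches = df['A'].str.extractall(r'(?P<A>^\d+)\s(?P<B>.*)')
print(matches.index.nlevels)

# Dropping the match-number level leaves only the original row labels,
# so df.update can align the extracted values with df.
parts = matches.droplevel(1)
print(parts.index.tolist())  # labels of the rows where the pattern matched

df.update(parts)  # overwrite=True is the default
df['A'] = df['A'].astype(int)
print(df)
```

Only rows whose A value contains a digit run followed by whitespace appear in parts, so the remaining rows are left as they were.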
You could use str.split() if your number appears first.
df['A'].str.split(n=1,expand=True).set_axis(df.columns,axis=1).combine_first(df)
or
df['A'].str.extract(r'(?P<A>\d+) (?P<B>[A-Za-z ]+)').combine_first(df)
Output:
   A        B
0  1    ABCSD
1  2    SSNFs
2  3  CVY KIP
3  4    MSSSQ
4  5    ABCSD
5  6  MMS LLS
6  7     QQLL
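Note that combine_first returns a new DataFrame rather than modifying df, and that column A still holds strings afterwards. A step-by-step sketch of the str.split variant, assuming the same sample data as in the question (the names split and result are only illustrative, and the final integer conversion is an addition):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['1', '2', '3 CVY KIP', '4 MSSSQ', '5', '6 MMS LLS', '7'],
    'B': ['ABCSD', 'SSNFs', np.nan, np.nan, 'ABCSD', np.nan, 'QQLL'],
})

# Split on the first run of whitespace only; rows without text
# after the number get a missing value in the second column.
split = df['A'].str.split(n=1, expand=True)

# Relabel the positional columns 0 and 1 as A and B.
split = split.set_axis(df.columns, axis=1)

# Missing values in split are filled from the original df.
result = split.combine_first(df)

# Assumption: every value left in A is a plain integer string.
result['A'] = result['A'].astype(int)
print(result)
```

The original df is untouched here, so the result has to be assigned back if it should replace df.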
You can split on ' ', as it seems that the numeric value is always at the beginning and the text is after a space.
split = df.A.str.split(' ', n=1)  # n must be passed by keyword in pandas 2.0+
df.loc[df.B.isnull(), 'B'] = split.str[1]
df.loc[:, 'A'] = split.str[0]
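Because this approach only writes to B where B is missing, an existing B value is kept even when A also contains text. A small sketch of that behaviour, assuming the question's sample data plus one hypothetical extra row ('8 XYZ' with B already set to 'KEEP') that is not in the original data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # The last row is a hypothetical case added for illustration only.
    'A': ['1', '2', '3 CVY KIP', '4 MSSSQ', '5', '6 MMS LLS', '7', '8 XYZ'],
    'B': ['ABCSD', 'SSNFs', np.nan, np.nan, 'ABCSD', np.nan, 'QQLL', 'KEEP'],
})

# Series of lists: ['3', 'CVY KIP'] or just ['1'] when there is no text.
split = df['A'].str.split(' ', n=1)

# Fill B only where it is currently missing; assignment aligns on the index.
mask = df['B'].isna()
df.loc[mask, 'B'] = split.str[1]

# Keep only the leading number in A and convert it to int.
df['A'] = split.str[0].astype(int)
print(df)
```

In the hypothetical last row the text 'XYZ' is discarded and 'KEEP' stays in B. Whether that is the desired behaviour depends on the real data, which the question does not specify.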