How to extract substring from pandas column?

Question

I want to retrieve only the first word of each string, for the entire column. My current attempt:

meta["Cell_type"].str.rstrip(" ")[0]

Data:

meta.iloc[1:5]

pd.DataFrame({'Assay Type': {'SRR9200814': 'RNA-Seq',
  'SRR9200815': 'RNA-Seq',
  'SRR9200816': 'RNA-Seq',
  'SRR9200817': 'RNA-Seq'},
 'Cell_type': {'SRR9200814': 'normal neural stem cells',
  'SRR9200815': 'normal neural stem cells',
  'SRR9200816': 'normal neural stem cells',
  'SRR9200817': 'normal neural stem cells'},
 'Type': {'SRR9200814': 'diploid',
  'SRR9200815': 'diploid',
  'SRR9200816': 'diploid',
  'SRR9200817': 'diploid'}})

Current output:

'normal neural stem cells'

Desired output:

SRR9200813          normal
SRR9200814          glioblastoma 
SRR9200815          normal 
SRR9200816          normal
SRR9200817          normal

Asked By: melolili

Answer 1

Here is a way to apply x.split(), which splits the string into whitespace-separated tokens, to every value in the column and take the first element of the resulting list.

df["Cell_type"].apply(lambda x : x.split()[0])

# SRR9200814    normal
# SRR9200815    normal
# SRR9200816    normal
# SRR9200817    normal

You can assign the result to the column if you want to modify your dataframe.

df["Cell_type"] =  df["Cell_type"].apply(lambda x : x.split()[0])

#            Assay Type Cell_type     Type
# SRR9200814    RNA-Seq    normal  diploid
# SRR9200815    RNA-Seq    normal  diploid
# SRR9200816    RNA-Seq    normal  diploid
# SRR9200817    RNA-Seq    normal  diploid
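
One caveat with this approach: the lambda calls x.split() on every value, so a missing value in the column (None or NaN) raises an AttributeError. A minimal sketch of a guarded version follows, using a small made-up Series rather than the asker's data; the sample values and the name first_word are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; the None entry stands in for a missing cell type.
s = pd.Series(
    ["normal neural stem cells", None, "glioblastoma stem cells"],
    index=["SRR9200813", "SRR9200814", "SRR9200815"],
)

# Split only real strings; pass missing values through unchanged.
first_word = s.apply(lambda x: x.split()[0] if isinstance(x, str) else x)
print(first_word)
```

The .str accessor methods skip missing values on their own, which is one reason to prefer them over apply for this kind of task.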

Answered By: Romain

Answer 2

meta["Cell_type"].str.split(n = 1, expand = True)[0]


SRR9200814    normal
SRR9200815    normal
SRR9200816    normal
SRR9200817    normal
Name: 0, dtype: object

Reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html
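
To make the two arguments concrete: n=1 limits the split to a single cut at the first run of whitespace, and expand=True returns the pieces as DataFrame columns instead of lists. A small sketch under those assumptions, with made-up sample rows:

```python
import pandas as pd

# Hypothetical sample data, including a value that contains no whitespace.
s = pd.Series(
    ["normal neural stem cells", "glioblastoma"],
    index=["SRR9200813", "SRR9200814"],
)

# With no pattern given, str.split splits on whitespace.
# n=1 performs at most one split; expand=True yields a DataFrame.
parts = s.str.split(n=1, expand=True)

first = parts[0]  # first word of each row
rest = parts[1]   # remainder of the string, missing when there was nothing to split off
print(parts)
```

Column 0 holds the first word, and column 1 holds everything after the first whitespace.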

Answered By: manoj

Answer 3

You can use str.split:

meta['Cell_type'] = meta['Cell_type'].str.split(' ').str[0]
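
Note that splitting on an explicit ' ' behaves differently from the default whitespace split when a value has leading spaces. A short sketch with a made-up value shows the difference:

```python
import pandas as pd

# Hypothetical value with a leading space.
s = pd.Series([" normal neural stem cells"], index=["SRR9200814"])

# Splitting on a literal space produces an empty first element.
with_space = s.str.split(" ").str[0]

# Splitting with no pattern ignores leading and repeated whitespace.
default_split = s.str.split().str[0]

print(repr(with_space.iloc[0]))     # ''
print(repr(default_split.iloc[0]))  # 'normal'
```

If the column may contain stray spaces, calling str.split() without a pattern is the safer choice.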

Answered By: Abdulmajeed

Answer 4

Use str.extract instead of str.split:

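>>> meta['Cell_type'].str.extract(r'(\S+)', expand=False)
SRR9200814    normal
SRR9200815    normal
SRR9200816    normal
SRR9200817    normal
Name: Cell_type, dtype: object

Note: \S matches any non-whitespace character, so (\S+) captures the first run of non-whitespace characters. Use a raw string (r'...') so the backslash reaches the regex engine unchanged.

You can also do:

meta['Result'] = meta['Cell_type'].str.extract(r'(\S+)')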
print(meta)

# Output
           Assay Type                 Cell_type     Type  Result
SRR9200814    RNA-Seq  normal neural stem cells  diploid  normal
SRR9200815    RNA-Seq  normal neural stem cells  diploid  normal
SRR9200816    RNA-Seq  normal neural stem cells  diploid  normal
SRR9200817    RNA-Seq  normal neural stem cells  diploid  normal
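
A self-contained sketch of the same idea, assuming the intended pattern is \S+ written as a raw string; the sample values are made up:

```python
import pandas as pd

# Hypothetical sample data; the second value has leading spaces.
s = pd.Series(
    ["normal neural stem cells", "  glioblastoma stem cells"],
    index=["SRR9200813", "SRR9200814"],
    name="Cell_type",
)

# (\S+) captures the first run of non-whitespace characters.
# expand=False returns a Series rather than a one-column DataFrame.
first = s.str.extract(r"(\S+)", expand=False)
print(first)
```

Because the pattern is searched for anywhere in the string, leading whitespace is skipped without an extra strip step.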

Answered By: Corralien
