How to extract substring from pandas column?
Question:
I want to retrieve only the first part of the string for the entire column.
meta["Cell_type"].str.rstrip(" ")[0]
Data:
meta.iloc[1:5]
pd.DataFrame({'Assay Type': {'SRR9200814': 'RNA-Seq',
'SRR9200815': 'RNA-Seq',
'SRR9200816': 'RNA-Seq',
'SRR9200817': 'RNA-Seq'},
'Cell_type': {'SRR9200814': 'normal neural stem cells',
'SRR9200815': 'normal neural stem cells',
'SRR9200816': 'normal neural stem cells',
'SRR9200817': 'normal neural stem cells'},
'Type': {'SRR9200814': 'diploid',
'SRR9200815': 'diploid',
'SRR9200816': 'diploid',
'SRR9200817': 'diploid'}})
Current output:
'normal neural stem cells'
Desired output:
SRR9200813 normal
SRR9200814 glioblastoma
SRR9200815 normal
SRR9200816 normal
SRR9200817 normal
Answers:
Here is one way: apply x.split(), which splits a string into whitespace-separated tokens, to every value in the column and take the first element of the resulting list (df here stands for the question's meta DataFrame).
df["Cell_type"].apply(lambda x : x.split()[0])
# SRR9200814 normal
# SRR9200815 normal
# SRR9200816 normal
# SRR9200817 normal
You can assign the result to the column if you want to modify your dataframe.
df["Cell_type"] = df["Cell_type"].apply(lambda x : x.split()[0])
# Assay Type Cell_type Type
# SRR9200814 RNA-Seq normal diploid
# SRR9200815 RNA-Seq normal diploid
# SRR9200816 RNA-Seq normal diploid
# SRR9200817 RNA-Seq normal diploid
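One caveat with the apply approach: the lambda calls x.split() on every value, so a missing value (NaN is a float) would raise AttributeError. The sketch below is one possible guard, not part of the original answer; the helper name first_token is made up for illustration, and the frame is a two-row subset of the question's data.

```python
import pandas as pd

# Two rows rebuilt from the question's sample data.
meta = pd.DataFrame(
    {
        "Assay Type": {"SRR9200814": "RNA-Seq", "SRR9200815": "RNA-Seq"},
        "Cell_type": {
            "SRR9200814": "normal neural stem cells",
            "SRR9200815": "normal neural stem cells",
        },
        "Type": {"SRR9200814": "diploid", "SRR9200815": "diploid"},
    }
)


def first_token(value):
    """Hypothetical helper: first whitespace-separated token of a string.

    Non-string values (e.g. NaN or None) and blank strings are returned
    unchanged instead of raising.
    """
    if isinstance(value, str):
        tokens = value.split()
        if tokens:
            return tokens[0]
    return value


first_words = meta["Cell_type"].apply(first_token)
print(first_words)
```

If the column is known to contain only non-empty strings, the one-line lambda from the answer is enough.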
meta["Cell_type"].str.split(n = 1, expand = True)[0]
SRR9200814 normal
SRR9200815 normal
SRR9200816 normal
SRR9200817 normal
Name: 0, dtype: object
ref : https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html
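To make the n = 1, expand = True call a bit more concrete, here is a small sketch of what it returns: a DataFrame whose column 0 holds the first token and column 1 holds the rest of the string. The Series below is illustrative; the full glioblastoma string is an assumption, since the question only shows the desired first word.

```python
import pandas as pd

# Illustrative values; "glioblastoma stem cells" is an assumed full string.
cell_type = pd.Series(
    ["normal neural stem cells", "glioblastoma stem cells"],
    index=["SRR9200813", "SRR9200814"],
    name="Cell_type",
)

# n=1 stops after the first split, expand=True spreads the pieces into columns.
parts = cell_type.str.split(n=1, expand=True)

first = parts[0]  # first token of each string
rest = parts[1]   # everything after the first run of whitespace

print(parts)
```

Limiting the split with n=1 avoids building a column for every word when only the first one is needed.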
You can use str.split:
meta['Cell_type'] = meta['Cell_type'].str.split(' ').str[0]
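A small difference worth knowing: splitting on a literal ' ' and splitting on any whitespace (no argument) behave differently when a value has leading spaces. The sketch below uses made-up values with stray spaces to show it; the sample data in the question has no such values, so both forms give the same result there.

```python
import pandas as pd

# Made-up values: the first one has two leading spaces.
s = pd.Series(["  normal neural stem cells", "glioblastoma  cells"])

# Literal single-space separator: leading spaces produce empty first pieces.
by_space = s.str.split(" ").str[0]

# No separator: split on runs of whitespace, ignoring leading whitespace.
by_whitespace = s.str.split().str[0]

print(by_space.tolist())
print(by_whitespace.tolist())
```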
Use str.extract instead of str.split:
>>> meta['Cell_type'].str.extract(r'(\S+)', expand=False)
SRR9200814 normal
SRR9200815 normal
SRR9200816 normal
SRR9200817 normal
Name: Cell_type, dtype: object
Note: \S matches any non-whitespace character, so (\S+) captures the first run of non-whitespace characters in each string.
You can also do:
meta['Result'] = meta['Cell_type'].str.extract(r'(\S+)')
print(meta)
# Output
Assay Type Cell_type Type Result
SRR9200814 RNA-Seq normal neural stem cells diploid normal
SRR9200815 RNA-Seq normal neural stem cells diploid normal
SRR9200816 RNA-Seq normal neural stem cells diploid normal
SRR9200817 RNA-Seq normal neural stem cells diploid normal
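The two str.extract calls above differ only in the expand argument. As a rough sketch on made-up values (including a leading-space string and a missing value, neither of which appears in the question's data): expand=False gives back a Series, the default gives a one-column DataFrame, and rows with no match come back as NaN.

```python
import pandas as pd

# Made-up values to exercise the edge cases.
s = pd.Series(
    ["normal neural stem cells", "  glioblastoma", None],
    name="Cell_type",
)

# One capture group with expand=False returns a Series.
as_series = s.str.extract(r"(\S+)", expand=False)

# The default (expand=True) returns a DataFrame with one column per group.
as_frame = s.str.extract(r"(\S+)")

print(as_series)
print(as_frame)
```

Because the pattern is searched for anywhere in the string, leading whitespace is skipped without an extra strip step.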