Remove string from alphanumeric column in Python
Question:
I have a dataframe
import pandas as pd
data_as_dict={'CHROM': {232: 1, 233: 1, 234: 1, 10: 'chr15', 11: 'chr15'}, 'POS_GRCh38': {232: 10506158, 233: 109655507, 234: 113903258, 10: '67165147', 11: '67163292'}, 'REF': {232: 'G', 233: 'CAAA', 234: 'G', 10: 'G', 11: 'C'}, 'Effect_allele': {232: 'A', 233: 'C', 234: 'A', 10: 'C', 11: 'T'}, 'Effect_size': {232: 0.1109, 233: 0.0266, 234: 0.0579, 10: 0.2070141693843261, 11: 0.2151113796169455}, 'TYPE': {232: 'Mavaddat_2019_ER_NEG_Breast', 233: 'Mavaddat_2019_ER_NEG_Breast', 234: 'Mavaddat_2019_ER_NEG_Breast', 10: 'THYROID_PGS', 11: 'THYROID_PGS'}, 'Cancer': {232: 'Breast', 233: 'Breast', 234: 'Breast', 10: 'THYROID', 11: 'THYROID'}, 'Significant_YN': {232: 'Y', 233: 'Y', 234: 'Y', 10: 'Y', 11: 'Y'}}
all_cancers = pd.DataFrame.from_dict(data_as_dict)
I want to remove chr from the CHROM column. I tried
all_cancers['CHROM'] = all_cancers['CHROM'].str.replace(r'chr', '')
which generates NaNs. I know it can be done easily in R with gsub, but I wanted to try it in Python. How do I do it correctly?
Answers:
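Cast the column to string first. CHROM mixes integers and strings, and the .str accessor returns NaN for every element that is not a string, which is where the NaNs come from: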
all_cancers['CHROM'] = all_cancers['CHROM'].astype(str).str.replace(r'chr', '')
-output
all_cancers
     CHROM  POS_GRCh38   REF  Effect_allele  Effect_size                         TYPE   Cancer  Significant_YN
232      1    10506158     G              A     0.110900  Mavaddat_2019_ER_NEG_Breast   Breast               Y
233      1   109655507  CAAA              C     0.026600  Mavaddat_2019_ER_NEG_Breast   Breast               Y
234      1   113903258     G              A     0.057900  Mavaddat_2019_ER_NEG_Breast   Breast               Y
 10     15    67165147     G              C     0.207014                  THYROID_PGS  THYROID               Y
 11     15    67163292     C              T     0.215111                  THYROID_PGS  THYROID               Y
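To see why the cast matters, here is a minimal sketch on a small stand-in Series (the name chrom and its three values are illustrative, taken from the CHROM column above). It assumes a pandas version where str.replace accepts the regex keyword.

```python
import pandas as pd

# Stand-in for the CHROM column: integers mixed with 'chr'-prefixed strings.
chrom = pd.Series([1, 1, 'chr15'], index=[232, 233, 10])

# Without the cast, .str methods skip non-string elements and yield NaN for them.
direct = chrom.str.replace('chr', '', regex=False)

# With the cast, every element is a string, so nothing is lost.
casted = chrom.astype(str).str.replace('chr', '', regex=False)

print(direct.tolist())
print(casted.tolist())
```

Note that after the cast the column holds strings ('1', '15'), not integers; converting back with astype(int) is a separate step if numeric values are needed.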
Using RegEx:
import re
all_cancers["CHROM"] = all_cancers["CHROM"].apply(lambda x: re.sub(r'\D', '', str(x)))
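One caveat with a pattern that deletes every non-digit character: a chromosome label such as 'chrX' would be reduced to an empty string. The question's data contains no such value, so the 'chrX' entry below is a hypothetical case. As a sketch under that assumption, anchoring the pattern to a leading 'chr' removes only the prefix:

```python
import pandas as pd

# 'chrX' is a hypothetical value, not present in the question's data.
chrom = pd.Series([1, 'chr15', 'chrX'])

# Strip only a leading 'chr'; all other characters are kept.
cleaned = chrom.astype(str).str.replace(r'^chr', '', regex=True)

print(cleaned.tolist())
```

This keeps non-numeric chromosome names intact while still normalizing the numeric ones.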