pandas split values in column

Question:

I’m new to pandas (version 1.1.5) and have tried str.split() and str.extract() to split the values in column POS, with no success. My dataframe has about 3000 rows and is structured like this (note the _ and - delimiters in a subset of rows):

df.head()

    SAMPLE CHROM                  POS REF ALT
1  Sample1     7            105121514   C   T
2  Sample2    17              7359940   C   A
3  Sample3     X             76777781   A   G
4  Sample4    16    70531965-70531965   C   G
5  Sample5     6    26093141-26093141   G   A
6  Sample6    12             11905465   C   T
7  Sample7     4  103527484_103527848   G   A

I would like for the dataframe to look like this (i.e. retain values preceding all delimiters):

    SAMPLE CHROM        POS REF ALT
1  Sample1     7  105121514   C   T
2  Sample2    17    7359940   C   A
3  Sample3     X   76777781   A   G
4  Sample4    16   70531965   C   G
5  Sample5     6   26093141   G   A
6  Sample6    12   11905465   C   T
7  Sample7     4  103527484   G   A

My attempts have either split only the rows containing a delimiter and dropped all other rows, dropped all rows containing the delimiters, or dropped all values.

For example, df['POS'] = df['POS'].str.replace(r'[-|_]\d+', '') outputs:

    SAMPLE CHROM  POS REF ALT
1  Sample1     7  NaN   C   T
2  Sample2    17  NaN   C   A
3  Sample3     X  NaN   A   G
4  Sample4    16  NaN   C   G
5  Sample5     6  NaN   G   A
6  Sample6    12  NaN   C   T
7  Sample7     4  NaN   G   A

Accepting the solution from @PaulS below as I needed to convert the column datatype from object to string first in order for str.replace() to work!

df.dtypes

SAMPLE    object
CHROM     object
POS       object
REF       object
ALT       object
dtype: object

df['POS'] = df['POS'].astype('str')
df['POS'] = df['POS'].str.replace(r'[-|_]\d+', '', regex=True)
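
The need for the cast can be illustrated with a small sketch. It assumes (the question does not show how the data was loaded) that POS holds a mix of Python integers and strings; the three-row frame below is hypothetical. On an object column like that, the .str accessor returns NaN for the non-string elements, so casting with astype(str) first lets the regex see every value. regex=True is passed explicitly because the default for str.replace() changed to False in pandas 2.0.

```python
import pandas as pd

# Hypothetical sample: POS mixes a Python int with strings,
# so pandas stores the column with dtype "object".
df = pd.DataFrame({"POS": [105121514, "70531965-70531965", "103527484_103527848"]})

# Without the cast, the int element is not a string and comes back as NaN.
without_cast = df["POS"].str.replace(r"[-_]\d+", "", regex=True)

# With the cast, every element is a string and the regex applies to all rows.
with_cast = df["POS"].astype(str).str.replace(r"[-_]\d+", "", regex=True)

print(without_cast.tolist())
print(with_cast.tolist())  # ['105121514', '70531965', '103527484']
```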
Asked By: eb0906


Answers:

A possible solution, based on the idea of replacing all characters after _ or - (inclusive) with the empty string (''):

df['POS'] = df['POS'].str.replace(r'[-_]\d+', '', regex=True)

Output:

  CHROM        POS REF ALT
0     7  105121514   C   T
1    17    7359940   C   A
2     X   76777781   A   G
3    16   70531965   C   G
4     6   26093141   G   A
5    12   11905465   C   T
6     4  103527484   G   A
Answered By: PaulS

If you are on pandas >= 1.4, you can use a regex with str.split(). Combine this with expand=True and then just take the first result and I think you’ve got what you need.

df['POS'] = df['POS'].str.split(r'[-_]', expand=True, regex=True)[0]
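
As a self-contained sketch of the same idea, the snippet below runs the split on a hypothetical three-row frame built from POS values in the question. The intermediate name parts is only for illustration; it shows that expand=True yields one column per piece, with column 0 holding the text before the first delimiter.

```python
import pandas as pd

# Hypothetical sample using POS values from the question, already as strings.
df = pd.DataFrame({"POS": ["105121514", "70531965-70531965", "103527484_103527848"]})

# Split on either delimiter; expand=True spreads the pieces across columns.
# The regex keyword for str.split() requires pandas >= 1.4.
parts = df["POS"].str.split(r"[-_]", expand=True, regex=True)

# Column 0 is the part before the first delimiter (the whole value if none).
df["POS"] = parts[0]
print(df["POS"].tolist())  # ['105121514', '70531965', '103527484']
```

Passing n=1 as well would stop after the first delimiter, which should be enough here since only the first piece is kept.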
Answered By: scotscotmcc

This is not the most popular approach, but you can try it:

df.POS = df.POS.str.replace("-", " ")
df.POS = df.POS.str.replace("_", " ")
df.POS = df.POS.str.split()
df.POS = [x[0] for x in df.POS]
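
A runnable sketch of these steps on a hypothetical three-row frame follows. Two details differ from the lines above and are assumptions for illustration: regex=False is passed so the delimiters are treated as literal text on any pandas version, and .str[0] replaces the list comprehension, since it passes missing values through as NaN instead of raising.

```python
import pandas as pd

# Hypothetical sample using POS values from the question, already as strings.
df = pd.DataFrame({"POS": ["105121514", "70531965-70531965", "103527484_103527848"]})

# Turn both delimiters into spaces (literal replacement, no regex).
pos = df["POS"].str.replace("-", " ", regex=False).str.replace("_", " ", regex=False)

# Split on whitespace, then keep the first token of each row.
df["POS"] = pos.str.split().str[0]
print(df["POS"].tolist())  # ['105121514', '70531965', '103527484']
```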
Answered By: Mato