pandas: merge strings when other columns satisfy a condition

Question:

I have a table:

genome                  start  end  strand  etc
GUT_GENOME270877.fasta  98     396  +
GUT_GENOME270877.fasta  384    574  -
GUT_GENOME270877.fasta  593    984  +
GUT_GENOME270877.fasta  991    999  -

I’d like to make a new table with a coordinates column that joins the start and end columns, like this:

genome                  start  end  strand  etc  coordinates
GUT_GENOME270877.fasta  98     396  +            98..396
GUT_GENOME270877.fasta  384    574  -            complement(384..574)
GUT_GENOME270877.fasta  593    984  +            593..984
GUT_GENOME270877.fasta  991    999  -            complement(991..999)

If there’s a - in the strand column, I’d like to do not just

df['coordinates'] = df['start'].astype(str) + '..' + df['end'].astype(str)

but to add brackets and complement, like this:

df['coordinates'] = 'complement(' + df['start'].astype(str) + '..' + df['end'].astype(str) + ')'

The only thing I’m missing is how to introduce the condition.

Asked By: plnnvkv

||

Answers:

You can use numpy.where:

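import numpy as np

# Boolean mask: True for rows on the minus strand
m = df['strand'].eq('-')

# Add the 'complement(' prefix and ')' suffix only where the mask is True
df['coordinates'] = (np.where(m, 'complement(', '')
                     + df['start'].astype(str) + '..' + df['end'].astype(str)
                     + np.where(m, ')', '')
                     )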
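To try the numpy.where idea on the sample from the question, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand (an assumption, since the question only shows the table as text), and it uses a single np.where call that picks between the two finished strings, which is a small variation on the code above rather than the same expression.

```python
import numpy as np
import pandas as pd

# Sample rows rebuilt by hand from the question (assumed construction).
df = pd.DataFrame({
    "genome": ["GUT_GENOME270877.fasta"] * 4,
    "start": [98, 384, 593, 991],
    "end": [396, 574, 984, 999],
    "strand": ["+", "-", "+", "-"],
})

# True for rows on the minus strand.
m = df["strand"].eq("-")

# Plain "start..end" text for every row.
span = df["start"].astype(str) + ".." + df["end"].astype(str)

# Pick the wrapped string where the mask is True, the plain one otherwise.
df["coordinates"] = np.where(m, "complement(" + span + ")", span)

print(df["coordinates"].tolist())
```

Building both candidate strings first costs a little extra work on rows that do not need the wrapper, but keeps the condition in one place.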

Or boolean indexing:

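# Boolean mask: True for rows on the minus strand
m = df['strand'].eq('-')

# Build the plain 'start..end' string for every row first
df['coordinates'] = df['start'].astype(str) + '..' + df['end'].astype(str)

# Then wrap only the masked rows in 'complement(...)'
df.loc[m, 'coordinates'] = 'complement(' + df.loc[m, 'coordinates'] + ')'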
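The boolean-indexing variant can also be wrapped in a small function so the input frame is left untouched. This is only a sketch: the helper name add_coordinates and the hand-built sample DataFrame are hypothetical and not part of the original answer.

```python
import pandas as pd


def add_coordinates(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a 'coordinates' column (hypothetical helper)."""
    out = df.copy()
    # Plain "start..end" text for every row.
    out["coordinates"] = out["start"].astype(str) + ".." + out["end"].astype(str)
    # Rewrite only the minus-strand rows; .loc aligns on the index.
    m = out["strand"].eq("-")
    out.loc[m, "coordinates"] = "complement(" + out.loc[m, "coordinates"] + ")"
    return out


# Sample rows rebuilt by hand from the question (assumed construction).
df = pd.DataFrame({
    "genome": ["GUT_GENOME270877.fasta"] * 4,
    "start": [98, 384, 593, 991],
    "end": [396, 574, 984, 999],
    "strand": ["+", "-", "+", "-"],
})

result = add_coordinates(df)
print(result)
```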

Output:

                   genome  start  end strand           coordinates
0  GUT_GENOME270877.fasta     98  396      +               98..396
1  GUT_GENOME270877.fasta    384  574      -  complement(384..574)
2  GUT_GENOME270877.fasta    593  984      +              593..984
3  GUT_GENOME270877.fasta    991  999      -  complement(991..999)
Answered By: mozway