Position frequency matrix for Pandas column with strings

Question

I have a pandas DataFrame with a column of peptide sequences, and I want to know how many times each amino acid appears at each position. I have written the following code to create the position frequency matrix:

import pandas as pd
from itertools import chain

def frequency_matrix(df):
    # Empty position frequency matrix
    freq_matrix_df = pd.DataFrame(
        columns=sorted(set(chain.from_iterable(df.peptide_alpha))),
        index=range(df.peptide_len.max()),
    ).fillna(0)

    for _, row in df.iterrows():
        for idx, aa in enumerate(row["peptide_alpha"]):
            freq_matrix_df.loc[idx, aa] += 1

    return freq_matrix_df

which for the following sample DataFrame:

mini_df = pd.DataFrame(["YTEGDALDALGLKRY", 
                        "LTEIYGERLYETSY",
                        "PVEEFNELLSKY", 
                        "TVDIQNPDITSSRY", 
                        "ASDKETYELRY"], 
                       columns=["peptide_alpha"])
mini_df["peptide_len"] = mini_df["peptide_alpha"].str.len()

	peptide_alpha	peptide_len
0	YTEGDALDALGLKRY	15
1	LTEIYGERLYETSY	14
2	PVEEFNELLSKY	12
3	TVDIQNPDITSSRY	14
4	ASDKETYELRY	11

gives the following output:

	A	D	E	F	G	I	K	L	N	P	Q	R	S	T	V	Y
0	1	0	0	0	0	0	0	1	0	1	0	0	0	1	0	1
1	0	0	0	0	0	0	0	0	0	0	0	0	1	2	2	0
2	0	2	3	0	0	0	0	0	0	0	0	0	0	0	0	0
3	0	0	1	0	1	2	1	0	0	0	0	0	0	0	0	0
4	0	1	1	1	0	0	0	0	0	0	1	0	0	0	0	1
5	1	0	0	0	1	0	0	0	2	0	0	0	0	1	0	0
6	0	0	2	0	0	0	0	1	0	1	0	0	0	0	0	1
7	0	2	1	0	0	0	0	1	0	0	0	1	0	0	0	0
8	1	0	0	0	0	1	0	3	0	0	0	0	0	0	0	0
9	0	0	0	0	0	0	0	1	0	0	0	1	1	1	0	1
10	0	0	1	0	1	0	1	0	0	0	0	0	1	0	0	1
11	0	0	0	0	0	0	0	1	0	0	0	0	1	1	0	1
12	0	0	0	0	0	0	1	0	0	0	0	1	1	0	0	0
13	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	2
14	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1

This works for small DataFrames, but the nested for loop makes it too slow for larger datasets. Is there a faster, vectorized way to write this?

Asked By: BioGeek

Answer 1

Solution

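# Replace each length n with the positions 0..n-1, and each peptide with a list of its residues
mini_df['peptide_len'] = mini_df.peptide_len.map(lambda x: range(x))
mini_df['peptide_alpha'] = mini_df.peptide_alpha.map(list)
# Explode both columns together: one row per (position, residue) pair (multi-column explode needs pandas >= 1.3)
mini_df = mini_df.explode(["peptide_alpha", "peptide_len"])

# Count how often each residue occurs at each position
pd.crosstab(mini_df.peptide_len, mini_df.peptide_alpha)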
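The snippet above overwrites mini_df in place and relies on the precomputed peptide_len column. If that is inconvenient, the same explode-then-crosstab idea can be wrapped in a function that leaves its input alone. This is only a sketch: the function name position_frequency_matrix and the labels "position" and "amino_acid" are made up for illustration, and the positions are derived with groupby(...).cumcount() rather than from peptide_len.

```python
import pandas as pd


def position_frequency_matrix(peptides: pd.Series) -> pd.DataFrame:
    """Count residues per position without modifying the caller's data."""
    # A fresh 0..n-1 index gives every peptide a unique label, even if the
    # input index contains duplicates (for example after pd.concat).
    residues = peptides.reset_index(drop=True).map(list).explode()
    # After explode, all residues of one peptide share that peptide's label,
    # so a running count within each label is the 0-based position.
    positions = residues.groupby(level=0).cumcount()
    # Hypothetical column names, chosen only for readable output labels.
    long_form = pd.DataFrame(
        {"position": positions.to_numpy(), "amino_acid": residues.to_numpy()}
    )
    return pd.crosstab(long_form["position"], long_form["amino_acid"])


peptides = pd.Series(
    ["YTEGDALDALGLKRY", "LTEIYGERLYETSY", "PVEEFNELLSKY",
     "TVDIQNPDITSSRY", "ASDKETYELRY"]
)
pfm = position_frequency_matrix(peptides)
print(pfm)
```

Building the long table from plain arrays gives it a simple unique index, so the crosstab call does not depend on how pandas aligns Series that carry duplicate index labels.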

Performance

With the sample DataFrame replicated 10,000 times:

mini_df = pd.concat([mini_df] * 10000)

On my machine, my solution finishes within 0.5s, whereas the OP's solution takes 1m8.6s, so it should be useful for larger datasets.
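To reproduce a comparison like this, something along the following lines may help. It is a rough sketch, not the exact benchmark behind the numbers above: loop_matrix and crosstab_matrix are illustrative names, the loop version restates the question's approach with an explicit zero fill, the crosstab version works on a copy and uses ignore_index=True, and the replication factor is reduced to 100 so the loop finishes quickly. Timings will differ by machine and pandas version.

```python
import time
from itertools import chain

import pandas as pd


def loop_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """The question's row-by-row approach, restated with an explicit zero fill."""
    alphabet = sorted(set(chain.from_iterable(df["peptide_alpha"])))
    counts = pd.DataFrame(0, index=range(df["peptide_len"].max()), columns=alphabet)
    for peptide in df["peptide_alpha"]:
        for idx, aa in enumerate(peptide):
            counts.loc[idx, aa] += 1
    return counts


def crosstab_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """The explode + crosstab approach, applied to a re-indexed copy of the input."""
    out = df.reset_index(drop=True)
    out["peptide_len"] = out["peptide_len"].map(lambda n: list(range(n)))
    out["peptide_alpha"] = out["peptide_alpha"].map(list)
    # ignore_index=True gives the long table a unique index (needs pandas >= 1.3).
    out = out.explode(["peptide_alpha", "peptide_len"], ignore_index=True)
    return pd.crosstab(out["peptide_len"], out["peptide_alpha"])


mini_df = pd.DataFrame(
    {"peptide_alpha": ["YTEGDALDALGLKRY", "LTEIYGERLYETSY", "PVEEFNELLSKY",
                       "TVDIQNPDITSSRY", "ASDKETYELRY"]}
)
mini_df["peptide_len"] = mini_df["peptide_alpha"].str.len()
big_df = pd.concat([mini_df] * 100)  # assumption: far fewer copies than the 10000 above

start = time.perf_counter()
slow = loop_matrix(big_df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = crosstab_matrix(big_df)
crosstab_seconds = time.perf_counter() - start

# Check that both approaches agree before trusting the timings.
same_counts = bool((slow.to_numpy() == fast.to_numpy()).all())
print(f"same counts: {same_counts}")
print(f"loop: {loop_seconds:.3f}s, crosstab: {crosstab_seconds:.3f}s")
```

Comparing the two results first guards against timing a fast version that quietly counts something different.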

Output

peptide_alpha  A  D  E  F  G  I  K  L  N  P  Q  R  S  T  V  Y
peptide_len                                                  
0              1  0  0  0  0  0  0  1  0  1  0  0  0  1  0  1
1              0  0  0  0  0  0  0  0  0  0  0  0  1  2  2  0
2              0  2  3  0  0  0  0  0  0  0  0  0  0  0  0  0
3              0  0  1  0  1  2  1  0  0  0  0  0  0  0  0  0
4              0  1  1  1  0  0  0  0  0  0  1  0  0  0  0  1
5              1  0  0  0  1  0  0  0  2  0  0  0  0  1  0  0
6              0  0  2  0  0  0  0  1  0  1  0  0  0  0  0  1
7              0  2  1  0  0  0  0  1  0  0  0  1  0  0  0  0
8              1  0  0  0  0  1  0  3  0  0  0  0  0  0  0  0
9              0  0  0  0  0  0  0  1  0  0  0  1  1  1  0  1
10             0  0  1  0  1  0  1  0  0  0  0  0  1  0  0  1
11             0  0  0  0  0  0  0  1  0  0  0  0  1  1  0  1
12             0  0  0  0  0  0  1  0  0  0  0  1  1  0  0  0
13             0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  2
14             0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1

Answered By: PaulS
