How to compare two dataframes in pandas without a for loop?

Question:

I want to compare two dataframes and find pairs of rows that share the same sample, chr and family, where the pos value in the just_r dataframe falls between the just_f pos and just_f pos + 1000. My solution uses two nested loops over itertuples, which is inefficient (my data has thousands of rows and it takes a long time). Could someone help me find a smarter solution? Part of my input data, the expected output and my code are below. Thanks a lot!

just_f

sample  chr pos strand  family  order   support comment frequency
2   NC_025812.2 9831    .   Tourist|7   Tourist F   -   0,562
2   NC_025812.2 12038   .   Tourist|7   Tourist F   -   1,000
5   NC_025812.2 12040   .   Tourist|7   Tourist F   -   1,000
12  NC_025812.2 12042   .   Tourist|7   Tourist F   -   1,000
11  NC_025812.2 30758   .   uc|32   uc  F   -   0,547
12  NC_025812.2 49544   .   uc|10   uc  F   -   0,112
11  NC_025812.2 56184   .   hAT|9   hAT F   -   0,997
5   NC_025812.2 56246   .   hAT|9   hAT F   -   0,756
3   NC_025812.2 56265   .   hAT|9   hAT F   -   1,000
12  NC_025812.2 56268   .   hAT|9   hAT F   -   1,000

just_r

sample  chr pos strand  family  order   support comment frequency
5   NC_025812.2 12396   .   Tourist|7   Tourist R   -   0,975
2   NC_025812.2 12433   .   Tourist|7   Tourist R   -   0,935
12  NC_025812.2 12478   .   Tourist|7   Tourist R   -   0,887
12  NC_025812.2 28943   .   Tourist|7   Tourist R   -   0,610
5   NC_025812.2 28947   .   Tourist|7   Tourist R   -   0,490
2   NC_025812.2 51483   .   Mutator|24  Mutator R   -   0,422
5   NC_025812.2 56713   .   hAT|9   hAT R   -   0,925
11  NC_025812.2 56737   .   hAT|9   hAT R   -   1,000
3   NC_025812.2 56778   .   hAT|9   hAT R   -   0,891
12  NC_025812.2 56800   .   hAT|9   hAT R   -   0,965

f_r_pairs

sample  chr pos strand  family  order   support comment frequency
2   NC_025812.2 12038   .   Tourist|7   Tourist F   -   1.0
2   NC_025812.2 12433   .   Tourist|7   Tourist R   -   0.935
5   NC_025812.2 12040   .   Tourist|7   Tourist F   -   1.0
5   NC_025812.2 12396   .   Tourist|7   Tourist R   -   0.975
12  NC_025812.2 12042   .   Tourist|7   Tourist F   -   1.0
12  NC_025812.2 12478   .   Tourist|7   Tourist R   -   0.887
11  NC_025812.2 56184   .   hAT|9   hAT F   -   0.997
11  NC_025812.2 56737   .   hAT|9   hAT R   -   1.0
5   NC_025812.2 56246   .   hAT|9   hAT F   -   0.756
5   NC_025812.2 56713   .   hAT|9   hAT R   -   0.925
3   NC_025812.2 56265   .   hAT|9   hAT F   -   1.0
3   NC_025812.2 56778   .   hAT|9   hAT R   -   0.891
12  NC_025812.2 56268   .   hAT|9   hAT F   -   1.0
12  NC_025812.2 56800   .   hAT|9   hAT R   -   0.965

import pandas as pd

df_raw = pd.read_csv('1-DH-to-12-RO.NC_teinsertions.txt', sep="\t", decimal=',')
df_sort = df_raw.sort_values(by=['chr', 'pos', 'sample'])

just_f = df_sort[(df_sort["support"] == 'F')]
just_r = df_sort[(df_sort["support"] == 'R')]

f_r_pairs = pd.DataFrame(columns=just_f.columns)

# choosing rows for reference TE insertions (having pairs with F and R in range 1000 bp)
for f in just_f.itertuples():
    for r in just_r.itertuples():
        if f.sample == r.sample and f.chr == r.chr and f.family == r.family and r.pos in range(f.pos, f.pos + 1000):
            f_r_pairs = f_r_pairs.append(pd.DataFrame([f]))
            f_r_pairs = f_r_pairs.append(pd.DataFrame([r]))

Asked By: emor

Answers:

You can join the two dataframes based on the matching keys, then filter for the rows that satisfy the pos condition.

Two functions can do this: join and merge. merge is the more flexible of the two:

f_r_pairs = (
    just_f.merge(just_r, on=["sample", "chr", "family"], suffixes=("_f", "_r"))
    .query("pos_f <= pos_r <= pos_f + 1000")
)
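
The merge above returns one wide row per matched pair (pos_f, pos_r, and so on), whereas the expected output in the question lists the F row and the R row on alternating lines. A minimal sketch of one way to reshape it follows, under some assumptions: the toy frames keep only a few of the question's columns, the names keys, matched and side are illustrative, and the upper bound uses < rather than <= to mirror the range(f.pos, f.pos + 1000) test in the question.

```python
import pandas as pd

# Toy subsets of the question's data (only some columns, for brevity)
cols = ["sample", "chr", "pos", "family", "support"]
just_f = pd.DataFrame(
    [
        (2, "NC_025812.2", 9831, "Tourist|7", "F"),
        (2, "NC_025812.2", 12038, "Tourist|7", "F"),
        (5, "NC_025812.2", 12040, "Tourist|7", "F"),
        (12, "NC_025812.2", 49544, "uc|10", "F"),
    ],
    columns=cols,
)
just_r = pd.DataFrame(
    [
        (5, "NC_025812.2", 12396, "Tourist|7", "R"),
        (2, "NC_025812.2", 12433, "Tourist|7", "R"),
        (5, "NC_025812.2", 28947, "Tourist|7", "R"),
        (2, "NC_025812.2", 51483, "Mutator|24", "R"),
    ],
    columns=cols,
)

keys = ["sample", "chr", "family"]

# Pair every F row with every R row sharing the keys, then keep close pairs
matched = (
    just_f.merge(just_r, on=keys, suffixes=("_f", "_r"))
    .query("pos_f <= pos_r < pos_f + 1000")
    .reset_index(drop=True)
)


def side(suffix):
    """Return the key columns plus one side's columns, with the suffix removed."""
    picked = keys + [c for c in matched.columns if c.endswith(suffix)]
    return matched[picked].rename(
        columns=lambda c: c[: -len(suffix)] if c.endswith(suffix) else c
    )


# Stack the F and R halves; a stable sort on the shared pair index
# places each F row directly above its R partner
f_r_pairs = (
    pd.concat([side("_f"), side("_r")])
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
print(f_r_pairs)
```

Note that the merge first builds every F/R combination within a (sample, chr, family) group before filtering, so memory use can grow quickly if a single group holds many rows.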
Answered By: Code Different