How to compare two dataframes in pandas without a for loop?
Question:
I want to compare two dataframes and find pairs of rows that have the same sample, chr and family, where the pos value in the just_r dataframe falls between the just_f pos and just_f pos + 1000. My solution uses two nested loops over itertuples, which is inefficient: my data has thousands of rows and it takes a long time. Could someone help me find a smarter solution? Part of my input data, the expected output and my code are below. Thanks a lot!
just_f
sample chr pos strand family order support comment frequency
2 NC_025812.2 9831 . Tourist|7 Tourist F - 0,562
2 NC_025812.2 12038 . Tourist|7 Tourist F - 1,000
5 NC_025812.2 12040 . Tourist|7 Tourist F - 1,000
12 NC_025812.2 12042 . Tourist|7 Tourist F - 1,000
11 NC_025812.2 30758 . uc|32 uc F - 0,547
12 NC_025812.2 49544 . uc|10 uc F - 0,112
11 NC_025812.2 56184 . hAT|9 hAT F - 0,997
5 NC_025812.2 56246 . hAT|9 hAT F - 0,756
3 NC_025812.2 56265 . hAT|9 hAT F - 1,000
12 NC_025812.2 56268 . hAT|9 hAT F - 1,000
just_r
5 NC_025812.2 12396 . Tourist|7 Tourist R - 0,975
2 NC_025812.2 12433 . Tourist|7 Tourist R - 0,935
12 NC_025812.2 12478 . Tourist|7 Tourist R - 0,887
12 NC_025812.2 28943 . Tourist|7 Tourist R - 0,610
5 NC_025812.2 28947 . Tourist|7 Tourist R - 0,490
2 NC_025812.2 51483 . Mutator|24 Mutator R - 0,422
5 NC_025812.2 56713 . hAT|9 hAT R - 0,925
11 NC_025812.2 56737 . hAT|9 hAT R - 1,000
3 NC_025812.2 56778 . hAT|9 hAT R - 0,891
12 NC_025812.2 56800 . hAT|9 hAT R - 0,965
f_r_pairs
sample chr pos strand family order support comment frequency
2 NC_025812.2 12038 . Tourist|7 Tourist F - 1.0
2 NC_025812.2 12433 . Tourist|7 Tourist R - 0.935
5 NC_025812.2 12040 . Tourist|7 Tourist F - 1.0
5 NC_025812.2 12396 . Tourist|7 Tourist R - 0.975
12 NC_025812.2 12042 . Tourist|7 Tourist F - 1.0
12 NC_025812.2 12478 . Tourist|7 Tourist R - 0.887
11 NC_025812.2 56184 . hAT|9 hAT F - 0.997
11 NC_025812.2 56737 . hAT|9 hAT R - 1.0
5 NC_025812.2 56246 . hAT|9 hAT F - 0.756
5 NC_025812.2 56713 . hAT|9 hAT R - 0.925
3 NC_025812.2 56265 . hAT|9 hAT F - 1.0
3 NC_025812.2 56778 . hAT|9 hAT R - 0.891
12 NC_025812.2 56268 . hAT|9 hAT F - 1.0
12 NC_025812.2 56800 . hAT|9 hAT R - 0.965
import pandas as pd

df_raw = pd.read_csv('1-DH-to-12-RO.NC_teinsertions.txt', sep="\t", decimal=',')
df_sort = df_raw.sort_values(by=['chr', 'pos', 'sample'])
just_f = df_sort[(df_sort["support"] == 'F')]
just_r = df_sort[(df_sort["support"] == 'R')]
f_r_pairs = pd.DataFrame(columns=just_f.columns)
# choosing rows for reference TE insertions (having pairs with F and R in range 1000 bp)
for f in just_f.itertuples():
    for r in just_r.itertuples():
        if f.sample == r.sample and f.chr == r.chr and f.family == r.family and r.pos in range(f.pos, f.pos + 1000):
            f_r_pairs = f_r_pairs.append(pd.DataFrame([f]))
            f_r_pairs = f_r_pairs.append(pd.DataFrame([r]))
Answers:
You can join the two dataframes on the matching keys, then filter for the rows that satisfy the pos condition.
There are two functions you can use: join and merge. merge is the more flexible one:
f_r_pairs = (
    just_f.merge(just_r, on=["sample", "chr", "family"], suffixes=("_f", "_r"))
    .query("pos_f <= pos_r <= pos_f + 1000")
)
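The merge above returns one wide row per F/R pair (pos_f, pos_r, and so on), whereas the expected output in the question lists the F row and its R row on consecutive lines. The following is only a rough sketch of one way to get that layout, not necessarily what the answerer had in mind. The helper name pair_rows, the KEYS list, the window parameter and the temporary pair column are all made up for illustration, and the data is a trimmed subset of the sample in the question with some columns left out. It uses a half-open window (pos_f <= pos_r < pos_f + 1000) to match the range() call in the question; the query in the answer also accepts pos_r equal to pos_f + 1000.

```python
import pandas as pd

# Trimmed subset of the sample data from the question (some columns omitted).
cols = ["sample", "chr", "pos", "family", "support", "frequency"]
just_f = pd.DataFrame(
    [
        (2, "NC_025812.2", 9831, "Tourist|7", "F", 0.562),
        (2, "NC_025812.2", 12038, "Tourist|7", "F", 1.000),
        (5, "NC_025812.2", 12040, "Tourist|7", "F", 1.000),
        (11, "NC_025812.2", 30758, "uc|32", "F", 0.547),
        (11, "NC_025812.2", 56184, "hAT|9", "F", 0.997),
    ],
    columns=cols,
)
just_r = pd.DataFrame(
    [
        (5, "NC_025812.2", 12396, "Tourist|7", "R", 0.975),
        (2, "NC_025812.2", 12433, "Tourist|7", "R", 0.935),
        (5, "NC_025812.2", 28947, "Tourist|7", "R", 0.490),
        (2, "NC_025812.2", 51483, "Mutator|24", "R", 0.422),
        (11, "NC_025812.2", 56737, "hAT|9", "R", 1.000),
    ],
    columns=cols,
)

KEYS = ["sample", "chr", "family"]  # columns that must match exactly


def pair_rows(just_f, just_r, window=1000):
    """Hypothetical helper: return matching F and R rows interleaved."""
    f = just_f.reset_index(drop=True)
    r = just_r.reset_index(drop=True)
    # reset_index() exposes each row's position as a column named "index";
    # the suffixes turn the two copies into index_f and index_r.
    merged = f.reset_index().merge(r.reset_index(), on=KEYS, suffixes=("_f", "_r"))
    in_window = (merged["pos_r"] >= merged["pos_f"]) & (
        merged["pos_r"] < merged["pos_f"] + window
    )
    hits = merged[in_window].sort_values(["index_f", "index_r"])
    # Pull the original rows back out and tag both members of a pair
    # with the same number so they can be sorted next to each other.
    pair_id = range(len(hits))
    f_rows = f.loc[hits["index_f"]].assign(pair=pair_id)
    r_rows = r.loc[hits["index_r"]].assign(pair=pair_id)
    return (
        pd.concat([f_rows, r_rows])
        .sort_values(["pair", "support"])  # "F" sorts before "R"
        .drop(columns="pair")
        .reset_index(drop=True)
    )


f_r_pairs = pair_rows(just_f, just_r)
print(f_r_pairs)
```

Note that the merge builds every combination of rows sharing the same sample, chr and family before the position filter is applied, so the intermediate frame can be much larger than either input when many rows share the same keys.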