AWK else if to Python (Pandas)

Question:

I have the following data:

data = {'f_geno': ["AA", "AA", "AA", "BB", "BB", "BB", "AB", "AB", "AB"],
        'ch_geno': ["AA", "AB", "BB", "AA", "AB", "BB", "AA", "BB", "AB"],
        'freq_A': [0.50, 0.46, 0.49, 0.57, 0.55, 0.44, 0.37, 0.66, 0.46],
        'freq_B': [0.50, 0.54, 0.51, 0.43, 0.45, 0.56, 0.63, 0.34, 0.54]
        }

I already wrote a simple AWK calculator that computes a value for each row and writes it to field $5:

   awk 'BEGIN {
        FS = OFS = ","
    }
    {
        if ($1 == "AA" && $2 == "AA") {
            $5 = (1 / $3)
        } else if ($1 == "AA" && $2 == "AB") {
            $5 = (0.5 / $3)
        } else if ($1 == "AA" && $2 == "BB") {
            $5 = (0.001)
        } else if ($1 == "BB" && $2 == "AA") {
            $5 = (0.001)
        } else if ($1 == "BB" && $2 == "AB") {
            $5 = (0.5 / $4)
        } else if ($1 == "BB" && $2 == "BB") {
            $5 = (1 / $4) 
        } else if ($1 == "AB" && $2 == "AA") {
            $5 = (0.5 / $3)
        } else if ($1 == "AB" && $2 == "BB") {
            $5 = (0.5 / $4)  
        } else {
            $5 = (($3 + $4) / (4 * $3 * $4))
        }
        
        print 
    }'

I would like to do the same as above but in Python.
Can someone help, please?

Asked By: Milos

||

Answers:

You can write the rules as a function and apply it row by row with DataFrame.apply(..., axis=1):

import pandas as pd

def condition(x) -> float:
    # x is one row of the DataFrame; the branches mirror the AWK else-if chain.
    if x.f_geno == "AA" and x.ch_geno == "AA":
        return 1 / x.freq_A
    if (x.f_geno == "AA" and x.ch_geno == "AB") or (x.f_geno == "AB" and x.ch_geno == "AA"):
        return 0.5 / x.freq_A
    if (x.f_geno == "AA" and x.ch_geno == "BB") or (x.f_geno == "BB" and x.ch_geno == "AA"):
        return 0.001
    if (x.f_geno == "BB" and x.ch_geno == "AB") or (x.f_geno == "AB" and x.ch_geno == "BB"):
        return 0.5 / x.freq_B
    if x.f_geno == "BB" and x.ch_geno == "BB":
        return 1 / x.freq_B
    # Everything else (in this data only AB/AB) falls through to the default formula.
    return (x.freq_A + x.freq_B) / (4 * x.freq_A * x.freq_B)

df = pd.DataFrame(data=data)  # data is the dict from the question
df["result"] = df.apply(condition, axis=1)
print(df)

Output:

  f_geno ch_geno  freq_A  freq_B    result
0     AA      AA    0.50    0.50  2.000000
1     AA      AB    0.46    0.54  1.086957
2     AA      BB    0.49    0.51  0.001000
3     BB      AA    0.57    0.43  0.001000
4     BB      AB    0.55    0.45  1.111111
5     BB      BB    0.44    0.56  1.785714
6     AB      AA    0.37    0.63  1.351351
7     AB      BB    0.66    0.34  1.470588
8     AB      AB    0.46    0.54  1.006441
Answered By: Jason Baker

If performance matters, use numpy.select with boolean masks, combined with & (element-wise AND) and | (element-wise OR):

import numpy as np
import pandas as pd

df = pd.DataFrame(data=data)

faa = df.f_geno == "AA"
chaa = df.ch_geno == "AA"

fab = df.f_geno == "AB"
chab = df.ch_geno == "AB"

fbb = df.f_geno == "BB"
chbb = df.ch_geno == "BB"

masks = [(faa & chaa), 
         (faa & chab) | (fab & chaa),
         (faa & chbb) | (fbb & chaa),
         (fbb & chbb),
         (fbb & chab) | (fab & chbb)]

vals = [1 / df.freq_A,
        0.5 / df.freq_A,
        0.001,
        1 / df.freq_B,
        0.5 / df.freq_B]

default = (df.freq_A + df.freq_B) / (4 * df.freq_A * df.freq_B)
df["result"] = np.select(masks, vals, default=default)
print(df)

Output:

  f_geno ch_geno  freq_A  freq_B    result
0     AA      AA    0.50    0.50  2.000000
1     AA      AB    0.46    0.54  1.086957
2     AA      BB    0.49    0.51  0.001000
3     BB      AA    0.57    0.43  0.001000
4     BB      AB    0.55    0.45  1.111111
5     BB      BB    0.44    0.56  1.785714
6     AB      AA    0.37    0.63  1.351351
7     AB      BB    0.66    0.34  1.470588
8     AB      AB    0.46    0.54  1.006441
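
As a rough illustration of how the masks and numpy.select interact, here is a minimal sketch on a made-up three-row frame (the name toy and its values are assumptions for the example, not data from the question). It shows two things: inline comparisons need parentheses when combined with & or |, and np.select takes the value of the first condition that matches, falling back to default otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, only for illustration.
toy = pd.DataFrame({
    "f_geno": ["AA", "AB", "BB"],
    "ch_geno": ["AB", "AA", "BB"],
    "freq_A": [0.5, 0.25, 0.4],
})

# & and | bind tighter than ==, so each comparison is parenthesised.
is_aa_ab = (toy.f_geno == "AA") & (toy.ch_geno == "AB")
is_ab_aa = (toy.f_geno == "AB") & (toy.ch_geno == "AA")

# Rows matching either mask get 0.5 / freq_A; all other rows get the default.
result = np.select([is_aa_ab | is_ab_aa], [0.5 / toy.freq_A], default=-1.0)
print(result)

# When several conditions are true for a row, the first one in the list wins.
overlap = np.select([toy.freq_A > 0.3, toy.freq_A > 0.2], [10.0, 20.0], default=0.0)
print(overlap)
```

Because the first matching condition wins, the order of the masks plays the same role as the order of the else-if branches in the AWK script.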

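Performance with 90k rows:

# 90k rows: repeat the 9-row frame 10,000 times
df = pd.DataFrame(data=data)
df = pd.concat([df] * 10000, ignore_index=True)
# masks, vals and default must be rebuilt from this larger df before timing,
# otherwise they still have only 9 rows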

In [98]: %timeit df["result"] = df.apply(condition, axis=1)
5.96 s ± 585 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [99]: %timeit df["result"] = np.select(masks, vals, default=default)
1.59 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
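
The %timeit lines above only run inside IPython. For a plain Python script, a comparable measurement could be sketched with the standard timeit module, as below. The helper names row_rule and with_select are made up for this example, the frame is enlarged only 500 times to keep the run short, and the absolute timings will differ by machine. The sketch also checks that both approaches return the same values before timing them.

```python
import timeit

import numpy as np
import pandas as pd

data = {'f_geno': ["AA", "AA", "AA", "BB", "BB", "BB", "AB", "AB", "AB"],
        'ch_geno': ["AA", "AB", "BB", "AA", "AB", "BB", "AA", "BB", "AB"],
        'freq_A': [0.50, 0.46, 0.49, 0.57, 0.55, 0.44, 0.37, 0.66, 0.46],
        'freq_B': [0.50, 0.54, 0.51, 0.43, 0.45, 0.56, 0.63, 0.34, 0.54]}


def row_rule(x) -> float:
    """Row-wise rule following the AWK else-if chain (hypothetical helper)."""
    pair = (x.f_geno, x.ch_geno)
    if pair == ("AA", "AA"):
        return 1 / x.freq_A
    if pair in (("AA", "AB"), ("AB", "AA")):
        return 0.5 / x.freq_A
    if pair in (("AA", "BB"), ("BB", "AA")):
        return 0.001
    if pair in (("BB", "AB"), ("AB", "BB")):
        return 0.5 / x.freq_B
    if pair == ("BB", "BB"):
        return 1 / x.freq_B
    return (x.freq_A + x.freq_B) / (4 * x.freq_A * x.freq_B)


def with_select(df: pd.DataFrame) -> np.ndarray:
    """Vectorised version; masks are built from the frame passed in."""
    faa, fab, fbb = (df.f_geno == g for g in ("AA", "AB", "BB"))
    chaa, chab, chbb = (df.ch_geno == g for g in ("AA", "AB", "BB"))
    masks = [faa & chaa,
             (faa & chab) | (fab & chaa),
             (faa & chbb) | (fbb & chaa),
             fbb & chbb,
             (fbb & chab) | (fab & chbb)]
    vals = [1 / df.freq_A, 0.5 / df.freq_A, 0.001, 1 / df.freq_B, 0.5 / df.freq_B]
    default = (df.freq_A + df.freq_B) / (4 * df.freq_A * df.freq_B)
    return np.select(masks, vals, default=default)


# 4,500 rows: the 9-row frame repeated 500 times.
df = pd.concat([pd.DataFrame(data)] * 500, ignore_index=True)

apply_result = df.apply(row_rule, axis=1)
select_result = with_select(df)
assert np.allclose(apply_result, select_result)  # both approaches agree

t_apply = timeit.timeit(lambda: df.apply(row_rule, axis=1), number=1)
t_select = timeit.timeit(lambda: with_select(df), number=10) / 10
print(f"apply:  {t_apply:.4f} s per call")
print(f"select: {t_select:.4f} s per call")
```

Unlike the %timeit line above, the select timing here includes building the masks, which seems the fairer comparison when the frame changes between runs.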
Answered By: jezrael