How to transpose unique entries in a column and print values from another column

Question:

I have a file, input.txt, with two columns. I want to split the second column on ";", transpose the unique entries, and then list (and count) the entries from column 1 that match each one.

This is my tab-delimited input.txt file:

Gene     Biological_Process
BALF2   metabolic process
CHD4    cell organization and biogenesis;metabolic process;regulation of biological process
TCOF1   cell organization and biogenesis;regulation of biological process;transport
TOP1    cell death;cell division;cell organization and biogenesis;metabolic process;regulation of biological process;response to stimulus
BcLF1   0
BALF5   metabolic process
MTA2    cell organization and biogenesis;metabolic process;regulation of biological process
MSH6    cell organization and biogenesis;metabolic process;regulation of biological process;response to stimulus

My expected output:

Biological_Process  Gene
metabolic process   BALF2   CHD4    TOP1    BALF5   MTA2    MSH6
cell organization and biogenesis    CHD4    TCOF1   TOP1    MTA2    MSH6
regulation of biological process    CHD4    TCOF1   TOP1    MTA2    MSH6
transport   TCOF1
cell death  TOP1
cell division   TOP1
response to stimulus    TOP1    MSH6
Asked By: Ibk

Answers:

You’ll need to parse all the data first. Start with an empty dictionary, open your file, and iterate over each line (skip line 0 if it’s a header). Split each line into columns using string methods like split and strip. Then, for every entry in the columns after the first, create a dictionary key for that string and append the string from column 0 to that key's list, i.e. pros[process] += [gene]. Finally, print or write out each pair from the dict's .items():

input.txt

gene process
A cell org bio
B cell bio
C 0
D org

script.py

#!/usr/bin/env python

def main():

    pros = {}

    with open("input.txt", "r") as ifile:
        for line in ifile:
            cols = line.strip().split()
            if len(cols) >= 1:
                for pro in cols[1:]:
                    if pro not in pros:
                        pros[pro] = []
                    pros[pro] += [cols[0]]

    with open("output.txt", "w") as ofile:
        for key,val in pros.items():
            ofile.write(f'{key}\t' + '\t'.join(val) + '\n')

if __name__ == "__main__":
    main()

run

$ chmod +x ./script.py
$ ./script.py
$ cat ./output.txt

output.txt

process gene
cell    A       B
org     A       D
bio     A       B
0       C
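
The simplified script above splits on any whitespace, so it would break the multi-word process names in the question's file apart. A minimal sketch of the same dictionary approach for that format follows, assuming tab-separated columns, ";" between processes, one header row, and that "0" means "no process" (the expected output omits BcLF1). The names transpose and skip_values are illustrative and not from the answer.

```python
import io


def transpose(lines, skip_values=("0", "")):
    """Map each ';'-separated process in column 2 to the genes in column 1."""
    genes_by_process = {}  # dicts keep insertion order (Python 3.7+)
    for lineno, line in enumerate(lines):
        if lineno == 0:
            continue  # assumed header row
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue  # blank or malformed line
        gene = cols[0].strip()
        for process in cols[1].split(";"):
            process = process.strip()
            if process in skip_values:
                continue  # assumption: "0" marks a gene with no process
            genes = genes_by_process.setdefault(process, [])
            if gene not in genes:
                genes.append(gene)  # list each gene once per process
    return genes_by_process


if __name__ == "__main__":
    # Small in-memory sample in the question's format; pass an open file
    # object instead to read a real input.txt.
    sample = io.StringIO(
        "Gene\tBiological_Process\n"
        "BALF2\tmetabolic process\n"
        "CHD4\tcell organization and biogenesis;metabolic process\n"
        "BcLF1\t0\n"
    )
    print("Biological_Process\tCount\tGene")
    for process, genes in transpose(sample).items():
        print(process, len(genes), *genes, sep="\t")
```

The Count column is there because the question also asks how many matches each process has. Drop len(genes) from the print call to get only the transposed list.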
Answered By: MrMattBusby
$ cat script.awk 
#! /usr/bin/awk -f 

BEGIN {
    FS = "[\t;]";  # sep can be a regex
    OFS = "\t"
}

NR>1 && /^[A-Z]/{  # skip header & blank lines 
    for(i=NF; i>1; i--)
        if($i)   # skip empty bio-proc
           a[$i] = a[$i] OFS $1 
}
END{
    print "Biological_Process","Gene(s)"
    for(x in a)
        print x a[x] 
}

$ ./script.awk input.dat 
Biological_Process  Gene(s)
cell death  TOP1
regulation of biological process    CHD4    TCOF1   TOP1    MTA2    MSH6
transport   TCOF1
cell division   TOP1
metabolic process   BALF2   CHD4    TOP1    BALF5   MTA2    MSH6
response to stimulus    TOP1    MSH6
cell organization and biogenesis    CHD4    TCOF1   TOP1    MTA2    MSH6
Answered By: tomc